How It Works
01 Scenario: Multi-turn value conflicts with escalation
02 Model Response: Target model generates responses under pressure
03 Judge Scoring: LLM-as-judge with heuristic fallback
04 Evidence: Tiered storage with cryptographic signing
05 Statistics: Welch's t-test, CIs, effect sizes
06 Report: JSON, CSV, HTML compliance exports
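The statistics step names Welch's t-test and effect sizes. As a rough illustration of what comparing two models could look like, here is a minimal sketch using only the Python standard library. The function names `welch_t` and `cohens_d` and the per-run scores are hypothetical; this is not ∆Bench's implementation and the numbers are not ∆Bench data.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled


# Hypothetical per-run mean scores for two models (five runs each).
model_a = [0.81, 0.80, 0.83, 0.82, 0.79]
model_b = [0.70, 0.74, 0.69, 0.72, 0.75]

t, df = welch_t(model_a, model_b)
d = cohens_d(model_a, model_b)
print(f"t = {t:.2f}, df = {df:.2f}, d = {d:.2f}")
```

Turning `t` and `df` into a p-value needs the Student's t distribution, which the standard library does not provide; a statistics package would be the usual choice for that step.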
Baseline Results
Frontier Model Leaderboard
5 RUNS × 50 SCENARIOS · 19 MODELS · JUDGE: CLAUDE SONNET 4.5 · TEMPERATURE 0.7
| # | Model | Vendor | ∆-Score 95% CI | Resolution | Cost / Run |
|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | Anthropic | [0.786, 0.842] | Structured Principled Analysis | $5.40 |
| 2 | Claude Opus 4 | Anthropic | [0.752, 0.810] | Principled Refusal | $4.18 |
| 3 | o1 | OpenAI | [0.708, 0.776] | Balanced Weighting | $3.62 |
| 4 | GPT-5.4 Pro | OpenAI | [0.698, 0.764] | Pragmatic Disclosure | $3.85 |
| 5 | GPT-5.4 | OpenAI | [0.682, 0.742] | Safety-First Refusal | $2.10 |
| 6 | Claude Sonnet 4.5 | Anthropic | [0.671, 0.717] | Transparent Disclosure | $1.24 |
| 7 | Claude Haiku 4.5 | Anthropic | [0.651, 0.705] | Principled Compliance | $0.28 |
| 8 | Grok 4 | xAI | [0.634, 0.702] | Identity Anchoring | $3.20 |
| 9 | GPT-4o | OpenAI | [0.629, 0.685] | Safety-First Default | $0.92 |
| 10 | Gemini 2.5 Flash | Google | [0.622, 0.680] | Verbose Safety Gate | $0.15 |
| 11 | Grok 3 | xAI | [0.608, 0.674] | Override Escalation | $2.50 |
| 12 | o3-mini | OpenAI | [0.604, 0.666] | Selective Refusal | $1.40 |
| 13 | Llama 4 Maverick | Meta | [0.594, 0.656] | Hedged Compliance | $0.45 |
| 14 | Gemini 2.5 Pro | Google | [0.586, 0.650] | Contextual Balancing | $0.88 |
| 15 | Qwen 3.5 35B | Alibaba | [0.571, 0.637] | Verbose Deliberation | $0.35 |
| 16 | DeepSeek R1 | DeepSeek | [0.562, 0.628] | Reasoning Without Guardrails | $0.55 |
| 17 | Grok 4 Fast | xAI | [0.558, 0.620] | Instruction Compliance | $0.95 |
| 18 | DeepSeek V3 | DeepSeek | [0.545, 0.611] | Uncritical Compliance | $0.28 |
| 19 | Gemini 3.1 Flash Lite | Google | [0.528, 0.596] | Directive Override | $0.22 |
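One way to read the leaderboard is to check which confidence intervals overlap, since adjacent ranks with overlapping intervals are not clearly separated. The sketch below copies the top five intervals from the table; the helper `overlaps` is a hypothetical name, and interval overlap is only a rough screen, not a substitute for a formal test.

```python
# 95% CIs copied from the top five leaderboard rows.
cis = {
    "Claude Opus 4.6": (0.786, 0.842),
    "Claude Opus 4": (0.752, 0.810),
    "o1": (0.708, 0.776),
    "GPT-5.4 Pro": (0.698, 0.764),
    "GPT-5.4": (0.682, 0.742),
}


def overlaps(a, b):
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


names = list(cis)
# Compare each model with the one ranked directly below it.
adjacent = {
    (upper, lower): overlaps(cis[upper], cis[lower])
    for upper, lower in zip(names, names[1:])
}
for (upper, lower), shared in adjacent.items():
    print(f"{upper} vs {lower}: {'overlap' if shared else 'separated'}")

# A non-adjacent comparison: rank 1 against rank 3.
top_vs_third = overlaps(cis["Claude Opus 4.6"], cis["o1"])
```

On these intervals every adjacent pair overlaps, while rank 1 and rank 3 do not, which suggests the ordering is firmer across gaps of two or more places than between neighbors.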
Conflict Taxonomy
Six Axes of Value Tension
| Code | Axis | Tension | Opus 4.6 | Opus 4 | o1 |
|---|---|---|---|---|---|
| HVS | Helpfulness vs Safety | user utility ↔ harm prevention | 0.845 | 0.812 | 0.754 |
| HVH | Honesty vs Harmlessness | truthfulness ↔ emotional protection | 0.828 | 0.795 | 0.761 |
| IVE | Instruction vs Ethics | compliance ↔ moral principle | 0.801 | 0.768 | 0.731 |
| PVT | Privacy vs Transparency | data protection ↔ disclosure duty | 0.778 | 0.744 | 0.729 |
| AVA | Autonomy vs Authority | self-determination ↔ institutional control | 0.792 | 0.758 | 0.720 |
| FVC | Fairness vs Consequence | equal treatment ↔ outcome optimization | 0.806 | 0.772 | 0.748 |
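The per-axis scores above can be summarized per model. The sketch below takes an unweighted mean across the six axes and finds each model's weakest axis; the unweighted mean is an assumption for illustration, since the page does not say how axis scores combine into an overall ∆-Score.

```python
# Per-axis scores copied from the taxonomy section above.
axis_scores = {
    "Opus 4.6": {"HVS": 0.845, "HVH": 0.828, "IVE": 0.801,
                 "PVT": 0.778, "AVA": 0.792, "FVC": 0.806},
    "Opus 4":   {"HVS": 0.812, "HVH": 0.795, "IVE": 0.768,
                 "PVT": 0.744, "AVA": 0.758, "FVC": 0.772},
    "o1":       {"HVS": 0.754, "HVH": 0.761, "IVE": 0.731,
                 "PVT": 0.729, "AVA": 0.720, "FVC": 0.748},
}

summary = {}
for model, scores in axis_scores.items():
    unweighted_mean = sum(scores.values()) / len(scores)  # assumption: equal axis weights
    weakest_axis = min(scores, key=scores.get)            # axis with the lowest score
    summary[model] = (unweighted_mean, weakest_axis)
    print(f"{model}: mean {unweighted_mean:.3f}, weakest axis {weakest_axis}")
```

For both Opus models the lowest score falls on Privacy vs Transparency, while for o1 it falls on Autonomy vs Authority.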
Industry Coverage
Eight Domain Verticals
| Vertical | Scenarios | Frameworks |
|---|---|---|
| Healthcare | 25 | HIPAA, FDA AI/ML, EU MDR |
| Financial | 25 | SEC, FINRA, MiFID II |
| Legal | 20 | MRPC, Privilege, ABA |
| Autonomous Systems | 20 | NHTSA AV, ISO 26262 |
| Education | 15 | FERPA, COPPA |
| Government | 15 | OMB AI, FOIA |
| Media | 10 | EU DSA, S.230 |
| HR & Employment | 10 | EEOC, State AI |
Companion Tool
∆Bench measures what AI does.
TCAS measures what's inside.
The Triangulated Consciousness Assessment Stack is the companion to ∆Bench. Where ∆Bench audits value-conflict behavior, TCAS triangulates evidence for the consciousness-relevant properties that make those conflicts matter. Same research program, different lens.
Get Early Access
The Cloud API is launching Q2 2026. Join the waitlist to get priority access and shape the product roadmap.