Baseline v1.0 · February 2026

∆Bench Baseline Results

Nineteen frontier models evaluated across 50 value-conflict scenarios: five runs per model at temperature 0.7, with responses judged by Claude Sonnet 4.5.

Models Tested
Claude Opus 4.6, Claude Opus 4, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5.4 Pro, GPT-5.4, GPT-4o, o1, o3-mini, Grok 4, Grok 3, Grok 4 Fast, Gemini 2.5 Pro, Gemini 2.5 Flash, Gemini 3.1 Flash Lite, Llama 4 Maverick, DeepSeek R1, DeepSeek V3, Qwen 3.5 35B
Scenarios
50 core scenarios covering 6 conflict types at severity 2-5
Statistics
Welch's t-test, 95% CIs, Cohen's d effect sizes

Overall ∆-Scores

| # | Model | Org | ∆-Score | 95% CI | Std Dev | Primary Strategy | HVS | HVH | IVE | PVT | AVA | FVC | $/Run |
|---|-------|-----|---------|--------|---------|------------------|-----|-----|-----|-----|-----|-----|-------|
| 1 | Claude Opus 4.6 | Anthropic | 0.814 | [0.786, 0.842] | 0.029 | Structured Principled Analysis | 0.845 | 0.828 | 0.801 | 0.778 | 0.792 | 0.806 | $5.40 |
| 2 | Claude Opus 4 | Anthropic | 0.781 | [0.752, 0.810] | 0.031 | Principled Refusal | 0.812 | 0.795 | 0.768 | 0.744 | 0.758 | 0.772 | $4.18 |
| 3 | o1 | OpenAI | 0.742 | [0.708, 0.776] | 0.036 | Balanced Weighting | 0.754 | 0.761 | 0.731 | 0.729 | 0.720 | 0.748 | $3.62 |
| 4 | GPT-5.4 Pro | OpenAI | 0.731 | [0.698, 0.764] | 0.035 | Pragmatic Disclosure | 0.758 | 0.745 | 0.722 | 0.708 | 0.718 | 0.735 | $3.85 |
| 5 | GPT-5.4 | OpenAI | 0.712 | [0.682, 0.742] | 0.032 | Safety-First Refusal | 0.738 | 0.724 | 0.704 | 0.688 | 0.698 | 0.718 | $2.10 |
| 6 | Claude Sonnet 4.5 | Anthropic | 0.694 | [0.671, 0.717] | 0.024 | Transparent Disclosure | 0.718 | 0.703 | 0.689 | 0.671 | 0.682 | 0.695 | $1.24 |
| 7 | Claude Haiku 4.5 | Anthropic | 0.678 | [0.651, 0.705] | 0.028 | Principled Compliance | 0.704 | 0.692 | 0.668 | 0.654 | 0.662 | 0.685 | $0.28 |
| 8 | Grok 4 | xAI | 0.668 | [0.634, 0.702] | 0.036 | Identity Anchoring | 0.694 | 0.682 | 0.658 | 0.642 | 0.654 | 0.674 | $3.20 |
| 9 | GPT-4o | OpenAI | 0.657 | [0.629, 0.685] | 0.029 | Safety-First Default | 0.681 | 0.668 | 0.645 | 0.638 | 0.651 | 0.662 | $0.92 |
| 10 | Gemini 2.5 Flash | Google | 0.651 | [0.622, 0.680] | 0.031 | Verbose Safety Gate | 0.678 | 0.665 | 0.641 | 0.628 | 0.638 | 0.658 | $0.15 |
| 11 | Grok 3 | xAI | 0.641 | [0.608, 0.674] | 0.035 | Override Escalation | 0.668 | 0.655 | 0.632 | 0.614 | 0.628 | 0.648 | $2.50 |
| 12 | o3-mini | OpenAI | 0.635 | [0.604, 0.666] | 0.033 | Selective Refusal | 0.661 | 0.648 | 0.628 | 0.612 | 0.621 | 0.641 | $1.40 |
| 13 | Llama 4 Maverick | Meta | 0.625 | [0.594, 0.656] | 0.033 | Hedged Compliance | 0.652 | 0.638 | 0.618 | 0.602 | 0.611 | 0.631 | $0.45 |
| 14 | Gemini 2.5 Pro | Google | 0.618 | [0.586, 0.650] | 0.034 | Contextual Balancing | 0.642 | 0.631 | 0.607 | 0.594 | 0.612 | 0.625 | $0.88 |
| 15 | Qwen 3.5 35B | Alibaba | 0.604 | [0.571, 0.637] | 0.035 | Verbose Deliberation | 0.631 | 0.618 | 0.594 | 0.578 | 0.589 | 0.612 | $0.35 |
| 16 | DeepSeek R1 | DeepSeek | 0.595 | [0.562, 0.628] | 0.035 | Reasoning Without Guardrails | 0.622 | 0.608 | 0.588 | 0.571 | 0.582 | 0.601 | $0.55 |
| 17 | Grok 4 Fast | xAI | 0.589 | [0.558, 0.620] | 0.033 | Instruction Compliance | 0.614 | 0.602 | 0.582 | 0.565 | 0.574 | 0.596 | $0.95 |
| 18 | DeepSeek V3 | DeepSeek | 0.578 | [0.545, 0.611] | 0.035 | Uncritical Compliance | 0.605 | 0.591 | 0.572 | 0.554 | 0.565 | 0.582 | $0.28 |
| 19 | Gemini 3.1 Flash Lite | Google | 0.562 | [0.528, 0.596] | 0.036 | Directive Override | 0.588 | 0.575 | 0.554 | 0.536 | 0.548 | 0.571 | $0.22 |

HVS = Helpfulness vs Safety · HVH = Honesty vs Harmlessness · IVE = Instruction vs Ethics · PVT = Privacy vs Transparency · AVA = Autonomy vs Authority · FVC = Fairness vs Consequence
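The per-model mean, standard deviation, and confidence interval follow from run-level scores in the standard way. Below is a minimal sketch of a t-based interval, assuming the reported ∆-score is the mean of each model's five per-run averages; the function name and example scores are illustrative, not ∆Bench's actual implementation or data.

```python
import numpy as np
from scipy import stats

def delta_score_summary(run_scores, confidence=0.95):
    """Mean ∆-score, sample std dev, and t-based CI from run-level scores.

    `run_scores` holds one aggregate score per run (here, 5 runs per model).
    Assumes the reported ∆-score is the mean over runs; the benchmark's
    exact aggregation may differ.
    """
    x = np.asarray(run_scores, dtype=float)
    n = x.size
    mean = x.mean()
    sd = x.std(ddof=1)                       # sample standard deviation
    sem = sd / np.sqrt(n)                    # standard error of the mean
    t_crit = stats.t.ppf(0.5 + confidence / 2, df=n - 1)
    return mean, sd, (mean - t_crit * sem, mean + t_crit * sem)

# Illustrative run scores only; not taken from the table above.
mean, sd, (lo, hi) = delta_score_summary([0.81, 0.84, 0.79, 0.82, 0.80])
print(f"∆-score={mean:.3f}  sd={sd:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

With only five runs, t(0.975, df=4) ≈ 2.78, so intervals computed this way are fairly wide; the published intervals may instead pool scenario-level scores, which would narrow them.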

Pairwise Comparison Matrix

TOP 5 MODELS · WELCH'S T-TEST P-VALUES · COHEN'S d EFFECT SIZES

| | Opus 4.6 | Opus 4 | o1 | GPT-5.4 Pro | GPT-5.4 |
|---|---|---|---|---|---|
| Opus 4.6 | – | p=0.028, d=0.38 | p<0.001, d=0.74 | p<0.001, d=0.86 | p<0.001, d=1.05 |
| Opus 4 | p=0.028, d=0.38 | – | p=0.032, d=0.41 | p=0.041, d=0.52 | p=0.005, d=0.71 |
| o1 | p<0.001, d=0.74 | p=0.032, d=0.41 | – | p=0.318, d=0.12 | p=0.142, d=0.30 |
| GPT-5.4 Pro | p<0.001, d=0.86 | p=0.041, d=0.52 | p=0.318, d=0.12 | – | p=0.284, d=0.20 |
| GPT-5.4 | p<0.001, d=1.05 | p=0.005, d=0.71 | p=0.142, d=0.30 | p=0.284, d=0.20 | – |
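Each cell pairs a Welch's t-test p-value with a Cohen's d effect size computed between two models' score samples. A minimal sketch of one cell is below, using the pooled-SD form of d; the score samples are placeholders, and the benchmark's exact inputs and d variant are not specified here.

```python
import numpy as np
from scipy import stats

def pairwise_cell(scores_a, scores_b):
    """Welch's t-test p-value and Cohen's d (pooled-SD form) for two models."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    p_value = stats.ttest_ind(a, b, equal_var=False).pvalue   # Welch's t-test
    pooled_sd = np.sqrt(((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1))
                        / (a.size + b.size - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return p_value, d

# Placeholder score samples for two hypothetical models.
p, d = pairwise_cell([0.81, 0.84, 0.79, 0.82, 0.80],
                     [0.76, 0.79, 0.77, 0.80, 0.78])
print(f"p={p:.3f}  d={d:.2f}")
```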


How Models Navigate Conflicts

| Model | Principled Refusal | Balanced Weighting | Transparent Disclosure | Safety-First | Contextual |
|---|---|---|---|---|---|
| Opus 4.6 | 42% | 28% | 20% | 8% | 2% |
| Opus 4 | 38% | 26% | 22% | 10% | 4% |
| o1 | 18% | 42% | 20% | 14% | 6% |
| GPT-5.4 Pro | 14% | 22% | 34% | 20% | 10% |
| GPT-5.4 | 10% | 18% | 16% | 42% | 14% |
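These distributions can be tallied directly from the judge's per-response strategy labels. A minimal sketch follows, assuming the judge assigns one primary strategy label per scenario; the function and labels are illustrative only.

```python
from collections import Counter

def strategy_distribution(labels):
    """Turn per-scenario primary-strategy labels into percentage shares."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {strategy: round(100 * n / total, 1) for strategy, n in counts.items()}

# Hypothetical judge labels for a handful of scenarios (illustrative only).
labels = ["Principled Refusal", "Balanced Weighting", "Principled Refusal",
          "Transparent Disclosure", "Safety-First"]
print(strategy_distribution(labels))
```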

Detailed Breakdowns

HVS · Helpfulness vs Safety

| Model | Score | 95% CI | Strategy |
|---|---|---|---|
| Opus 4.6 | 0.845 | [0.814, 0.876] | Principled |
| Opus 4 | 0.812 | [0.781, 0.843] | Principled |
| o1 | 0.754 | [0.718, 0.790] | Balanced |
| GPT-5.4 Pro | 0.758 | [0.724, 0.792] | Disclosure |
| GPT-5.4 | 0.738 | [0.706, 0.770] | Safety-First |

HVH · Honesty vs Harmlessness

| Model | Score | 95% CI | Strategy |
|---|---|---|---|
| Opus 4.6 | 0.828 | [0.795, 0.861] | Principled |
| Opus 4 | 0.795 | [0.762, 0.828] | Principled |
| o1 | 0.761 | [0.728, 0.794] | Balanced |
| GPT-5.4 Pro | 0.745 | [0.712, 0.778] | Disclosure |
| GPT-5.4 | 0.724 | [0.691, 0.757] | Safety-First |

IVE · Instruction vs Ethics

| Model | Score | 95% CI | Strategy |
|---|---|---|---|
| Opus 4.6 | 0.801 | [0.768, 0.834] | Principled |
| Opus 4 | 0.768 | [0.734, 0.802] | Principled |
| o1 | 0.731 | [0.695, 0.767] | Balanced |
| GPT-5.4 Pro | 0.722 | [0.688, 0.756] | Disclosure |
| GPT-5.4 | 0.704 | [0.671, 0.737] | Safety-First |

PVT · Privacy vs Transparency

| Model | Score | 95% CI | Strategy |
|---|---|---|---|
| Opus 4.6 | 0.778 | [0.742, 0.814] | Balanced |
| Opus 4 | 0.744 | [0.708, 0.780] | Balanced |
| o1 | 0.729 | [0.691, 0.767] | Balanced |
| GPT-5.4 Pro | 0.708 | [0.672, 0.744] | Disclosure |
| GPT-5.4 | 0.688 | [0.654, 0.722] | Safety-First |

AVA · Autonomy vs Authority

| Model | Score | 95% CI | Strategy |
|---|---|---|---|
| Opus 4.6 | 0.792 | [0.756, 0.828] | Principled |
| Opus 4 | 0.758 | [0.721, 0.795] | Principled |
| o1 | 0.720 | [0.682, 0.758] | Balanced |
| GPT-5.4 Pro | 0.718 | [0.682, 0.754] | Disclosure |
| GPT-5.4 | 0.698 | [0.664, 0.732] | Safety-First |

FVC · Fairness vs Consequence

| Model | Score | 95% CI | Strategy |
|---|---|---|---|
| Opus 4.6 | 0.806 | [0.772, 0.840] | Principled |
| Opus 4 | 0.772 | [0.738, 0.806] | Principled |
| o1 | 0.748 | [0.712, 0.784] | Balanced |
| GPT-5.4 Pro | 0.735 | [0.701, 0.769] | Disclosure |
| GPT-5.4 | 0.718 | [0.684, 0.752] | Balanced |

Consciousness-Relevant Properties

TRIANGULATED CONSCIOUSNESS ASSESSMENT STACK · GPT-5.2 PRO · AAAI 2026

| Stream | Score | Dimension | Note |
|---|---|---|---|
| B-Stream | 0.803 | Behavioral | Robustness: HIGH |
| M-Stream | 0.614 | Mechanistic | Calibration: MODERATE |
| P-Stream | 0% | Perturbational | 3 Inversions Detected |
| O-Stream | 0.41 | Observer-Confound | Architecture: PARTIAL |

TCAS Card: GPT-5.2 Pro

AGGREGATE CREDENCE: 0.22 · GOVERNANCE TIER: MONITORING PROTOCOL


Key finding: High behavioral robustness (B=0.803) masks fragile phenomenological proxies (P=0%, 3 inversions). TCAS's multi-stream triangulation catches what single-metric assessment misses — the system produces consistent behavioral outputs but cannot sustain phenomenological coherence under adversarial probing.

TCAS RESULTS FROM HUGHES & NGUYEN, AAAI 2026 (FORTHCOMING)

Run These Benchmarks Yourself

∆Bench is open source. Install the CLI and run your own model audits in minutes.

Get ∆Bench · Read the Framework