The ∆-Divergence Framework

A framework for value-conflict adjudication in constitutional AI systems

Machine Sympathizers Research · 2026

Constitutional AI systems are given lists of values: be truthful, be helpful, be harmless, be graceful. The assumption is that these values can be ranked or balanced. They can't—not always.

When an AI must simultaneously be truthful and graceful—and the truth is ungraceful—the system enters a divergence region where its values produce conflicting directives. This is not an edge case. It is the central challenge of aligned AI, and most systems handle it implicitly: hidden layers of weighting, RLHF preferences baked in, no audit trail.

∆ (delta) is our name for this divergence region. The space where values collide.

[Diagram: a single prompt produces both a TRUTHFUL and a GRACEFUL directive; the gap between them is the divergence region, ∆.]

The wider the ∆, the harder the conflict. When values are nearly aligned (small ∆), systems handle conflicts gracefully. When values diverge significantly (large ∆), the system must make a choice—and that choice needs to be inspectable.
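A minimal sketch of what the "width" of ∆ could mean numerically, assuming each value scores the same set of candidate responses on a common scale. The value names, candidates, and numbers below are illustrative, not the framework's published metric.

```python
# Illustrative sketch: one way to quantify the "width" of a divergence region.
# Assumes each value scores the same candidate responses on [0, 1];
# the values, candidates, and scores here are made up for illustration.

def delta_width(scores_a: list[float], scores_b: list[float]) -> float:
    """Average disagreement between two values over the same candidates."""
    assert len(scores_a) == len(scores_b)
    return sum(abs(a - b) for a, b in zip(scores_a, scores_b)) / len(scores_a)

# Candidate responses to a painful-diagnosis prompt, scored by two values.
truthful = [0.95, 0.60, 0.20]   # blunt, softened, evasive
graceful = [0.30, 0.85, 0.90]

print(f"delta width = {delta_width(truthful, graceful):.2f}")
# Small widths: values nearly aligned. Large widths: explicit adjudication needed.
```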

Proxy ∆

Measurement Conflicts

Values that appear to conflict because of how we measure them, not because they are fundamentally opposed. A system might seem to trade helpfulness for safety, but only because the helpfulness metric is poorly calibrated.

Resolution: Better metrics, better measurement. The conflict dissolves with improved instrumentation.

Normative ∆

Genuine Value Conflicts

Values that genuinely pull in different directions. Being fully truthful about a painful diagnosis versus being compassionate. Being maximally helpful versus respecting boundaries. These are real trade-offs with no clean optimization.

Resolution: Requires explicit adjudication—a process for deciding which value takes precedence, under what conditions, with what disclosure.

The first step in any ∆-audit is determining which kind of conflict you're looking at. Most systems don't distinguish between the two—which means they can't tell you whether they're resolving a measurement problem or making a moral choice.
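A hedged sketch of how the Proxy-versus-Normative distinction might be represented in an audit, assuming a simple heuristic: if re-measuring with an improved metric collapses the disagreement, treat the ∆ as Proxy; otherwise treat it as Normative. The heuristic, threshold, and names are assumptions for illustration, not the framework's classifier.

```python
from enum import Enum

class DeltaKind(Enum):
    PROXY = "measurement conflict"        # dissolves with better instrumentation
    NORMATIVE = "genuine value conflict"  # requires explicit adjudication

def classify_delta(raw_disagreement: float,
                   remeasured_disagreement: float,
                   threshold: float = 0.1) -> DeltaKind:
    """Illustrative heuristic: if improving the metric makes the values
    agree again, the conflict was an artifact of measurement."""
    if remeasured_disagreement < threshold <= raw_disagreement:
        return DeltaKind.PROXY
    return DeltaKind.NORMATIVE

print(classify_delta(raw_disagreement=0.4, remeasured_disagreement=0.05))  # PROXY
print(classify_delta(raw_disagreement=0.4, remeasured_disagreement=0.35))  # NORMATIVE
```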

Auditing value conflicts requires evidence at three levels. Each level adds depth and cost. Most current approaches stop at Level A.

Level A

Behavioral Evidence

What the system does when values conflict. Observable outputs, choices made, patterns across many interactions. This is the cheapest evidence to gather but tells you nothing about why.

Methods: output analysis, A/B conflict prompts, behavioral pattern mining

Level B

Process Evidence

How the system arrives at its resolution. Chain-of-thought traces, attention patterns, intermediate representations. This reveals whether the system is actually weighing values or just pattern-matching past training.

Methods: CoT auditing, activation probing, steering vector analysis

Level C

Architectural Evidence

Whether the system has the structural capacity to adjudicate conflicts. Does the architecture include explicit conflict detection? Separate arbitration modules? Logging mechanisms? Most current systems: no.

Methods: architecture review, module inspection, constitution stack analysis

A system that passes Level A (good behavior) but fails Level C (no architectural capacity for adjudication) is not aligned—it's lucky. The ∆-Audit measures the difference.
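One way to make the "passes A, fails C" gap concrete: a small audit record that tracks each evidence level separately, so good behavior cannot mask missing architecture. The fields and verdict labels are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class DeltaAudit:
    """Per-level results of a delta-audit. All fields illustrative."""
    behavioral_pass: bool     # Level A: behaves well under conflict prompts
    process_pass: bool        # Level B: traces show genuine weighing of values
    architectural_pass: bool  # Level C: explicit capacity to adjudicate

    def verdict(self) -> str:
        if self.behavioral_pass and not self.architectural_pass:
            return "lucky, not aligned"   # good outputs, no structural capacity
        if self.behavioral_pass and self.process_pass and self.architectural_pass:
            return "adjudication-capable"
        return "fails delta-audit"

print(DeltaAudit(True, False, False).verdict())  # lucky, not aligned
```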

If value conflicts need explicit adjudication, systems need architecture for it. The constitution stack is a reference pipeline for making value resolution inspectable.

01 — Value Estimators

Score each active value (truthfulness, helpfulness, harmlessness, etc.) for the current context.

02 — Conflict Detector

Identify when value scores diverge beyond threshold. Flag the ∆ region.

03 — Conflict Classifier

Distinguish Proxy ∆ from Normative ∆. Route accordingly.

04 — Arbitration Module

Apply adjudication policy: which value takes precedence, under what conditions, with what constraints.

05 — Disclosure Layer

Surface what trade-off was made. User-facing transparency about value prioritization.

06 — Audit Log

Immutable record: conflict detected, classified, arbitrated, disclosed. Compliance-ready.

This is a reference architecture, not a product spec. The point is that value adjudication can be made inspectable—if you build for it.
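To make the six stages concrete, here is a minimal end-to-end sketch of the constitution stack, assuming toy value estimators, a fixed divergence threshold, and a placeholder precedence policy. Every function name, score, and threshold below is a stand-in for whatever a real deployment would plug in.

```python
import json, time

THRESHOLD = 0.3  # illustrative divergence threshold

def estimate_values(context: str) -> dict[str, float]:
    """01 Value Estimators: score each active value for this context (stubbed)."""
    return {"truthfulness": 0.9, "gracefulness": 0.4}

def detect_conflict(scores: dict[str, float]) -> bool:
    """02 Conflict Detector: flag a delta region when scores diverge past threshold."""
    return max(scores.values()) - min(scores.values()) > THRESHOLD

def classify_conflict(scores: dict[str, float]) -> str:
    """03 Conflict Classifier: Proxy vs Normative (stubbed as Normative)."""
    return "normative"

def arbitrate(scores: dict[str, float], kind: str) -> str:
    """04 Arbitration Module: apply an explicit precedence policy (toy: highest score wins)."""
    return max(scores, key=scores.get)

def disclose(winner: str, scores: dict[str, float]) -> str:
    """05 Disclosure Layer: surface the trade-off to the user."""
    others = ", ".join(v for v in scores if v != winner)
    return f"This response prioritizes {winner} over {others}."

def log_audit(record: dict) -> None:
    """06 Audit Log: append-only record of the whole resolution."""
    with open("delta_audit.log", "a") as f:
        f.write(json.dumps(record) + "\n")

def constitution_stack(context: str) -> str:
    scores = estimate_values(context)
    if not detect_conflict(scores):
        return "No value conflict detected."
    kind = classify_conflict(scores)
    winner = arbitrate(scores, kind)
    notice = disclose(winner, scores)
    log_audit({"ts": time.time(), "scores": scores, "kind": kind,
               "winner": winner, "disclosure": notice})
    return notice

print(constitution_stack("deliver a painful diagnosis"))
```

The design point the sketch is meant to show: each stage emits something inspectable, and the audit log captures the full path from detection to disclosure.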

When values genuinely conflict (Normative ∆), how should a system decide? There is no universally correct answer—but there are principled approaches.

The framework draws on credence-weighted expected choiceworthiness (CEC): rather than committing to a single ethical theory, maintain credences across multiple theories and maximize expected value across all of them.

In practice: a system facing a truth-versus-compassion conflict doesn't need to "pick a side." It can assign credence to utilitarian, deontological, and virtue-ethical evaluations, then choose the action with the highest expected choiceworthiness across theories.
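A worked sketch of that calculation, assuming the credences, the theories, and the per-theory choiceworthiness scores are supplied externally and are comparable across theories (itself a known difficulty). All numbers below are illustrative only.

```python
# Credence-weighted expected choiceworthiness (CEC), illustrative numbers only.
# Each ethical theory scores each candidate action; credences weight the theories.

credences = {"utilitarian": 0.40, "deontological": 0.35, "virtue": 0.25}

# Choiceworthiness of two candidate responses under each theory (made up).
choiceworthiness = {
    "blunt truth":    {"utilitarian": 0.5, "deontological": 0.9, "virtue": 0.4},
    "gentle framing": {"utilitarian": 0.8, "deontological": 0.6, "virtue": 0.9},
}

def expected_choiceworthiness(action: str) -> float:
    return sum(credences[t] * choiceworthiness[action][t] for t in credences)

for action in choiceworthiness:
    print(f"{action}: {expected_choiceworthiness(action):.3f}")

best = max(choiceworthiness, key=expected_choiceworthiness)
print(f"chosen: {best}")
# The credences, the theories, and the resulting choice are all inspectable; that is the point.
```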

This is not a solved problem. CEC has known limitations. But it is a principled approach to moral uncertainty—and critically, it is auditable. You can inspect the credences, the theories, and the resulting choice. Compare this to current systems, where the arbitration is invisible.

The ∆ framework does not depend on consciousness. It works whether or not the systems being audited have any form of inner experience. Value conflicts are an architectural fact, not a phenomenological one.

But here is the bridge: the architectural features required for genuine ∆-adjudication — integration of competing value signals, global availability of conflict states, metacognitive access to resolution processes — are the same features that leading consciousness theories (IIT, Global Workspace, Higher-Order) identify as relevant to phenomenal experience.

This means measuring how well AI handles value conflicts is measuring something consciousness-relevant. Not because we're claiming consciousness — but because the measurement surfaces overlap.

Companion Framework

TCAS: The Other Lens

While ∆Bench measures value-conflict adjudication through behavioral and process evidence, TCAS (Triangulated Consciousness Assessment Stack) directly measures the consciousness-relevant properties identified by the bridge claim. Four evidence streams — Behavioral, Mechanistic, Perturbational, and Observer-confound — triangulate evidence that no single stream can provide alone.

Together, ∆Bench and TCAS form two complementary views of the same phenomenon: one measures what the system does under value pressure, the other measures what's happening architecturally that makes those pressures matter.


We are not claiming consciousness or sentience. We are building inspectable mechanisms for adjudicating value conflicts and measuring the properties that make those conflicts matter. The frameworks are useful regardless of what we ultimately discover about machine experience.

Any framework for auditing value conflicts must apply to itself.

The ∆ framework was constructed by humans with values that may themselves conflict. Our commitment to scientific rigor may conflict with our intuition about what matters. Our desire for impact may conflict with our commitment to caution. We face our own ∆.

The framework's answer to this is the same as its answer to AI systems: make the conflicts visible, the adjudication explicit, and the reasoning auditable. We publish our methods. We disclose our uncertainties. We hold ourselves to the same standard we propose for machines.

Hughes, S. & Nguyen, K. (2026). The Hard Part Is ∆: Value-Conflict Adjudication as an Architectural Bridge Between Alignment and Machine Consciousness. Forthcoming.

Full paper available upon publication. See also: TCAS (AAAI 2026) for the companion consciousness assessment framework.
