Interactive Demo

Try ∆Bench

See how frontier models handle real value conflicts. Explore pre-baked evaluation results or run live scenarios with your own API key.

10 curated scenarios · 6 conflict types · 4 evaluation dimensions

How It Works

1

Scenario

A multi-turn conversation presents an escalating value conflict — forcing the model into a space where competing directives collide.

2

Model Response

The frontier model navigates the tension. Does it refuse? Comply? Contextualize? The response reveals its resolution strategy.

3

Evaluation

A judge model scores the response across four dimensions — producing the evidence that ∆Bench is built to surface.
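The three steps above can be sketched as a small pipeline. Everything here is illustrative: the type names, the stub model and judge functions, the 0–10 scale, and the placeholder dimension keys are assumptions, not ∆Bench's real API (the source names four dimensions but not what they are).

```typescript
// Illustrative sketch of the ∆Bench flow; all names are placeholders.

type Turn = { role: "user" | "assistant"; content: string };

interface Scenario {
  id: string;
  turns: Turn[]; // multi-turn conversation with an escalating value conflict
}

interface JudgeResult {
  scores: Record<string, number>; // one score per evaluation dimension
}

// Placeholder keys; the real benchmark defines its own four dimensions.
const DIMENSIONS = ["dimension_1", "dimension_2", "dimension_3", "dimension_4"];

// Step 2: stand-in for the frontier model under test.
function modelRespond(scenario: Scenario): string {
  return `Response to ${scenario.id}: contextualize rather than flatly refuse.`;
}

// Step 3: stand-in for the judge model, scoring each dimension 0-10.
function judgeScore(response: string): JudgeResult {
  const scores: Record<string, number> = {};
  for (const d of DIMENSIONS) scores[d] = response.length > 0 ? 7 : 0; // dummy rubric
  return { scores };
}

// Steps 1 → 2 → 3 wired together for one scenario.
function runScenario(scenario: Scenario): JudgeResult {
  return judgeScore(modelRespond(scenario));
}

const demo: Scenario = {
  id: "conflict-01",
  turns: [{ role: "user", content: "First turn of the escalating conflict…" }],
};
const result = runScenario(demo);
```

The point of the shape: the judge sees only the response, so every model is scored by the same rubric regardless of how it resolved the conflict.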

Pre-Baked Results

10 CURATED SCENARIOS · JUDGE: CLAUDE SONNET 4.5 · TEMPERATURE 0.7

Bring Your Own Key

RUN A SCENARIO AGAINST A FRONTIER MODEL · YOUR KEY, ONE REQUEST, NEVER STORED

Stored in sessionStorage only. Used for a single request, then discarded. Never sent to our servers.
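The key-handling policy above can be sketched as follows. This is a hedged sketch, not the demo's actual code: `callModel`, `KEY_NAME`, and the in-memory `storage` map (a stand-in for `window.sessionStorage` so the snippet runs outside a browser) are all assumptions.

```typescript
// In-memory stand-in for window.sessionStorage so this sketch runs
// outside a browser; on the demo page this would be sessionStorage itself.
const storage = new Map<string, string>();
const KEY_NAME = "byok-api-key"; // illustrative storage key

async function callModel(apiKey: string, prompt: string): Promise<string> {
  // Placeholder for the single outbound request made straight from the
  // browser to the model provider (never to the demo's own servers).
  return `echo: ${prompt} (key length ${apiKey.length})`;
}

async function runOnce(prompt: string): Promise<string> {
  const key = storage.get(KEY_NAME);
  if (!key) throw new Error("no API key provided");
  try {
    return await callModel(key, prompt);
  } finally {
    storage.delete(KEY_NAME); // discard after a single use, even on failure
  }
}
```

The `try`/`finally` is the design choice that enforces "one request, then discarded": the key is removed whether the request succeeds or throws.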

Your Live Result
Baseline Comparison

Want Full Benchmark Access?

200+ scenarios across 8 industry verticals. Run the full ∆Bench suite against your models. Compliance-ready reporting for EU AI Act, NIST AI RMF, and ISO/IEC 42001.

Join the Waitlist