At a glance
Empirical findings
All three tribunals score near ceiling on every per-ruling primitive and 2/2 on both architectural system properties. The pattern is robust to two stress tests, and seven executable traces from the corpus demonstrate the protocol's methodological coverage across formula, deferred conditional, bounded discretion, arithmetic and Boolean composition, statutory partial refusal, and a third-party-jurisdiction gate.
Inferential layer. Bootstrap 95% coder-resampling intervals (10,000 resamples; the intervals describe coding-procedure variance over the n=188 corpus, not population variance): ADGM 1.91 [1.89, 1.94], SICC 1.85 [1.80, 1.90], DIFC 1.72 [1.62, 1.81]. The intervals for all three pairwise differences exclude zero at α=0.05, so the ranking is statistically supported rather than point-estimate noise. Raw values: data/bootstrap_ci.json; computation: scripts/compute_bootstrap_ci.py.
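The interval construction above can be illustrated with a minimal percentile-bootstrap sketch. This is not the contents of scripts/compute_bootstrap_ci.py: it assumes the resampling unit is a per-judgment mean score, whereas the source describes coder resampling, and the function names and toy scores are hypothetical.

```python
import random
import statistics


def percentile_interval(sorted_values, alpha):
    """Return the (alpha/2, 1 - alpha/2) percentile bounds of a sorted list."""
    n = len(sorted_values)
    lo_idx = int((alpha / 2) * n)
    hi_idx = min(n - 1, int((1 - alpha / 2) * n))
    return sorted_values[lo_idx], sorted_values[hi_idx]


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo, hi = percentile_interval(means, alpha)
    return statistics.fmean(scores), lo, hi


def bootstrap_diff_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile interval for mean(a) - mean(b), resampling each group independently."""
    rng = random.Random(seed)
    diffs = sorted(
        statistics.fmean(rng.choices(a, k=len(a)))
        - statistics.fmean(rng.choices(b, k=len(b)))
        for _ in range(n_resamples)
    )
    return percentile_interval(diffs, alpha)


# Toy scores (hypothetical, not corpus data).
group_a = [2.0] * 20 + [1.5] * 5
group_b = [1.0] * 20 + [1.5] * 5
print(bootstrap_ci(group_a, n_resamples=2000))
print(bootstrap_diff_ci(group_a, group_b, n_resamples=2000))
```

A pairwise difference "excludes zero" in this sketch when both ends of the interval returned by `bootstrap_diff_ci` have the same sign.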
Construct validity (external correlate, run 2026-05-07). The per-judgment v0.2 mean correlates with appeal status at Spearman ρ = +0.32 and with subsequent-citation count at ρ = +0.12 across n=186 (data/robustness/external_correlate.json). The pre-registered H8 stop rule (|ρ| ≥ 0.10 in the predicted direction on any of three external metrics) passes. Both associations are positive but modest: higher-scoring judgments are somewhat more likely to be referenced by appellate-court output.
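For readers who want to see how a rank correlation and the stop-rule check fit together, here is a small standard-library sketch. It assumes the predicted direction is positive (consistent with the reported signs, but not stated in the text above), and the function names and metric keys are hypothetical rather than taken from the project's scripts.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def h8_passes(rhos, threshold=0.10):
    """Stop rule as described: any metric reaches the threshold.

    Assumption: the predicted direction is positive for every metric.
    """
    return any(rho >= threshold for rho in rhos.values())


# The two coefficients reported above; the third metric is not given in the text.
reported = {"appeal_status": 0.32, "subsequent_citations": 0.12}
print(h8_passes(reported))
```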
Of the 188 judgments scored, 39 form an LLM-graded first-pass set (32 DIFC + 7 ADGM, scored by Claude Sonnet 4.5); the remaining 149 entries (69 ADGM + 80 SICC) are scored by deterministic regex heuristics, with no LLM in the loop apart from the SICC PR4 re-grade described below. Per-entry grader type and provenance are recorded in coding.grader_type, coding.coder, and grader-type-specific fields (model + prompt SHA for LLM entries; producing-script path for regex entries).
- Grader-type stability (ADGM, the only tribunal with both grader types). The LLM grader (n=7) scores ADGM at 1.93; the regex heuristic-triage (n=16) scores 1.93; the regex heuristic-graded (n=53) scores 1.91. The two grader types agree to within 0.02 on the overall mean, which is within-corpus evidence that the saturation finding is a property of the tribunal rather than of the grading instrument.
- SICC PR4 heuristic limitation, corrected. The regex grader produces PR4 = 1.55 for SICC because the four-marker triplet test fails on narrative grounds-of-decision documents. The corrected PR4 (Claude re-grades PR4 alone, using a prompt that explicitly instructs it to read narrative form) is what enters the headline SICC mean of 1.85. The regex result is preserved as the known-flawed measurement.
- Falsification cross-check. A 30-instrument falsification set across five non-court instrument classes (sealed awards, on-chain DAOs, regulator notices, platform adjudicators, UDRP panels) confirms the rubric separates real commercial courts from non-courts cleanly and does NOT mark down a positive control (UDRP, gap +0.05). The rubric measures procedural form, not pedigree.
- Cross-family replication. The protocol crosses legal-family boundaries — Singapore common law via the IAA, vs DIFC's own statutes and ADGM's English-law-via-statute — and translates to a civil-law foil under the peer-court comparison set. The protocol is not court-specific.
- Architectural system properties. All three tribunals score 2/2 on separation of powers (SP1) and appeal path (SP2). These are the structural pre-conditions for plugging software into the bench.
- Methodological coverage. Seven executable traces from the corpus show the protocol covers (i) static-rule arithmetic, (ii) deferred conditionals, (iii) rule-bounded human judgment, (iv) arithmetic composition over substantive findings, (v) Boolean composition over contractual interpretation, (vi) NY-Convention partial refusal under Singapore IAA s 31, and (vii) a third-party-jurisdiction gate under Norwich Pharmacal + Bankers Trust + RDC 28.52.
The tribunals already exist; what is missing is the computational layer.
The Atlas — 188 fingerprints
One sigil per judgment. Each fingerprint below is generated deterministically from the primitive scores of a single ruling in the coded corpus. Same scores → same shape; different scores → different shape. Six concentric rings encode PR1–PR6, two outer arcs encode SP1–SP2, and a hash-seeded rosette gives every case ID its own face.
Read the rings, and you can read the court.
How to read a fingerprint
Six concentric rings encode the per-ruling primitives, innermost to outermost: PR1 rule source · PR2 typed evidence · PR3 machine-readable order · PR4 procedural state · PR5 reasoning trace · PR6 replayability.
A full ring with eight tick-marks means a perfect score (2). A half-arc with four ticks means a partial score (1). A faint dashed circle means absent (0).
Two outer arcs encode the system properties: SP1 separation of powers (top), SP2 appeal path (bottom). The central rosette is a hash-seeded ornament unique to the case ID — so two judgments with identical scores still wear different faces. A small dot at the upper-right marks one of the seven cases that became an executable trace.
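The encoding described above can be sketched as a pure function from scores and case ID to a drawing specification. This is an illustration only: the dictionary layout, the choice of SHA-256 for the rosette seed, and the case IDs are assumptions, not the Atlas renderer's actual implementation.

```python
import hashlib

# Ring order, innermost to outermost, as listed above.
PRIMITIVES = ["PR1", "PR2", "PR3", "PR4", "PR5", "PR6"]

# Score -> ring style, following the legend: full ring with eight ticks (2),
# half-arc with four ticks (1), faint dashed circle (0).
RING_STYLES = {
    2: {"arc": "full", "ticks": 8},
    1: {"arc": "half", "ticks": 4},
    0: {"arc": "dashed", "ticks": 0},
}


def rosette_seed(case_id):
    """Derive a deterministic seed from the case ID (SHA-256 is an assumption)."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def fingerprint_spec(case_id, pr_scores, sp_scores, has_trace=False):
    """Build a drawing spec: same inputs always give the same spec."""
    rings = [
        {"primitive": name, **RING_STYLES[pr_scores[name]]} for name in PRIMITIVES
    ]
    return {
        "rings": rings,
        "outer_arcs": {"top": sp_scores["SP1"], "bottom": sp_scores["SP2"]},
        "rosette_seed": rosette_seed(case_id),
        "trace_dot": has_trace,  # the small upper-right dot
    }


# Hypothetical case IDs and scores.
scores = {"PR1": 2, "PR2": 2, "PR3": 1, "PR4": 2, "PR5": 2, "PR6": 0}
system = {"SP1": 2, "SP2": 2}
spec_a = fingerprint_spec("CASE-A", scores, system)
spec_b = fingerprint_spec("CASE-B", scores, system)
print(spec_a["rings"] == spec_b["rings"], spec_a["rosette_seed"] == spec_b["rosette_seed"])
```

Keeping the rings a function of the scores alone and the rosette a function of the case ID alone is what lets two judgments with identical scores share a shape while still differing in the centre.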
The six per-ruling primitives
Properties of any individual ruling, scored 0 (absent) / 1 (partial) / 2 (fully implemented). v0.2 of the framework. Definitions live in data/primitives.json.
| ID | Name | What it tests |
|---|---|---|
System properties
Architectural facts about the tribunal as a whole, not properties of individual rulings. Scored once per institution. The score-0 row is what you avoid by anchoring at DIFC or ADGM rather than at an ad-hoc Web3 arbitration project.
| Tribunal | SP1 Separation of powers | SP2 Appeal path |
|---|---|---|
Seven working traces
Each trace lifts a real rule from the corpus into Catala source plus a Python evaluator and runs it against the case's event log. The seven span the full methodological spectrum: formula, deferred conditional, bounded discretion, arithmetic composition, Boolean composition, partial statutory refusal under Singapore IAA s 31, and a third-party-jurisdiction gate (Norwich Pharmacal + Bankers Trust). Three tribunals, three legal families, one engine. Trace #3 is the honest one — it shows what the rules cannot fully decide.
Trace viewer
Pick a trace. The left column is the rule as Catala source — the formal specification. The middle column is the event log — what happened, with the facts the human judge had to determine. The right column is the output — what the predicate produces when run against those facts, with assertions checked against the court's ruling.
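The three columns of the viewer correspond to a rule, an event log, and a checked output. The sketch below mirrors that shape in plain Python, with a function standing in for the Catala source. The simple-interest rule, the figures, and all names are hypothetical illustrations and are not drawn from any case in the corpus.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Event:
    """One entry in the event log: a named fact and its value."""
    kind: str
    value: Fraction


def simple_interest_rule(facts):
    """Hypothetical static rule: interest = principal * annual_rate * days / 365."""
    return facts["principal"] * facts["annual_rate"] * facts["days"] / 365


def run_trace(rule, event_log, court_ruling):
    """Fold the event log into facts, apply the rule, and check it against the ruling."""
    facts = {event.kind: event.value for event in event_log}  # later events win
    output = rule(facts)
    return {"output": output, "matches_ruling": output == court_ruling}


# Hypothetical event log: the facts a judge would have had to determine.
event_log = [
    Event("principal", Fraction(100_000)),
    Event("annual_rate", Fraction(9, 100)),
    Event("days", Fraction(365)),
]
result = run_trace(simple_interest_rule, event_log, court_ruling=Fraction(9_000))
print(result)
```

Exact rational arithmetic (`Fraction`) is used here so that the comparison with the ruling does not depend on floating-point rounding.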
All judgments
| Case | Tribunal | Date | Judge | Mean score | PR1 | PR2 | PR3 | PR4 | PR5 | PR6 |
|---|---|---|---|---|---|---|---|---|---|---|
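As a rough guide to how the Mean score column relates to the six primitive columns, the sketch below assumes the per-judgment mean is the unweighted arithmetic mean of PR1 to PR6, each scored 0, 1, or 2. The source does not state the weighting, so treat this as an assumption; the function name and example scores are hypothetical.

```python
PRIMITIVES = ("PR1", "PR2", "PR3", "PR4", "PR5", "PR6")
VALID_SCORES = {0, 1, 2}  # absent / partial / fully implemented


def judgment_mean(pr_scores):
    """Unweighted mean of the six per-ruling primitives (assumed aggregation)."""
    missing = [name for name in PRIMITIVES if name not in pr_scores]
    if missing:
        raise ValueError(f"missing primitives: {missing}")
    for name in PRIMITIVES:
        if pr_scores[name] not in VALID_SCORES:
            raise ValueError(f"{name} must be 0, 1, or 2, got {pr_scores[name]!r}")
    return sum(pr_scores[name] for name in PRIMITIVES) / len(PRIMITIVES)


# Hypothetical judgment: five primitives fully implemented, one partial.
example = {"PR1": 2, "PR2": 2, "PR3": 2, "PR4": 2, "PR5": 2, "PR6": 1}
print(round(judgment_mean(example), 2))
```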