The Numbers Don't Lie

The compiler test suite now runs 302 tests across seven phases, with a live dashboard on the homepage showing pass rates, phase breakdowns, and compilation benchmarks.

Document status: INFORMATIVE, DEVELOPMENT JOURNAL

Documenting the compiler test dashboard and why making test results public matters. Single canonical copy. February 2026.

Show, don’t claim

Most project pages say “well-tested” somewhere. It is a meaningless claim without evidence. The compiler test dashboard on the homepage exists so that engineers visiting urd.dev can verify the assertion themselves — phase by phase, test by test, with real numbers from the last CI run.

The dashboard is not a vanity metric. It is the project’s answer to a reasonable question: is this real, or is it vapourware?

What the dashboard shows

Three tabs. No decoration.

Phases. Every compiler phase — PARSE, IMPORT, LINK, VALIDATE, EMIT, E2E, and scaffolding — with its pass count, progress bar, diagnostic code range, duration, and category breakdown. Each phase card expands to show individual test names and their results. An engineer can drill from “302 tests passing” down to “the frontmatter_basic test in the PARSE phase passed” in two clicks.

Compliance. Six architecture compliance checks that the test suite validates:

| Check | What it proves |
| --- | --- |
| Deterministic output | Same source, same compiler version, same JSON: byte-identical |
| Error recovery | Mark damaged constructs and continue; no cascading failures |
| Phase contracts | Input/output types enforced at every phase boundary |
| Diagnostic ownership | Each phase owns a non-overlapping code range (URD100–URD599) |
| ID derivation | Entity IDs, section IDs, choice IDs generated by consistent rules |
| Annotation model | Forward references resolved via symbol table annotations |
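To make the diagnostic ownership check concrete, here is a minimal sketch of how non-overlapping code ranges could be verified. The per-phase split (100 codes each) is an assumption for illustration; the article only states that the phases share URD100–URD599 without overlap. The names CodeRange, find_overlap, and owner_of are hypothetical and not taken from the compiler.

```rust
/// An inclusive range of diagnostic codes owned by one phase.
#[derive(Debug, Clone, Copy)]
struct CodeRange {
    phase: &'static str,
    start: u32,
    end: u32,
}

// ASSUMED layout: the source gives only the overall span URD100 to URD599.
const RANGES: [CodeRange; 5] = [
    CodeRange { phase: "PARSE", start: 100, end: 199 },
    CodeRange { phase: "IMPORT", start: 200, end: 299 },
    CodeRange { phase: "LINK", start: 300, end: 399 },
    CodeRange { phase: "VALIDATE", start: 400, end: 499 },
    CodeRange { phase: "EMIT", start: 500, end: 599 },
];

/// Returns the first pair of phases whose ranges overlap, if any.
fn find_overlap(ranges: &[CodeRange]) -> Option<(&'static str, &'static str)> {
    for (i, a) in ranges.iter().enumerate() {
        for b in &ranges[i + 1..] {
            // Two inclusive ranges overlap when each starts before the other ends.
            if a.start <= b.end && b.start <= a.end {
                return Some((a.phase, b.phase));
            }
        }
    }
    None
}

/// Maps a code such as "URD230" to the phase that owns it.
fn owner_of(code: &str, ranges: &[CodeRange]) -> Option<&'static str> {
    let n: u32 = code.strip_prefix("URD")?.parse().ok()?;
    ranges
        .iter()
        .find(|r| r.start <= n && n <= r.end)
        .map(|r| r.phase)
}

fn main() {
    assert_eq!(find_overlap(&RANGES), None);
    println!("URD230 is owned by {:?}", owner_of("URD230", &RANGES));
}
```

A check of this shape is cheap enough to run as an ordinary unit test, which is one way a compliance row on a dashboard can be backed by an assertion rather than a claim.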

Benchmarks. Per-file compilation timing with phase-level breakdown. Source bytes in, output bytes out, total milliseconds, and the split across parse/import/link/validate/emit. Aggregate throughput in bytes per millisecond.
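As a rough illustration of the throughput figure, the sketch below sums per-phase timings into a per-file total and divides total source bytes by total milliseconds. The struct and field names are hypothetical, the numbers are made up, and the real report may define aggregate throughput differently (for example, as a mean of per-file rates).

```rust
/// One benchmark row. Field names are illustrative, not the real schema.
struct FileBench {
    name: &'static str,
    source_bytes: u64,
    output_bytes: u64,
    /// Milliseconds for parse, import, link, validate, emit (in that order).
    phase_ms: [f64; 5],
}

impl FileBench {
    /// Total compile time for this file: the sum of its phase timings.
    fn total_ms(&self) -> f64 {
        self.phase_ms.iter().sum()
    }
}

/// ASSUMED definition: total source bytes divided by total milliseconds.
/// Returns None when no time was recorded, to avoid dividing by zero.
fn aggregate_throughput(rows: &[FileBench]) -> Option<f64> {
    let bytes: u64 = rows.iter().map(|r| r.source_bytes).sum();
    let ms: f64 = rows.iter().map(FileBench::total_ms).sum();
    if ms > 0.0 {
        Some(bytes as f64 / ms)
    } else {
        None
    }
}

fn main() {
    // Made-up numbers, purely for illustration.
    let rows = [
        FileBench {
            name: "example-a.urd.md",
            source_bytes: 2000,
            output_bytes: 5000,
            phase_ms: [0.5, 0.125, 0.125, 0.125, 0.125],
        },
        FileBench {
            name: "example-b.urd.md",
            source_bytes: 1000,
            output_bytes: 2500,
            phase_ms: [0.25, 0.0625, 0.0625, 0.0625, 0.0625],
        },
    ];
    for r in &rows {
        println!(
            "{}: {} B in, {} B out, {:.3} ms",
            r.name,
            r.source_bytes,
            r.output_bytes,
            r.total_ms()
        );
    }
    if let Some(t) = aggregate_throughput(&rows) {
        println!("aggregate: {:.1} bytes/ms", t);
    }
}
```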

The pipeline

The test report is a generated artifact, not a committed file. The pipeline:

cargo test ─→ parse stdout ─→ test-report.json ─→ copy to site ─→ Astro builds it in

Locally, pnpm build:full chains the steps. In CI, the deploy workflow installs the Rust toolchain, runs the test suite, copies the report into the site’s data directory, and builds. Every push to main that touches packages/compiler/** or sites/urd.dev/** triggers a fresh deploy with current numbers.

The report is generated by scripts/compiler-test-report.mjs — a zero-dependency Node.js script that runs cargo test, parses the output, runs the benchmark harness, and writes structured JSON conforming to scripts/compiler-test-report.schema.json.
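The "parse the output" step can be sketched as follows. The actual script is Node.js; this sketch is in Rust only to keep one language across the examples, and it is not the project's implementation. It relies on libtest's default line format (`test <name> ... ok`). Treating the first module path segment as the phase is an assumption, and the test names other than frontmatter_basic are invented.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Outcome {
    Passed,
    Failed,
    Ignored,
}

#[derive(Debug, PartialEq, Eq)]
struct TestResult {
    name: String,
    outcome: Outcome,
}

/// Parses one line of default `cargo test` output, e.g. "test a::b ... ok".
/// Summary lines such as "test result: ok. ..." contain no " ... " and are skipped.
fn parse_line(line: &str) -> Option<TestResult> {
    let rest = line.trim().strip_prefix("test ")?;
    let (name, status) = rest.rsplit_once(" ... ")?;
    let outcome = match status.trim() {
        "ok" => Outcome::Passed,
        "FAILED" => Outcome::Failed,
        s if s.starts_with("ignored") => Outcome::Ignored,
        _ => return None,
    };
    Some(TestResult { name: name.to_string(), outcome })
}

fn parse_report(stdout: &str) -> Vec<TestResult> {
    stdout.lines().filter_map(parse_line).collect()
}

/// Groups results into (passed, total) per phase.
/// ASSUMPTION: the phase is the first `::`-separated segment of the test name.
fn passed_by_phase(results: &[TestResult]) -> BTreeMap<String, (usize, usize)> {
    let mut map = BTreeMap::new();
    for r in results {
        let phase = r.name.split("::").next().unwrap_or("").to_uppercase();
        let entry = map.entry(phase).or_insert((0usize, 0usize));
        entry.1 += 1;
        if r.outcome == Outcome::Passed {
            entry.0 += 1;
        }
    }
    map
}

fn main() {
    // Sample output; test names other than frontmatter_basic are made up.
    let stdout = "running 3 tests\n\
        test parse::frontmatter_basic ... ok\n\
        test link::unresolved_ref ... FAILED\n\
        test emit::large_world ... ignored\n\
        \n\
        test result: FAILED. 1 passed; 1 failed; 1 ignored; 0 measured; 0 filtered out; finished in 0.01s\n";
    let results = parse_report(stdout);
    for (phase, (passed, total)) in passed_by_phase(&results) {
        println!("{}: {}/{} passed", phase, passed, total);
    }
}
```

Parsing human-readable output is brittle if the format changes, which is one reason to validate the resulting JSON against a schema as the pipeline does.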

The benchmark harness

Benchmarks use a small Rust binary (packages/compiler/src/bin/bench.rs) that accepts a .urd.md file, compiles it through all five phases with per-phase timing via std::time::Instant, and prints a JSON line to stdout. The Node.js script runs this binary in release mode against every fixture file — release mode because debug builds are 10–50x slower and would produce misleading numbers.
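The per-phase timing pattern might look roughly like the sketch below. It is not the contents of bench.rs: the phase functions are trivial stand-ins, and the JSON field names are assumptions. Only the use of std::time::Instant around each phase and the single JSON line on stdout come from the description above.

```rust
use std::time::{Duration, Instant};

/// Stand-in phase signature (hypothetical). Real phases have distinct
/// input and output types; strings keep the sketch short.
type Phase = fn(String) -> String;

fn identity(s: String) -> String {
    s
}

fn to_json(s: String) -> String {
    format!("{{\"bytes\":{}}}", s.len())
}

const PHASES: [(&str, Phase); 5] = [
    ("parse", identity),
    ("import", identity),
    ("link", identity),
    ("validate", identity),
    ("emit", to_json),
];

fn ms(d: Duration) -> f64 {
    d.as_secs_f64() * 1000.0
}

/// Runs every phase in order, timing each one, and returns one JSON line.
/// The file name is not escaped, so this assumes it contains no quotes.
fn bench(file: &str, source: &str) -> String {
    let mut data = source.to_string();
    let mut fields = Vec::new();
    let mut total = Duration::ZERO;
    for &(name, phase) in PHASES.iter() {
        let start = Instant::now();
        data = phase(data);
        let elapsed = start.elapsed();
        total += elapsed;
        fields.push(format!("\"{}_ms\":{:.4}", name, ms(elapsed)));
    }
    format!(
        "{{\"file\":\"{}\",\"source_bytes\":{},\"output_bytes\":{},{},\"total_ms\":{:.4}}}",
        file,
        source.len(),
        data.len(),
        fields.join(","),
        ms(total)
    )
}

fn main() {
    println!("{}", bench("example.urd.md", "# a tiny stand-in source\n"));
}
```

Instant is a monotonic clock, so the measured durations are not affected by wall-clock adjustments during a run.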

The fixtures include the canonical test files: the two-room key puzzle, the tavern scene, the interrogation, and the Monty Hall problem. Each file exercises different parts of the grammar and different compiler paths.

Why this matters for a clinical system

The CLAUDE.md at the root of this repository opens with: “Important: The production code will run in a hospital.” That is not a marketing line. Urd and Wyrd will be used as a clinical narrative and decision support layer.

In that context, “the tests pass” is not sufficient. Engineers evaluating whether to integrate with Urd need to see which tests, how many, across which phases, and how fast. They need to see that error recovery works — that a malformed import does not crash the compiler but produces a structured diagnostic with a code, a message, and a suggestion. They need to see that output is deterministic — that the same source always produces the same JSON.

The dashboard makes all of this inspectable. Not in a README that might be stale, but in a live artifact generated from the current test suite on every deploy.

The numbers today

As of this writing:

  • 302 tests across 7 phases
  • 100% pass rate
  • 6 architecture compliance checks, all passing
  • 4 benchmark fixtures compiled in release mode
  • Five compiler phases fully implemented: PARSE, IMPORT, LINK, VALIDATE, EMIT
  • 14 end-to-end integration tests proving the full pipeline

These numbers will change. The dashboard will reflect that on every deploy. That is the point.

What comes next

The test dashboard is the last piece of the compiler’s public interface. The compiler itself — all five phases, 302 tests, structured diagnostics, deterministic output — is complete for v0.1. The next phase is the reference runtime: Wyrd.

Wyrd loads .urd.json, executes the world, and produces events. The validation milestone — compile the Monty Hall problem, run it ten thousand times, prove the switching advantage converges to two-thirds — requires both the compiler and the runtime. The compiler is ready. Wyrd is next.
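The statistical claim behind that milestone can be checked independently of both tools. The sketch below is a standalone simulation, not Wyrd and not compiled Urd output. It uses a small SplitMix64 generator so it needs no external crates; the seed and function names are arbitrary choices for illustration.

```rust
/// Small deterministic PRNG (SplitMix64), used to avoid external crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }

    /// Door index in 0..3. Modulo bias is negligible for a 64-bit draw.
    fn door(&mut self) -> u8 {
        (self.next_u64() % 3) as u8
    }
}

/// Plays one game and returns true if the player who switches wins.
fn switch_wins(rng: &mut SplitMix64) -> bool {
    let car = rng.door();
    let pick = rng.door();
    // The host opens a goat door: neither the player's pick nor the car.
    let candidates: Vec<u8> = (0..3).filter(|d| *d != pick && *d != car).collect();
    let opened = candidates[(rng.next_u64() % candidates.len() as u64) as usize];
    // Switching means taking the one door that is neither picked nor opened.
    let switched = (0..3).find(|d| *d != pick && *d != opened).unwrap();
    switched == car
}

/// Fraction of games won by always switching.
fn switch_rate(trials: u32, seed: u64) -> f64 {
    let mut rng = SplitMix64(seed);
    let wins = (0..trials).filter(|_| switch_wins(&mut rng)).count();
    wins as f64 / trials as f64
}

fn main() {
    let rate = switch_rate(10_000, 42);
    println!("switch win rate over 10,000 games: {:.4}", rate);
}
```

With ten thousand games the standard deviation of the estimate is about 0.005, so a result far from 0.667 would point to a bug rather than bad luck.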

This article is part of the Urd development journal at urd.dev.