A fabricated case citation is the new typo, except it can lose a court filing, a regulator submission, or a client. The legal profession has spent two years finding this out the hard way. The fix isn't to ban the model, or to wrap it in disclaimers, or to make a paralegal re-read every paragraph at midnight. The fix is structural: treat every claim the model makes as untrusted input, and put an auditor agent between the model and the human whose name goes on the document. If the auditor can't reach the cite, the build fails. Same idea as a unit test. Same idea as a linter. Same red square in the pipeline.
What the auditor agent actually is
An auditor agent is a small, boring, deterministic-as-possible program that sits downstream of a generative model and asks one question of every assertion: can I verify this against a source the system trusts? It doesn't write. It doesn't reason creatively. It doesn't help. Its job is to fail loudly.
The architecture I keep coming back to has four moving parts:
- A claim extractor that walks the model's output and pulls out every factual assertion, every citation, every named entity, every date, every quoted passage.
- A resolver that maps each claim to a canonical source — a case database, a statute register, a court filing system, an internal document store — and tries to fetch it.
- A matcher that compares the claim against the fetched source and decides whether the source supports it, contradicts it, or is silent on it.
- A gate that aggregates the verdicts and returns pass / fail / human-review to whatever called it.
None of that is exotic. The hard parts are in the seams. But the shape is the shape, and once you have the shape you can argue about what "verified" means in your jurisdiction without starting over every time.
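To make the shape concrete, here is a minimal sketch in Python. The extractor, resolver, and matcher are stand-ins the caller supplies; none of the names are a real library, they just make the seams explicit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Verdict(Enum):
    VERIFIED = "verified"
    UNRESOLVED = "unresolved"
    CONTRADICTED = "contradicted"
    HUMAN_REVIEW = "human-review"

@dataclass
class Claim:
    kind: str        # "citation", "quote", "date", "number", "entity"
    text: str        # the assertion as it appears in the draft
    reference: str   # e.g. a neutral citation or document id, if any

@dataclass
class Finding:
    claim: Claim
    verdict: Verdict
    note: str = ""

def audit(draft: str,
          extract: Callable[[str], list[Claim]],
          resolve: Callable[[Claim], str | None],
          match: Callable[[Claim, str], Verdict]) -> list[Finding]:
    """Extractor -> resolver -> matcher: one finding per claim."""
    findings = []
    for claim in extract(draft):
        source_text = resolve(claim)  # fetch from a source the system trusts
        if source_text is None:
            findings.append(Finding(claim, Verdict.UNRESOLVED,
                                    "no canonical source found"))
        else:
            findings.append(Finding(claim, match(claim, source_text)))
    return findings

def gate(findings: list[Finding]) -> str:
    """Aggregate the verdicts into pass / fail / human-review."""
    if any(f.verdict in (Verdict.UNRESOLVED, Verdict.CONTRADICTED) for f in findings):
        return "fail"
    if any(f.verdict is Verdict.HUMAN_REVIEW for f in findings):
        return "human-review"
    return "pass"
```

Everything interesting lives inside the three callables. The gate stays dumb on purpose: the aggregation policy is the part you will end up arguing about with risk and compliance, and it should be arguable without touching the extractor or the matcher.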
Why "the model checks itself" doesn't work
The temptation is to ask the same model that produced the citation to also confirm it. This fails for a reason that is structural, not anecdotal: a model that hallucinated Smith v Jones [2019] IEHC 412 with confidence will, when asked, also hallucinate that it confirmed the cite. You can't audit a system using the system. The auditor has to be epistemically separate. Different process, different inputs, different failure modes.
Practically, this means the resolver step has to hit a real index. A real court database. A real statute API. A real filing record. Not a vector store of "things the model has seen before". A vector store will gladly tell you a fictional case is similar to other fictional cases. The whole point is to pop the bubble.
Citation check AI: the part everyone underestimates
People hear "citation check AI" and assume the hard problem is matching strings. It isn't. The hard problem is what counts as a match.
Consider an Irish High Court judgment cited in a brief. The auditor needs to:
- Resolve the neutral citation to a real document — not a record that the document exists, but the actual text.
- Confirm the parties, the year, and the court.
- Locate the paragraph the model claims supports the proposition.
- Compare the model's paraphrase against that paragraph and decide whether the paraphrase is fair, narrow, broad, or wrong.
- Flag if the case has been overturned, distinguished, or doubted in later authority.
That last step is where most "citation checkers" quietly fall over. A real cite that no longer means what the brief claims it means is, for litigation purposes, worse than a hallucinated cite. The hallucinated one gets caught. The stale one walks straight into the courtroom.
So the auditor needs not just retrieval but a notion of currency. Was this good law on the date the document was filed? It needs a clock and a citator graph. This is engineering work, not prompt work.
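A sketch of what that currency check might look like, assuming a hypothetical citator graph keyed by neutral citation; the treatment labels are illustrative, not any real vendor's schema.

```python
from datetime import date

# Hypothetical citator graph: neutral citation -> later treatments of the case.
CITATOR: dict[str, list[dict]] = {
    # "[2019] IEHC 412": [{"treatment": "overruled",
    #                      "by": "[2023] IESC 10",
    #                      "on": date(2023, 3, 1)}],
}

NEGATIVE = {"overruled", "reversed", "doubted", "not followed"}

def currency(neutral_citation: str, as_of: date) -> str:
    """Was this citation still good law on the filing date?"""
    treatments = CITATOR.get(neutral_citation, [])
    negative = [t for t in treatments
                if t["treatment"] in NEGATIVE and t["on"] <= as_of]
    if negative:
        return "stale"          # negatively treated before the filing date
    if treatments:
        return "human-review"   # later treatment exists; a lawyer decides
    return "verified"
```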
Turning a hallucination into a build-failure signal
Here's the part I care most about. If the auditor returns "this paragraph cites a case I cannot resolve", that result has to mean something. In a CI pipeline for code, a failing test stops the merge. The author can't ignore it without explicit override, and the override is logged. The same discipline applies to verifiable AI output in legal work.
What that looks like in practice:
- Every generated document has a manifest. The manifest lists every citation, every quoted passage, every numerical claim, with a status: verified, unresolved, contradicted, stale, or human-override.
- The document cannot be exported, sent, or filed unless the manifest is clean, or the overrides are signed by a named human with a reason recorded.
- The manifest travels with the document. The next person to touch it sees what was checked, what wasn't, and by whom.
- If a citation goes stale after export — a case is overturned next week — the manifest can be re-run and the document flagged for review.
This is the move. Not "AI assistant", not "human in the loop" as a slogan, but a build artifact with provenance, attached to a gate that will refuse to open. Engineers have lived with this discipline for thirty years. We call it a green build. The legal version is a green manifest.
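A minimal sketch of that manifest and its gate, assuming the statuses above and a plain JSON file that travels alongside the document; the field names are illustrative.

```python
import json
from datetime import datetime, timezone

ALLOWED = {"verified", "human-override"}

def write_manifest(path: str, entries: list[dict]) -> None:
    """entries holds one record per claim, e.g.
    {"claim": "...", "status": "verified"} or
    {"claim": "...", "status": "human-override",
     "signed_by": "j.murphy", "reason": "authority confirmed offline"}."""
    manifest = {"checked_at": datetime.now(timezone.utc).isoformat(),
                "entries": entries}
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)

def may_export(path: str) -> bool:
    """The gate: refuse export unless every entry is verified
    or carries a signed, reasoned override."""
    with open(path) as f:
        manifest = json.load(f)
    for entry in manifest["entries"]:
        if entry["status"] not in ALLOWED:
            return False
        if entry["status"] == "human-override" and not (
                entry.get("signed_by") and entry.get("reason")):
            return False
    return True
```

Re-running the check after export is then just regenerating the entries and diffing the statuses against the stored manifest.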
The AI compliance check as a separate concern
Citations are the loudest failure mode but they aren't the only one. An AI compliance check sits beside the citation auditor and asks a different set of questions. Does this output disclose what it must disclose? Does it avoid asserting things the firm's policy says it should not assert? Does it handle privileged material correctly? Does it carry the right jurisdictional framing?
I'd keep these as distinct agents with distinct verdicts. One pass for citations, one pass for compliance, one pass for tone-and-policy. Compose them. Don't merge them. When something goes wrong you want to know which gate threw the flag.
The instinct to bundle is strong because it feels efficient. It isn't. A monolithic auditor that returns "this looks bad" is the same problem you started with — an opaque verdict from an opaque process. Small auditors, narrow questions, plain answers.
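Composed, that can be as plain as a dictionary of narrow gates, each with the same tiny interface and each returning its own verdict; the auditor names below are placeholders.

```python
from typing import Callable

# Each auditor takes the draft and returns ("pass" | "fail" | "human-review", note).
Auditor = Callable[[str], tuple[str, str]]

def run_gates(draft: str, auditors: dict[str, Auditor]) -> dict[str, tuple[str, str]]:
    """Run every gate independently and keep the verdicts separate,
    so a failure names the gate that threw the flag."""
    return {name: auditor(draft) for name, auditor in auditors.items()}

# verdicts = run_gates(draft, {
#     "citations": citation_auditor,     # placeholder callables
#     "compliance": compliance_auditor,
#     "tone_policy": tone_auditor,
# })
# failed = [name for name, (verdict, _) in verdicts.items() if verdict != "pass"]
```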
What this costs, and why it's still cheaper
Running an auditor on every output is not free. You pay in latency, in retrieval costs, in engineering effort to keep the resolver mappings current as case databases and statute registers change their schemas. You also pay in the awkward conversation where someone says "I just want to send the email" and the gate says no.
The alternative cost is the one nobody likes to write down: one filed brief with a fake case in it, and the firm spends a quarter explaining itself. The arithmetic isn't close. The auditor pays for itself the first time it stops something. After that it's just operating expense, and the operating expense gets cheaper every quarter as the model providers improve and the resolver mappings stabilise.
Where this generalises
Legal is the cleanest example because the consequences are visible and the citations are structured, but the same architecture works anywhere a model produces claims that need to be true. Medical literature. Financial filings. Procurement. Anywhere there's a corpus of canonical sources and a definition of "supported by the source", you can stand up an auditor.
We've built a version of this thinking into the IMPT platform for a different purpose — every offset on a hotel booking has to resolve to an on-chain record, and a booking that can't produce that record is, for our purposes, a failed booking. Same shape. Generate, then prove. If the proof step fails, the whole thing fails. Carbon claims, legal claims, medical claims — the discipline is the same. Don't trust output you can't reach back through.
What to do this week
If you're running any AI workflow that produces claims a third party will rely on, do three things this week. First, list every category of claim your output makes: citations, numbers, quotes, dates, names. For each, write down what the canonical source would be and whether you can reach it programmatically. Second, pick the highest-stakes category and build the dumbest possible resolver for it: a script that takes the claim and tries to fetch the source (a sketch closes this piece). Don't try to match yet. Just measure the resolution rate. You'll be surprised, in both directions. Third, decide what the gate does when the resolver fails, and put that gate in front of the human who currently rubber-stamps the output. That last step is the one most teams skip, and it's the one that turns legal AI hallucination from a recurring incident into a build-failure signal that gets fixed once.
We're applying the same pattern across our own stack at IMPT and in the agent work behind the forthcoming booking flow. The auditor is boring. That's the feature.
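For step two, the dumbest possible resolver really is a short script. This sketch assumes a hypothetical HTTP lookup endpoint that returns 200 when a neutral citation resolves; point it at whatever index you actually trust.

```python
import csv
import urllib.error
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace with the case database or statute API you trust.
LOOKUP = "https://example-caselaw-index.invalid/lookup?cite="

def resolves(citation: str) -> bool:
    """Can we reach a canonical record for this citation at all?"""
    url = LOOKUP + urllib.parse.quote(citation)
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def resolution_rate(csv_path: str) -> float:
    """citations.csv: one citation per row, first column."""
    with open(csv_path) as f:
        cites = [row[0] for row in csv.reader(f) if row]
    hits = sum(resolves(c) for c in cites)
    print(f"{hits}/{len(cites)} citations resolved")
    return hits / len(cites) if cites else 0.0
```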