
How the auditor agent works — the technical detail


Most AI systems give you an answer. The auditor agent gives you the answer, the working, the source, and a second opinion that disagrees when it should. I built it because the first question every regulated firm asks isn't "is it accurate?" — it's "can I defend this in front of a regulator, a partner, or a client?" Those are different problems. Accuracy is a statistical property. Defensibility is an evidential one. This article walks through how the auditor agent in the Intelligence Brain actually works, what it checks, and why the architecture looks the way it does.

What the auditor agent actually is

The auditor agent is not a model. It's a deterministic process that wraps around the model layer and refuses to let an answer leave the system until it has been checked against source material, internal consistency rules, and a separate adversarial pass. The generating model produces a draft response. The auditor agent then takes that draft and treats it as a hypothesis to be falsified.

People sometimes call this "verifiable AI" or describe it as a guardrail. I avoid both terms. Guardrails imply you're stopping the model from doing bad things. The auditor isn't doing that — it's assuming the model has produced something plausible-but-wrong and trying to prove it. That framing matters because it changes the default. The default is "this answer does not ship." It only ships when the auditor can attach a verifiable trail to it.

Concretely, the auditor agent runs four passes against every draft answer: source attribution, internal consistency, adversarial challenge, and confidence calibration. If any pass fails, the answer is either rewritten, downgraded with explicit caveats, or refused outright with a reason the user can read.
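Before getting into the passes themselves, it may help to see the shape of the control flow. What follows is a minimal sketch, not the production code: every function name and data shape is a hypothetical stand-in for the checks described in the sections below.

```python
from dataclasses import dataclass, field

@dataclass
class AuditOutcome:
    verdict: str                      # "ship" | "ship_with_caveats" | "refuse"
    caveats: list = field(default_factory=list)

# Placeholder pass functions. In the real pipeline each is the set of checks
# described below; here they only illustrate the contract between passes.
def source_attribution_pass(draft, context):
    return []                         # caveats about ungrounded claims

def consistency_pass(draft):
    return []                         # arithmetic and contradiction caveats

def adversarial_pass(draft, context):
    return []                         # objections that survived evaluation

def calibration_pass(caveats, objections):
    if objections:
        return "refuse"
    return "ship_with_caveats" if caveats else "ship"

def audit(draft: str, context: list) -> AuditOutcome:
    caveats = source_attribution_pass(draft, context)
    caveats += consistency_pass(draft)
    objections = adversarial_pass(draft, context)
    return AuditOutcome(calibration_pass(caveats, objections), caveats + objections)
```

The point of the skeleton is the default it encodes: an answer only reaches "ship" by passing every gate, and every other outcome carries a readable reason.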

Pass one: source attribution and grounding

Every factual claim in a draft answer has to be tied to a chunk of source material that the system actually retrieved. This is harder than it sounds. Models are good at producing text that looks like it came from a document but didn't. The auditor agent does the boring work of taking each claim, locating the supporting passage, and confirming the claim is a defensible reading of that passage rather than a paraphrase that quietly drifted.

Mechanically: the draft is decomposed into atomic claims. Each claim is matched against the retrieved context using a combination of embedding similarity and a separate entailment check. Embedding similarity tells you the passage is on-topic. Entailment tells you the passage actually supports the specific claim. Both are required. If a claim has high similarity but fails entailment, that's a hallucination signature — the model wrote something that sounds like the source without being supported by it.
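As a rough illustration of the two-signal check, here is a toy sketch. The embedding and entailment functions are deliberately crude stand-ins (a bag-of-words vector and a word-overlap test); a real system would call a proper embedding model and an NLI-style entailment model in their place.

```python
import math
from collections import Counter

def toy_embedding(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def entails(passage: str, claim: str) -> bool:
    # Stand-in for a proper entailment check. A real system would ask a
    # dedicated model whether the passage supports the specific claim.
    return all(word in passage.lower() for word in claim.lower().split())

def ground_claims(claims: list, passages: list, sim_threshold: float = 0.3) -> list:
    results = []
    for claim in claims:
        scored = [(cosine(toy_embedding(p), toy_embedding(claim)), p) for p in passages]
        score, best = max(scored)
        on_topic = score >= sim_threshold
        supported = on_topic and entails(best, claim)
        # On-topic but not entailed is the hallucination signature described above.
        results.append({"claim": claim, "source": best,
                        "on_topic": on_topic, "supported": supported})
    return results
```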

Claims that can't be grounded get one of three treatments. They're removed, they're flagged as unsupported in the final output, or — if the claim is load-bearing for the whole answer — the answer is rejected and regenerated with stricter retrieval. The user sees which claims are grounded and which aren't. That's not a UX nicety; it's the audit trail.
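The routing logic is simple once the grounding results exist. A sketch follows, reusing the result shape from the grounding sketch above; which of the two non-fatal treatments applies to a given claim is a policy choice, keyed off the on-topic signal here purely for illustration, and "load_bearing" is an assumed input naming the claims the answer cannot stand without.

```python
def route_claim(result: dict, load_bearing: set) -> str:
    # One of the three treatments per claim, given a grounding result.
    if result["supported"]:
        return "keep"
    if result["claim"] in load_bearing:
        return "regenerate_answer"        # whole answer redone with stricter retrieval
    if result["on_topic"]:
        return "flag_as_unsupported"      # stays in the output, visibly caveated
    return "remove"                       # off-topic and unsupported: cut it
```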

Pass two: internal consistency and arithmetic

The second pass checks the answer against itself and against any structured data the model touched. This catches a different class of failure. A model can produce a paragraph where every sentence is individually defensible but which contradicts itself across paragraphs. It can also produce numbers that don't add up — a subtotal that doesn't match its line items, a date range that's the wrong way round, a percentage that exceeds 100.

The consistency pass is rule-driven. For numerical content, the auditor extracts every number and the relationship it claims to have with other numbers, then evaluates those relationships symbolically. If the answer says "three line items totalling X," the auditor adds the line items and checks. If the answer references a date, the auditor checks that date is consistent with every other temporal reference in the same response.
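The verification half of this pass is plain arithmetic once the numbers and their claimed relationships have been extracted (the extraction step itself is not shown here). A sketch of the kind of re-derivation involved:

```python
from datetime import date

def subtotal_matches(line_items: list, stated_total: float, tolerance: float = 0.01) -> bool:
    # Re-derive the relationship the answer asserts rather than trusting it.
    return abs(sum(line_items) - stated_total) <= tolerance

def date_range_ordered(start: date, end: date) -> bool:
    # Catches ranges stated the wrong way round.
    return start <= end

# "Three line items totalling X" gets re-added and compared:
assert subtotal_matches([1200.00, 350.50, 49.50], 1600.00)
# A range running backwards fails the temporal check:
assert not date_range_ordered(date(2025, 6, 1), date(2024, 6, 1))
```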

For non-numerical content, the auditor looks for contradictions between paragraphs using a separate model run that's only asked one question: does paragraph N contradict any earlier paragraph? This is cheap and surprisingly effective. Models are bad at contradicting themselves on purpose but good at spotting contradictions when that's the only thing they're being asked to do.
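The scan itself is just a pairwise loop; the narrow yes/no model call is what keeps it cheap. A sketch, with the model call injected as a function so the structure is visible on its own:

```python
from typing import Callable, List, Tuple

def find_contradictions(paragraphs: List[str],
                        contradicts: Callable[[str, str], bool]) -> List[Tuple[int, int]]:
    # `contradicts(earlier, later)` is, in production, a single-purpose model
    # call asked only: does the later paragraph contradict the earlier one?
    hits = []
    for n in range(1, len(paragraphs)):
        for m in range(n):
            if contradicts(paragraphs[m], paragraphs[n]):
                hits.append((m, n))
    return hits
```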

Pass three: the adversarial challenger

This is the part most people don't expect. After the first two passes, the auditor invokes a separate model instance with a different system prompt and one job: find what's wrong with this answer. The challenger doesn't see the original question framed the way the user framed it. It sees the draft answer and is asked to argue against it.

The challenger produces a list of objections. Some are weak — stylistic complaints, missing context that wasn't actually requested. The auditor filters those out. The remaining objections are substantive: a missed exception, a jurisdictional caveat, an alternative interpretation of the source material, a question the original answer should have asked back. Each surviving objection is then evaluated against the same source material the original answer used.
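A sketch of how the challenger might be framed and its output filtered. The prompt wording, the objection categories, and the shape of the objection records are all assumptions for illustration, not the production values.

```python
CHALLENGER_PROMPT = (
    "You are reviewing a draft answer. You have not seen how the user framed "
    "their question. Your only job is to argue against this answer: list every "
    "way it could be wrong, incomplete, or misleading, one objection per line."
)

SUBSTANTIVE_CATEGORIES = {
    "missed_exception", "jurisdictional_caveat",
    "alternative_reading", "unasked_question",
}

def filter_objections(objections: list) -> list:
    # Stylistic complaints and demands for context nobody asked for are dropped;
    # everything else goes forward to be checked against the source material.
    return [o for o in objections if o.get("category") in SUBSTANTIVE_CATEGORIES]
```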

If the challenger's objection is supported by sources the original answer ignored or misread, the original answer is wrong or incomplete and gets rewritten. If the objection is not supported, it's discarded but logged. The log matters — over time it tells you what the system tends to get wrong, which is the most valuable diagnostic signal you can have.
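Adjudication reuses the same grounding machinery as pass one, pointed at the objection instead of the draft, and everything that fails still gets written down. A sketch with hypothetical names:

```python
import json
import time

def adjudicate(objections: list, supported_by_sources,
               log_path: str = "objection_log.jsonl") -> list:
    # `supported_by_sources(objection)` would run the pass-one grounding checks
    # against the objection. Upheld objections force a rewrite of the draft;
    # the rest are discarded but logged, because the log is the diagnostic signal.
    upheld = []
    with open(log_path, "a") as log:
        for objection in objections:
            if supported_by_sources(objection):
                upheld.append(objection)
            else:
                log.write(json.dumps({"ts": time.time(), "discarded": objection}) + "\n")
    return upheld
```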

This adversarial step is expensive. It roughly doubles the inference cost of an answer. I think that's the right trade for the kind of work the Intelligence Brain methodology is designed for. If your answer is going into a file note, an advice letter, or a board pack, doubling inference cost to halve your error rate is not a difficult decision.

Pass four: confidence calibration and the refusal path

The final pass decides whether the answer ships, ships with caveats, or doesn't ship at all. Confidence here is not a single number from the model. It's a composite of how much of the answer was grounded in pass one, how cleanly it came through pass two, and how many surviving objections came out of pass three.

The composite gets mapped to one of four states. Green: ships as-is with the source trail attached. Amber: ships with explicit caveats inserted into the body of the answer, not as a footnote. Red: doesn't ship; the user gets a reason and, where possible, what would need to be true for an answer to be possible. Black: the question is out of scope or the system has detected it shouldn't be answering at all — for instance, a question that requires professional judgement the system isn't licensed to provide.
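A sketch of the mapping, with made-up thresholds; the exact composite and the cut-offs are tuned per deployment rather than fixed like this.

```python
def traffic_light(grounded_fraction: float, consistency_failures: int,
                  surviving_objections: int, out_of_scope: bool) -> str:
    # Composite of the three earlier passes mapped onto the four states.
    if out_of_scope:
        return "black"   # the system should not be answering this at all
    if surviving_objections > 0 or grounded_fraction < 0.5:
        return "red"     # does not ship; the user gets a specific reason
    if consistency_failures > 0 or grounded_fraction < 0.9:
        return "amber"   # ships with caveats written into the body
    return "green"       # ships as-is with the source trail attached
```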

The refusal path is the part I'm proudest of and the part that took longest to get right. A system that confidently refuses to answer when it shouldn't is more valuable than a system that confidently answers everything. Refusal has to be specific — "I can't answer this because the source material doesn't cover the jurisdiction you've specified" is useful. "I can't answer this" on its own is not.

Why this architecture, and what it costs

You could build a single-pass system that's faster and cheaper. Most do. The reason I didn't is that the failure modes of single-pass systems are exactly the failure modes that hurt regulated firms: confident hallucination, silent contradiction, and unattributed claims that can't be defended after the fact. Those are not statistical edge cases for a law firm or an accountancy practice — they're the cases that end up in front of a regulator.

The cost is real. The auditor agent makes the system slower and more expensive per query than a naive RAG pipeline. In exchange, every answer carries its own audit trail, every refusal is specific, and every objection raised by the challenger gets logged for review. That log becomes a feedback loop into how the source material is structured, how retrieval is tuned, and where the system needs more guidance. It also becomes the document you hand a regulator when they ask how the system makes decisions.

The auditor agent is what makes the rest of the Intelligence Brain defensible rather than just useful. Useful is table stakes now. Defensible is the harder problem and the one worth solving.

Where to start

If you're evaluating any AI system for regulated work this week, do one thing: ask it a question you know the answer to, then ask it to show you which source supports each claim in its response. If it can't, or if the sources it cites don't actually say what it claims they say, you've found your problem before you've signed anything. That five-minute test tells you more about whether a system is fit for professional use than any vendor demo will.

Book a 30-minute assessment

Direct with Michael. No charge. No pitch deck.

Pick a slot →