Clinical letter drafting with an intelligence brain — the audit trail

A GP in Clonmel dictates twelve referral letters between morning surgery and lunch. By Friday it's sixty. The letters are clinically fine but the audit story behind them — who saw what, which version went out, what was redacted, what the LLM actually generated versus what the doctor wrote — is usually a black hole. That's the part that matters when a complaint lands two years later, or when the Medical Council asks how a piece of AI-generated text ended up in a patient record. This article is about how to draft clinical letters with an on-premise intelligence brain in a way that produces an audit trail you'd actually want to hand to a regulator.

Why clinical letter drafting is the wrong place to start with cloud AI

Most of the clinical letter AI tools I've seen demoed in Ireland are wrappers around a hosted American model. The pitch is fast: paste the consultation note, get a referral letter, copy it into Socrates or Health One. It works. It also creates four problems at once.

First, patient-identifiable data leaves the practice. Even with a Data Processing Agreement, the operational reality is that the prompt and the response sit on infrastructure you don't control, and the model provider's logging behaviour is not something a single-handed GP can meaningfully audit. Second, there's no stable record of which model produced the letter. Models get silently updated. The letter you sent in March wasn't drafted by the same system that exists today. Third, the redaction step — if it happens at all — happens client-side in a browser tab, with no log. Fourth, the doctor's edits are not captured as a diff. You see the final letter; you don't see what the AI proposed and what the clinician corrected. That last gap is the one that bites in a Medical Council inquiry.

An on-premise intelligence brain solves these by inversion: the model runs inside the practice network, every prompt and response is logged immutably, redaction is a server-side step with its own audit entry, and the human edits are captured as a structured diff against the model output.

What the audit trail actually needs to contain

Forget the marketing definition of "audit trail". For clinical correspondence in Ireland, under GDPR Article 30, the Medical Council's Guide to Professional Conduct and Ethics, and the practical reality of HSE complaint handling, you need to be able to reconstruct the following for any letter, years after it went out:

  • The exact source material the model saw — consultation note, problem list, medication list, any uploaded correspondence.
  • Which fields were redacted or pseudonymised before the model saw them, and which redaction ruleset version was applied.
  • The model identifier, including weights hash or version tag, not just "GPT-4" or "Llama 3".
  • The full prompt, including the system prompt and any retrieved context from the practice's own document store.
  • The raw model output, before any human edit.
  • The clinician's edits as a diff, with timestamps and the user identity.
  • The final letter as sent, with a hash, and the delivery channel — Healthlink, paper, secure email.
  • Any subsequent amendments and the reason recorded.

That's the minimum. Most cloud tools give you items one and seven. An on-premise brain gives you all eight by default because the logging is not optional — it's how the system is built.
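
To make that concrete, here is a minimal sketch of what one per-letter bundle might look like as a data structure. The field names are illustrative rather than a fixed schema; the point is that all eight items live in a single record.

```python
from dataclasses import dataclass, field


@dataclass
class LetterAuditBundle:
    """One sealed record per letter, covering the eight items listed above."""
    source_documents: list[dict]    # consultation note, problem list, meds, correspondence
    redaction_events: list[dict]    # which rules fired, on which spans, plus ruleset version
    model_id: str                   # version tag and weights hash, not just a model family name
    full_prompt: str                # system prompt, retrieved context, redacted sources
    raw_output: str                 # model text before any human edit
    clinician_diff: list[str]       # unified diff with editor identity and timestamps
    final_letter: str
    final_letter_sha256: str
    dispatch: dict                  # channel (Healthlink, paper, secure email), time, recipient
    amendments: list[dict] = field(default_factory=list)   # later changes and recorded reasons
```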

The drafting pipeline, step by step

Here's how a referral letter actually flows through the medical intelligence brain in a single-handed or small group practice. I'll use a concrete example: a forty-eight-year-old woman with new-onset atrial fibrillation being referred to cardiology.

Step 1 — Source assembly. The PMS (Socrates, Health One, Helix) exposes the consultation note, problem list, medications, and recent investigations through either a structured export or a watched folder. The brain ingests these into a per-patient context window. Nothing leaves the practice LAN.
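
A rough sketch of the watched-folder variant, assuming the PMS drops one JSON export per document into a directory on the practice LAN. The path and file-naming convention here are invented; the real export format depends on the PMS.

```python
import json
from pathlib import Path

EXPORT_DIR = Path("/srv/brain/pms-export")   # hypothetical path on the practice LAN


def assemble_patient_context(patient_id: str) -> dict:
    """Collect every exported document for one patient into a single context bundle.

    Assumes the PMS writes JSON files named <patient_id>__<doc_type>.json;
    the actual layout will vary between Socrates, Health One and Helix.
    """
    context = {"patient_id": patient_id, "documents": []}
    for path in sorted(EXPORT_DIR.glob(f"{patient_id}__*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        context["documents"].append({
            "source_file": path.name,
            "doc_type": path.stem.split("__", 1)[1],
            "content": doc,
        })
    return context
```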

Step 2 — Redaction and pseudonymisation. A deterministic ruleset replaces the patient's name, address, MRN, and PPS number with stable tokens — PATIENT_001, ADDR_001 — before the text reaches the language model. The mapping is held in a separate encrypted store. The redaction log records which rules fired and on which spans. This matters: if a rule misfires and a name leaks into the prompt, you want to know.
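
A minimal sketch of how that deterministic step might look. The two patterns below are illustrative only; a real ruleset would be far more thorough and would lean on the structured patient record rather than regexes alone.

```python
import re

RULESET_VERSION = "redact-rules-0.3"          # illustrative version tag
RULES = [
    ("PPS", re.compile(r"\b\d{7}[A-Z]{1,2}\b")),       # PPS number shape
    ("MRN", re.compile(r"\bMRN[:\s]*\d{5,}\b")),
]


def redact(text: str, known_identifiers: dict[str, str]) -> tuple[str, dict, list[dict]]:
    """Replace identifiers with stable tokens; return redacted text, token map and redaction log."""
    token_map: dict[str, str] = {}
    log: list[dict] = []

    # Identifiers taken from the structured patient record (name, address) go first.
    for token, value in known_identifiers.items():        # e.g. {"PATIENT_001": "Mary Byrne"}
        if value in text:
            log.append({"rule": "known_identifier", "token": token, "ruleset": RULESET_VERSION})
            text = text.replace(value, token)
            token_map[token] = value

    # Pattern rules catch identifiers the structured record didn't list.
    for name, pattern in RULES:
        counter = 0

        def replace(match: re.Match) -> str:
            nonlocal counter
            counter += 1
            token = f"{name}_{counter:03d}"
            token_map[token] = match.group()
            # Span positions are relative to the text at the moment the rule fired.
            log.append({"rule": name, "token": token,
                        "span": [match.start(), match.end()], "ruleset": RULESET_VERSION})
            return token

        text = pattern.sub(replace, text)

    return text, token_map, log
```

The token map goes to the separate encrypted store; the log goes into the letter's audit bundle.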

Step 3 — Retrieval. The brain pulls relevant practice-specific context: the GP's preferred referral template for cardiology, any standing instructions from the consultant ("please include TSH and U&E"), and the practice's house style. This is retrieval-augmented generation against a local vector index, not an internet search.
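
The sketch below uses plain token overlap in place of a real local embedding model and vector store, purely to show the shape of the step; the document names are invented.

```python
import re
from collections import Counter

# Stand-in for a local index: in production this would be embeddings in a
# local vector store on the practice server, never an internet call.
LOCAL_DOCUMENTS = {
    "cardiology_referral_template_v4": "Referral template: clinical question, history, examination, ECG, medications ...",
    "cardiology_standing_instructions": "Consultant requests TSH, U&E and a recent ECG with all AF referrals ...",
    "practice_house_style": "Letters open with the clinical question; one page maximum ...",
}


def tokenise(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Rank practice documents by token overlap with the query and return the top k."""
    query_tokens = tokenise(query)
    scored = []
    for name, body in LOCAL_DOCUMENTS.items():
        doc_tokens = tokenise(body)
        overlap = sum((query_tokens & doc_tokens).values())
        scored.append((name, overlap / (1 + sum(doc_tokens.values()))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```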

Step 4 — Generation. The local model — typically a quantised open-weight model in the seven-to-seventy billion parameter range, depending on hardware — produces a draft letter. The full prompt and raw output are written to an append-only log with a content hash.
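
Something like the following, with generate_fn standing in for whatever local serving stack the practice runs; the log path and model tag are illustrative.

```python
import hashlib
import json
import time
from pathlib import Path
from typing import Callable

AUDIT_LOG = Path("/srv/brain/audit/generation.jsonl")   # hypothetical append-only log


def generate_and_log(prompt: str, model_tag: str, weights_sha256: str,
                     generate_fn: Callable[[str], str]) -> str:
    """Run the local model and append an immutable record of prompt and raw output."""
    raw_output = generate_fn(prompt)
    entry = {
        "timestamp": time.time(),
        "model_tag": model_tag,               # e.g. "letters-v1 (frozen)"
        "weights_sha256": weights_sha256,     # hash of the pinned weights file
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_sha256": hashlib.sha256(raw_output.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "raw_output": raw_output,
    }
    with AUDIT_LOG.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")   # append only; entries are never rewritten
    return raw_output
```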

Step 5 — Rehydration. The pseudonym tokens are replaced with the real patient details. This happens after generation, so the model never sees the identifiers. The rehydration step is itself logged.
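
A short sketch, assuming the token map from the redaction step is available from the encrypted store:

```python
def rehydrate(draft: str, token_map: dict[str, str], log: list[dict]) -> str:
    """Swap pseudonym tokens back for real patient details after generation.

    The model only ever saw the tokens; the real values come from the
    separate encrypted store produced at the redaction step.
    """
    for token, value in token_map.items():
        if token in draft:
            log.append({"step": "rehydration", "token": token})   # the value itself is not logged
        draft = draft.replace(token, value)
    return draft
```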

Step 6 — Clinician review. The GP sees the draft in their normal interface, edits as needed, and approves. The diff between raw output and final text is captured. This is the single most important audit artefact, because it shows the doctor exercised clinical judgement rather than rubber-stamping.
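
Capturing that diff is a few lines with the standard library; the record structure below is illustrative.

```python
import difflib
import time


def capture_review_diff(raw_output: str, final_text: str, clinician_id: str) -> dict:
    """Record the clinician's edits as a unified diff against the raw model output."""
    diff = list(difflib.unified_diff(
        raw_output.splitlines(), final_text.splitlines(),
        fromfile="model_draft", tofile="clinician_final", lineterm="",
    ))
    return {
        "clinician_id": clinician_id,
        "reviewed_at": time.time(),
        "edited": bool(diff),      # an empty diff, i.e. no edits at all, is itself worth knowing
        "diff": diff,
    }
```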

Step 7 — Dispatch and sealing. The final letter is hashed, the hash is recorded with the dispatch metadata, and the letter goes out via Healthlink or whatever channel the practice uses. The whole bundle — sources, redaction log, prompt, raw output, diff, final, hash — is sealed as one immutable record.
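
A sketch of the sealing step, assuming the bundle has already been assembled as a plain dictionary; the storage path is invented.

```python
import hashlib
import json
from pathlib import Path

BUNDLE_DIR = Path("/srv/brain/audit/bundles")    # hypothetical sealed-record store


def seal_bundle(letter_id: str, bundle: dict) -> str:
    """Write the full per-letter bundle once, with its hash embedded in the filename.

    `bundle` holds sources, redaction log, prompt, raw output, clinician diff,
    final letter and dispatch metadata.
    """
    canonical = json.dumps(bundle, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(canonical).hexdigest()
    path = BUNDLE_DIR / f"{letter_id}.{digest[:16]}.json"
    path.write_bytes(canonical)                  # written once; any later change breaks the hash
    return digest
```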

Hardware and model choices that don't embarrass you in two years

The temptation is to run the smallest model that produces readable letters. Resist it. Clinical correspondence has a low tolerance for hallucination, and small models invent dosages, mis-state laterality, and confabulate investigation results with cheerful confidence. For a practice of one to five GPs, a single workstation with a recent consumer GPU running a mid-sized instruction-tuned open-weight model is the sensible floor. For group practices and primary care centres, a small server with two GPUs gives you headroom and lets you keep a frozen "letter model" alongside an experimental one.

Frozen is the operative word. Whatever model you use for clinical letters, pin the version. Hash the weights. Record the hash in every audit entry. When you upgrade — and you will — keep the old weights available for replay, because regenerating a letter exactly as it was produced is part of the audit story. The brain should support this natively: every record points to the model artefact that produced it.
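
Hashing the weights file is trivial and worth doing explicitly at deployment; a sketch, with an invented weights path:

```python
import hashlib
from pathlib import Path


def hash_weights(weights_path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of the pinned model weights file, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with weights_path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


# Computed once when the model is deployed and referenced in every audit entry, e.g.:
# hash_weights(Path("/srv/brain/models/letters-v1.gguf"))   # path is illustrative
```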

Avoid models with opaque training data provenance for this use case. If you can't say roughly what the model was trained on, you can't reason about its failure modes. There are several open-weight medical-tuned models with reasonable documentation; pick one and stick with it.

What goes wrong, and how the brain catches it

A few failure modes I've watched cause real problems, and how a properly logged on-premise system catches them.

The wrong patient context. Two patients with similar names; the PMS export pulls the wrong note. With a logged source-assembly step, the audit shows exactly which note went into the prompt, and a mismatch between the patient ID on the letter and the patient ID on the source note is detectable automatically.
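
That check is cheap enough to run on every letter; a sketch, assuming each ingested document carries the patient identifier it was exported under (the field names are illustrative):

```python
def check_patient_consistency(letter_patient_id: str, source_documents: list[dict]) -> list[str]:
    """Flag any source document whose patient ID differs from the letter's patient ID."""
    mismatches = [
        doc.get("source_file", "unknown")
        for doc in source_documents
        if doc.get("patient_id") != letter_patient_id
    ]
    return mismatches   # anything in this list blocks dispatch until a human looks at it
```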

Hallucinated investigations. The model writes "ECG showed atrial fibrillation with rapid ventricular response" when the source note only says "AF on examination". A grounding check — does every clinical claim in the output trace back to a span in the source? — runs as a post-generation step and flags unsupported claims for the clinician's attention.
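
A naive version of that grounding check, using word overlap as a crude stand-in for proper span attribution or entailment; the threshold is illustrative.

```python
import re


def flag_unsupported_claims(draft: str, source_text: str, min_overlap: float = 0.6) -> list[str]:
    """Flag draft sentences whose content words barely appear in the source note.

    Crude, but enough to catch "ECG showed atrial fibrillation with rapid
    ventricular response" when the note only says "AF on examination".
    """
    source_words = set(re.findall(r"[a-z]+", source_text.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", draft):
        content_words = [w for w in re.findall(r"[a-z]+", sentence.lower()) if len(w) > 3]
        if not content_words:
            continue
        overlap = sum(w in source_words for w in content_words) / len(content_words)
        if overlap < min_overlap:
            flagged.append(sentence)   # surfaced to the clinician as "not supported by the note"
    return flagged
```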

Silent template drift. A practice's referral templates evolve. Without versioning, you can't tell which template version was applied to a letter from eighteen months ago. The brain stores templates with version tags and references the tag in every record.

Edit-then-forget. The clinician makes a substantive edit — changes a dose, removes a clinical claim — but doesn't record why. The diff capture forces the edit into the record even if the reason isn't supplied; the absence of a reason is itself information.

How this maps to GDPR and Medical Council expectations

GDPR Article 30 wants a record of processing activities. The brain's per-letter bundle is, in effect, a record of processing for that one letter — purpose, lawful basis (Article 9(2)(h) for healthcare), data categories, retention, recipients. Article 22 concerns about automated decision-making are addressed by the explicit human-in-the-loop diff: the clinician's edits and approval are the decision; the model is a drafting aid.

The Medical Council's expectation that records be contemporaneous, accurate, and clear is met more rigorously by a logged AI-assisted workflow than by dictation-and-typist, because the dictation pipeline rarely captures intermediate states. The HSE's emerging guidance on AI in clinical settings — still in flux — consistently emphasises explainability and human oversight, both of which fall out of this architecture for free.

If you want the broader picture of how this same logging discipline applies across legal, accounting, and other regulated work, the intelligence brain overview covers the common architecture.

Where to start this week

Pick one letter type — cardiology referrals, or discharge summary acknowledgements, or insurance reports. Map the current workflow on a single page: where the source data lives, who touches it, what gets sent, what gets retained. Then ask: for a letter you sent six months ago, can you reconstruct the source material, the draft, the edits, and the final letter? If the answer is no, that's the gap.

Book a 30-minute assessment

Direct with Michael. No charge. No pitch deck.

Pick a slot →