Building a knowledge brain from a 25-year practice archive

2026-05-02 · By Michael English

A practice archive is not a library. It is a sediment. Twenty-five years of matters, memos, schedules, exhibits, draft clauses, file notes, marked-up redlines and the quiet correspondence that never made it into the bundle. Most of it sits in folders that were named by whoever was on the keyboard at the time. The instinct, when the first wave of useful language models arrived, was to point a retrieval system at the lot and ask it questions. That instinct is wrong, and it is wrong in ways that take a while to surface. What follows is how I would actually go about building a knowledge brain from that kind of archive — the ingestion pipeline, the structure layer, the embedding strategy, and, before any of it is allowed to run, the de-personalisation that has to happen first.

Why the obvious approach fails

The obvious approach is: dump every document into a vector store, slap a chat interface on top, and call it a knowledge brain. People do this. It demos well for about ten minutes. Then somebody asks a question that touches a settled matter, and the system cheerfully cites a draft that was never executed. Or it surfaces a privileged note from a file that was meant to be locked. Or it confidently merges the facts of two clients with similar surnames. The failure mode is not that the model hallucinates — it is that the archive itself is full of half-truths, superseded versions, and personal data that should never have left the original folder.

A practice archive AI worth using has to treat the corpus as a hostile input. Not because anyone in the firm is hostile, but because the corpus was assembled, over decades, for a different purpose: to win or close individual matters, not to be queried in aggregate. Every assumption a retrieval system wants to make about the corpus — that documents are final, that names are consistent, that dates are reliable, that the most recent file is the authoritative one — is wrong often enough to matter.

The ingestion pipeline

Ingestion is the part everyone underestimates. It is also the part that decides whether the rest works.

I break it into four stages. First, capture: pulling files out of whatever they live in — a document management system, a shared drive, an email archive, a box of scanned PDFs that someone's secretary digitised in 2009. Second, normalise: turning everything into a common text representation, which means OCR for the scans, format conversion for the older Word and WordPerfect files, and sane handling of email threads so a forty-message chain becomes a single object with a clear order. Third, classify: deciding what kind of document each item is. Pleading, contract, file note, correspondence, draft, executed version, exhibit. Fourth, version: linking sibling documents so the system knows that draft 4 and draft 7 of the same agreement are related.
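As a shape, the pipeline is simple enough to sketch. Everything below is a stub of my own naming, not any particular toolkit, and each stage hides a substantial project:

```python
# A minimal sketch of the four-stage shape, with the hard parts stubbed
# out. Every name here is an assumption of mine, not a library's.
from dataclasses import dataclass, field

@dataclass
class Doc:
    path: str
    text: str = ""
    kind: str = "unknown"                         # pleading, contract, file note, ...
    siblings: list = field(default_factory=list)  # other drafts of the same instrument

def capture(paths):
    # Stage 1: pull raw files out of the DMS, shared drive, or scan boxes.
    return [Doc(path=p) for p in paths]

def normalise(doc):
    # Stage 2: OCR scans, convert legacy formats, flatten email threads
    # into one ordered object. Stubbed here.
    doc.text = f"(normalised text of {doc.path})"
    return doc

def classify(doc):
    # Stage 3: decide document type from content, never from the filename.
    doc.kind = "unknown"  # a real content classifier goes here; see below
    return doc

def link_versions(docs):
    # Stage 4: group sibling drafts, e.g. by near-duplicate detection
    # on the normalised text.
    return docs

corpus = link_versions([classify(normalise(d)) for d in capture(["smith/final.docx"])])
```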

The classification step is where most teams quietly cheat. They use filename heuristics. The filenames lie. A document called final.docx is almost never final, and draft_v2_MK_edits.docx might in fact be the version that was signed. The only reliable approach is to classify on content, not on metadata, and to do it with a model that has been shown enough examples from your own archive to recognise the house style.
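A deliberately boring baseline makes the point. The sketch below assumes a few hundred hand-labelled examples from your own archive; TF-IDF with logistic regression stands in for whatever model you would actually use, and the training texts are invented:

```python
# A content-based classifier baseline. The input is the document TEXT,
# never the filename; the examples are invented stand-ins for the
# hand-labelled set you would build from your own archive.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "THIS AGREEMENT is made the 4th day of ...",
    "SHARE PURCHASE AGREEMENT between ...",
    "Attendance note of telephone call with client re ...",
    "Note of meeting with counsel, 14 June ...",
]
labels = ["contract", "contract", "file_note", "file_note"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["Attendance note of call with the other side's solicitor"]))
```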

De-personalisation has to happen before retrieval

This is the part I will not let anyone skip. Before a single document goes into an embedding model, before a single chunk hits a vector store, the personal data has to come out — or be tagged in a way that lets retrieval respect access boundaries.

I think about it in three layers. The first is direct identifiers: names, addresses, PPS numbers, dates of birth, account numbers, registration numbers. These are mechanical to find with a combination of pattern matching and a named-entity model, and mechanical to replace with stable tokens that point back to a separate, access-controlled identity store. The second is quasi-identifiers: the small facts that, taken together, point at a single person even when no name is present. A village, a profession, a court date, a medical condition. These are harder. You catch them by running a re-identification probe against your own redacted output and asking whether you can put the person back together. If you can, you have not redacted enough. The third is relational identifiers: who appears alongside whom, which solicitor handled which matter, which counsel was instructed. These leak through citations, signature blocks and footers, and they have to be tokenised with the same discipline as the names themselves.
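For the first, mechanical layer, the shape is roughly this. The patterns below match only the rough shape of Irish PPS numbers and IBANs, and the token scheme is a toy; a real system keys each token on the resolved identity rather than a running counter:

```python
# The mechanical layer only: structured identifiers found by pattern,
# replaced with stable tokens that map back to a separate,
# access-controlled identity store.
import re

PATTERNS = {
    "PPS": re.compile(r"\b\d{7}[A-W][A-IW]?\b"),
    "IBAN": re.compile(r"\bIE\d{2}[A-Z]{4}\d{14}\b"),
}

identity_store = {}  # token -> original value; lives behind access control

def tokenise(text):
    for kind, pattern in PATTERNS.items():
        for match in set(pattern.findall(text)):
            token = f"[{kind}_{len(identity_store):04d}]"
            identity_store[token] = match
            text = text.replace(match, token)
    return text

print(tokenise("Client PPS 1234567A, account IE29AIBK93115212345678."))
```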

The point of doing this before retrieval, rather than after, is simple. Once a chunk is embedded, the personal data is mathematically baked into the vector. You cannot redact a vector. You can only refuse to return it, and refusal at query time is a permission system, not a privacy guarantee. If you want a knowledge brain you can actually defend, the embeddings have to be built from text that is already safe to embed. Everything else is a workaround.

The structure layer

A pile of de-personalised documents is still a pile. The structure layer is what turns it into something a retrieval system can reason over.

I build it as a graph, not a tree. Documents are nodes. Matters are nodes. Clauses, where they are reusable, are nodes. Edges carry meaning: supersedes, cites, was-executed-in, is-draft-of, responds-to. When somebody asks the system a question, the graph is what stops the answer from being a soup of plausible-sounding fragments. The graph is what lets the system say: this clause appears in seventeen executed agreements, the most recent of which was last year, and here is the line of drafts that led to its current form.
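A toy slice of that graph, using networkx for brevity. The node ids and edge labels are illustrative rather than a fixed schema:

```python
# A toy slice of the structure layer. The real graph carries richer
# attributes: dates, tokenised parties, sensitivity tags.
import networkx as nx

g = nx.DiGraph()
for node, kind in [("matter:2019-044", "matter"),
                   ("doc:spa_draft_4", "draft"),
                   ("doc:spa_draft_7", "draft"),
                   ("doc:spa_executed", "executed")]:
    g.add_node(node, kind=kind)

g.add_edge("doc:spa_draft_7", "doc:spa_draft_4", rel="supersedes")
g.add_edge("doc:spa_draft_7", "doc:spa_executed", rel="is-draft-of")
g.add_edge("doc:spa_executed", "matter:2019-044", rel="was-executed-in")

# The lineage question: which drafts led to the executed version?
lineage = ["doc:spa_executed"]
preds = [u for u, _, d in g.in_edges("doc:spa_executed", data=True)
         if d["rel"] == "is-draft-of"]
while preds:
    cur = preds[0]
    lineage.append(cur)
    preds = [v for _, v, d in g.out_edges(cur, data=True)
             if d["rel"] == "supersedes"]
print(" <- ".join(lineage))
```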

The graph also carries the access rules. Every node has a sensitivity tag. Every edge respects it. Retrieval that crosses a boundary either fails or returns a redacted summary, depending on the user. None of this is glamorous engineering. It is plumbing. It is also the difference between a system you can put in front of a partner and a system you cannot.
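The boundary check itself is small. The sketch below assumes a single clearance ladder, which is a simplification; real matters need per-matter access lists, but the fail-closed default and the redacted fallback are the parts that matter:

```python
# The boundary check, with a single clearance ladder for brevity.
LEVELS = {"public": 0, "internal": 1, "privileged": 2}

nodes = {
    "doc:spa_executed": {"sensitivity": "internal"},
    "note:counsel_advice": {"sensitivity": "privileged"},
}

def retrieve(node_ids, user_level):
    out = []
    for n in node_ids:
        tag = nodes[n].get("sensitivity", "privileged")  # fail closed
        if LEVELS[tag] <= LEVELS[user_level]:
            out.append(n)
        else:
            out.append(f"[redacted summary of {n}]")
    return out

print(retrieve(["doc:spa_executed", "note:counsel_advice"], "internal"))
```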

The embedding strategy

People talk about embedding strategy as though the choice of model were the main decision. It is not. The main decision is what you embed, at what granularity, and with what context attached.

For a practice archive I use three layers of embeddings, not one. The first is fine-grained: paragraph-level chunks, with a window of surrounding text included as context but not as the embedding target. This is what you query when somebody asks a precise factual question. The second is document-level: a single embedding per document, generated from a structured summary rather than from the document's raw first thousand tokens. This is what you query when somebody is trying to find the right document, not the right line. The third is matter-level: an embedding per matter, built from the summary of how the matter unfolded. This is what you query when somebody asks a question that begins have we ever seen a situation where….
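In code, the indexing step for the three layers looks roughly like this, with embed() standing in for a real embedding model; every other name is mine:

```python
# Indexing the three layers. embed() is a placeholder for a real model.
from dataclasses import dataclass

def embed(text):
    return [float(len(text))]  # placeholder, NOT a real embedding

@dataclass
class Entry:
    level: str     # "paragraph" | "document" | "matter"
    vector: list
    payload: dict  # what retrieval returns, not what was embedded

def index_document(doc_id, paragraphs, doc_summary):
    entries = []
    for i, para in enumerate(paragraphs):
        # The surrounding window travels with the chunk as payload,
        # but only the paragraph itself is embedded.
        window = paragraphs[max(0, i - 1):i + 2]
        entries.append(Entry("paragraph", embed(para),
                             {"doc": doc_id, "chunk": para, "context": window}))
    # Document layer: embed a structured summary, not the raw opening tokens.
    entries.append(Entry("document", embed(doc_summary),
                         {"doc": doc_id, "summary": doc_summary}))
    return entries

def index_matter(matter_id, matter_summary):
    # Matter layer: one embedding per matter, from a narrative summary
    # of how the matter unfolded.
    return Entry("matter", embed(matter_summary), {"matter": matter_id})
```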

Three layers, three different retrieval behaviours, one combined ranking step at the end. The cost is more storage and a more complicated query path. The benefit is that the system stops returning a paragraph from a 2008 file note when what the user actually wanted was the shape of a recurring problem across a decade of matters. Good legal RAG work, and good knowledge-retrieval AI more generally, live or die on this — on whether the granularity of what you embedded matches the granularity of what people actually ask.
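The combined step can be as simple as reciprocal rank fusion, which needs no score calibration across layers; in practice each hit is first mapped to a common unit, usually the document, so the id spaces agree. A minimal sketch with invented hit lists:

```python
# Reciprocal rank fusion across the three layers' result lists. The
# hit ids are invented; assume each has already been mapped to its
# parent document so the id spaces agree.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

paragraph_hits = ["doc7", "doc2", "doc9"]   # best paragraph per document
document_hits  = ["doc2", "doc7", "doc4"]
matter_hits    = ["doc2", "doc11"]          # documents from the matched matter
print(rrf([paragraph_hits, document_hits, matter_hits]))  # doc2 ranks first
```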

What the brain is for

It is worth being honest about what this kind of system is and is not. It is not a replacement for a lawyer, a doctor, or whichever professional generated the archive. It is a memory aid with structure. It answers the question what do we already know about this faster than a human can answer it from cold. It surfaces precedent that would otherwise be forgotten because the person who handled it has retired. It lets a junior find the executed version of a clause without having to ask three people which folder is the real one.

Used well, it shortens the distance between a question and the firm's existing answer to that question. Used badly, it invents answers the firm never gave. The difference between the two is almost entirely upstream — in the ingestion, the de-personalisation, the structure, and the embedding choices. By the time the user is typing a query, the quality of the answer has already been decided.

What to do this week

If you are sitting on a practice archive and thinking about a knowledge brain, do one thing this week: take a hundred randomly chosen documents from the last five years and try to classify them by hand. Not the metadata — the content. Note how often the filename lies, how often the version is unclear, how often a single document contains personal data you would not want a model to memorise. That hour of manual work will tell you more about what your ingestion pipeline has to do than any vendor demo will. At IMPT we are building the same kind of layered retrieval into our own internal tools — over commission ledgers, offset records, and partner contracts rather than legal files — and the lesson keeps repeating: the brain is only as good as the work you did before you let it learn.
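If you want to mechanise only the sampling, not the judgment, something like this is enough; the archive root is a placeholder:

```python
# Mechanising only the sampling: a random hundred files modified in
# the last five years. "/archive" is a placeholder path.
import random
import time
from pathlib import Path

cutoff = time.time() - 5 * 365 * 24 * 3600
recent = [p for p in Path("/archive").rglob("*")
          if p.is_file() and p.stat().st_mtime > cutoff]
for p in random.sample(recent, min(100, len(recent))):
    print(p)  # now classify each one by hand, from the content
```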

Letters from Clonmel

Quarterly long-form founder letters. Subscribe by email at mike@impt.io.
