
The data pipeline of the Intelligence Brain — from ingest to retrieval


Most "RAG demos" fall over the moment you point them at real organisational data. A folder of clean PDFs is not a knowledge base. A SharePoint site with fifteen years of overlapping drafts, scanned faxes, password-protected spreadsheets, and Outlook archives is a knowledge base. The Intelligence Brain pipeline exists because the gap between those two things is where every generic AI tool fails inside a regulated firm. This article walks through how the pipeline actually works — ingest, parse, chunk, embed, index, retrieve — and the engineering decisions behind each stage.

Why the pipeline is the product

The model is not the product. The pipeline is. Any half-decent open-weights model will answer well if you hand it the right three paragraphs from a 4,000-document corpus. The hard part — the part that takes months of engineering, not an afternoon of prompt-tuning — is reliably getting to those three paragraphs from a messy, permission-controlled, multi-format estate that sits behind a firewall.

When I designed the Intelligence Brain, I made one architectural decision early: the pipeline runs on the customer's hardware, end to end. No data leaves the building. That constraint shapes everything downstream — you cannot lean on hosted parsing APIs, hosted embedding endpoints, or hosted vector stores. Every stage has to run locally, deterministically, and on commodity kit. That's harder, but it's also the only honest answer for a firm under the Solicitors Accounts Regulations, GDPR, or sector-specific supervisory regimes.

Stage one — ingest and source connectors

Ingest is the boring part nobody writes about, and it's where 60% of the engineering hours go. The Intelligence Brain ships connectors for the file systems and applications I see again and again in Irish SME and mid-market environments: Windows file shares, SharePoint and OneDrive, Microsoft 365 mailboxes, Outlook PST archives, network-attached scanners, document management systems used in legal and accounting practice, and standard SQL databases.

Each connector does three things. First, it enumerates — walks the source and produces a manifest of every object with its native identifier, last-modified timestamp, size, and access control list. Second, it deduplicates against the manifest from the previous run, so we only re-process what has changed. Third, it streams the binary content into a staging area on the local appliance, never holding more in memory than a single object at a time.

The ACL is preserved alongside the content. This matters more than people realise. If a partner's compensation review is in a folder only three people can see, the pipeline must carry that permission all the way through to retrieval, or you've built a very expensive leak. We attach the ACL as metadata at ingest and enforce it at query time — more on that below.
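
To make the ingest step concrete, here is a minimal sketch of the manifest-diff, with the ACL carried as metadata on each entry. The field names and the JSON-on-disk manifest are illustrative, not the product's actual schema; the point is that each run only touches objects whose identifier, timestamp, or size has changed, and that the permission data travels with the entry from the very start.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class ManifestEntry:
    source_id: str          # native identifier in the source system
    last_modified: float    # epoch seconds
    size: int
    acl: list[str]          # principals allowed to read this object

def load_manifest(path: Path) -> dict[str, ManifestEntry]:
    """Load the manifest written by the previous run, if any."""
    if not path.exists():
        return {}
    raw = json.loads(path.read_text())
    return {k: ManifestEntry(**v) for k, v in raw.items()}

def changed_objects(current: dict[str, ManifestEntry],
                    previous: dict[str, ManifestEntry]) -> list[str]:
    """Return the source_ids that are new or modified since the last run."""
    return [
        sid for sid, entry in current.items()
        if (prior := previous.get(sid)) is None
        or prior.last_modified != entry.last_modified
        or prior.size != entry.size
    ]

def save_manifest(path: Path, manifest: dict[str, ManifestEntry]) -> None:
    """Persist this run's manifest for the next run to diff against."""
    path.write_text(json.dumps({k: asdict(v) for k, v in manifest.items()}, indent=2))
```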

Stage two — parsing and OCR

The parsing stage takes raw bytes and produces structured text plus metadata. In an idealised demo this is one library call. In reality you are dealing with: native PDFs with embedded text, scanned PDFs with no text layer, PDFs that lie about having a text layer, Word documents with tracked changes, Excel files where the meaning lives in a pivot table, .msg and .eml mail with attachments-of-attachments, and a long tail of TIFF, HEIC, and proprietary formats from line-of-business apps.

The pipeline routes each format to a specialised parser. For PDFs we first attempt text extraction; if the extracted text is suspiciously sparse relative to the page count, we fall back to OCR. For OCR I use Tesseract with language packs configured for English and Irish, plus a layout model that preserves table structure — because a financial statement turned into one continuous run of digits is worse than useless.
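
A simplified sketch of that fallback decision. The extraction and OCR calls are stand-ins for whatever PDF library and Tesseract wrapper you run locally, and the sparsity threshold is illustrative rather than tuned:

```python
def extract_text_layer(pdf_bytes: bytes) -> str:
    """Stand-in for native text extraction via a local PDF library."""
    ...

def ocr_pages(pdf_bytes: bytes, languages: tuple[str, ...]) -> str:
    """Stand-in for an OCR pass, e.g. Tesseract with the given language packs."""
    ...

def parse_pdf(pdf_bytes: bytes, page_count: int,
              min_chars_per_page: int = 200) -> str:
    """Trust the embedded text layer only if it is plausibly dense;
    otherwise treat the file as a scan and OCR it."""
    text = extract_text_layer(pdf_bytes)
    if len(text.strip()) < min_chars_per_page * page_count:
        # The PDF claims a text layer but it is too thin to trust.
        text = ocr_pages(pdf_bytes, languages=("eng", "gle"))  # English + Irish packs
    return text
```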

For email, threading matters. A reply quoted inline ten times in a chain produces ten near-duplicate chunks if you parse naively. The parser collapses quoted history and keeps only the new content of each message, with a pointer back to the parent. This single decision dramatically improves retrieval quality on mailbox-heavy estates.
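
A deliberately naive version of that collapse, keeping only the lines above the first marker that looks like the start of quoted history. Real reply formats are messier, so treat this as the shape of the idea rather than a complete parser:

```python
import re

QUOTE_MARKERS = (
    re.compile(r"^\s*>"),                                      # classic "> " quoting
    re.compile(r"^On .+ wrote:\s*$"),                          # "On <date>, <person> wrote:"
    re.compile(r"^-+\s*Original Message\s*-+$", re.IGNORECASE),
)

def strip_quoted_history(body: str) -> str:
    """Keep only the new content of a message, dropping quoted replies."""
    kept: list[str] = []
    for line in body.splitlines():
        if any(marker.match(line) for marker in QUOTE_MARKERS):
            break
        kept.append(line)
    return "\n".join(kept).strip()
```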

Every parsed document carries forward its provenance: source system, original path, ACL, parsed-at timestamp, parser version, and a content hash. Provenance is what lets you answer the auditor's question — "where did this come from?" — without hand-waving.
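
One way to represent that provenance record, with field names that are illustrative rather than the product's actual schema:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    source_system: str        # e.g. "sharepoint", "pst-archive"
    original_path: str
    acl: tuple[str, ...]      # principals allowed to read the source object
    parsed_at: str            # ISO 8601 timestamp
    parser_version: str
    content_hash: str         # hash of the raw bytes, for audit and dedup

def make_provenance(source_system: str, original_path: str,
                    acl: tuple[str, ...], parser_version: str,
                    raw_bytes: bytes) -> Provenance:
    return Provenance(
        source_system=source_system,
        original_path=original_path,
        acl=acl,
        parsed_at=datetime.now(timezone.utc).isoformat(),
        parser_version=parser_version,
        content_hash=hashlib.sha256(raw_bytes).hexdigest(),
    )
```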

Stage three — chunking strategy

Chunking is where naive RAG implementations quietly destroy themselves. Splitting every document into 512-token windows with 50-token overlap is the default advice, and it's wrong for most real corpora.

The Intelligence Brain uses structure-aware chunking. For a contract, the unit is the clause. For a board pack, the unit is the agenda item. For an engagement letter, the unit is the section. For a long email thread, the unit is the individual message. We detect structure first — using headings, numbering schemes, and layout cues from the parser — and only fall back to fixed-window splitting when no structure is detectable.
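
A minimal sketch of that decision, assuming the parser has already produced a list of structural units (clauses, agenda items, individual messages) where it could detect them:

```python
def chunk_document(units: list[str], fallback_text: str,
                   window: int = 512, overlap: int = 50) -> list[str]:
    """Prefer the parser's structural units as chunks; fall back to
    fixed-size word windows only when no structure was detected."""
    if units:
        return [u.strip() for u in units if u.strip()]
    # Fallback: naive fixed windows over whitespace tokens.
    words = fallback_text.split()
    chunks: list[str] = []
    step = window - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break
    return chunks
```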

Each chunk carries a small amount of contextual prefix: the document title, the section path, and where relevant the date. This is sometimes called contextual chunk enrichment, and it materially improves both embedding quality and the legibility of retrieved passages when the model assembles its answer. A chunk that reads "Clause 14.2 — Termination for convenience" is far more useful than the same prose with no header.
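
The enrichment itself is small. A sketch, assuming the parser hands over a title, a section path, and optionally a date:

```python
def enrich_chunk(chunk_text: str, title: str, section_path: str,
                 doc_date: str | None = None) -> str:
    """Prepend a short contextual header so the chunk is legible on its own,
    both to the embedding model and to the person reading a cited passage."""
    header_parts = [title, section_path]
    if doc_date:
        header_parts.append(doc_date)
    header = " — ".join(p for p in header_parts if p)
    return f"{header}\n{chunk_text}"
```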

Chunk size is bounded but variable. I target roughly 200 to 800 tokens per chunk, letting the structure decide. Tiny chunks lose context; huge chunks dilute the embedding signal. The right answer is "as big as the natural unit of meaning, no bigger."

Stage four — the embedding pipeline

The embedding pipeline turns each chunk into a vector. Every embedding model in the Intelligence Brain runs locally on the appliance — typically on a GPU for throughput, or on CPU for smaller deployments where overnight indexing is acceptable.

I run two embedding passes, not one. The first is a dense semantic embedding using a current-generation open model — these capture meaning, paraphrase, and conceptual similarity. The second is a sparse lexical representation, essentially a learned BM25 variant, which captures exact terms, names, file references, and codes that semantic embeddings frequently smooth over. Hybrid retrieval over both indexes consistently beats either alone, especially in legal and accounting corpora where exact terminology is the whole point.
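
Combining the two ranked result lists needs a fusion step. Reciprocal rank fusion is one common way to do it; take this as an illustration of the hybrid idea rather than the exact fusion logic in the product:

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs (e.g. dense and sparse results)
    into one list. Items ranked highly in either list float to the top."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```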

Embeddings are stored alongside the chunk text and full metadata in a local vector store. I've gone back and forth on which engine to use; the honest answer is that for the corpus sizes typical in Irish professional services firms, the engine matters less than the indexing strategy and the metadata schema. What matters is that the index is on the customer's disk, encrypted at rest, backed up like any other production database, and rebuildable from source in a documented procedure.

Re-embedding is a fact of life. When the embedding model is upgraded — which happens a few times a year as the open ecosystem moves — the entire corpus must be re-embedded. The pipeline is built to do this in the background without taking the system offline, swapping the index atomically when the rebuild completes.
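
The atomic swap can be as simple as rebuilding into a fresh directory and repointing a symlink once the rebuild finishes, so queries keep hitting the old index right up to the switch. A sketch of that pattern, with illustrative paths:

```python
import os
from pathlib import Path

def swap_index(live_link: Path, new_index_dir: Path) -> None:
    """Atomically repoint the 'live' index symlink at a freshly rebuilt index.
    Queries keep resolving the old index until the rename lands."""
    tmp_link = live_link.with_suffix(".tmp")
    if tmp_link.exists() or tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(new_index_dir, target_is_directory=True)
    os.replace(tmp_link, live_link)   # atomic rename on POSIX filesystems
```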

Stage five — retrieval, reranking, and ACL enforcement

At query time, the user's question goes through its own short pipeline. We rewrite the query if it's ambiguous or pronoun-heavy, expand it with synonyms drawn from the firm's own terminology, and run hybrid retrieval against both the dense and sparse indexes. The top candidates — typically 30 to 50 chunks — go through a reranker, which is a smaller cross-encoder model that scores each candidate against the query directly rather than via vector similarity. The reranker is slower per item but only sees a shortlist, and it consistently lifts the right answer into the top three or four positions.
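
The reranking step itself is compact. A sketch, with a stand-in for the local cross-encoder forward pass:

```python
def score_pair(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder scoring a (query, passage) pair directly."""
    ...

def rerank(query: str, candidates: list[dict], top_n: int = 4) -> list[dict]:
    """Score each candidate chunk against the query with the cross-encoder
    and keep only the best few for the answer prompt."""
    scored = [(score_pair(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]
```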

Critically, ACL enforcement happens before the reranker, not after the answer. We filter the candidate set down to only the chunks the asking user is permitted to see, based on the ACL metadata captured at ingest. The model never sees content the user couldn't open directly in the source system. This is non-negotiable in regulated environments and it's the single feature that, in my experience, distinguishes a usable internal AI from a compliance incident waiting to happen.
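
The filter itself is tiny once the ACL metadata has travelled this far; what matters is where it sits in the pipeline, not the code. A sketch, assuming ACLs are stored as collections of user or group principals on each chunk:

```python
def acl_filter(candidates: list[dict], user_principals: set[str]) -> list[dict]:
    """Drop any candidate chunk whose ACL does not include one of the asking
    user's principals. Runs before reranking, so content the user could not
    open in the source system never reaches the model."""
    return [c for c in candidates if user_principals.intersection(c["acl"])]
```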

The final reranked, permission-filtered passages are passed to the local language model with a prompt that requires citation of the source document for every claim. The answer is rendered with footnotes back to the original files — clickable, openable, auditable. If the model cannot ground a claim in retrieved context, it is instructed to say so rather than guess. You can read more about how this fits into the broader architecture on the Intelligence Brain overview.
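
For illustration, a prompt-assembly sketch along those lines. The wording is mine for the example, not the production prompt:

```python
def build_grounded_prompt(question: str, passages: list[dict]) -> str:
    """Number each retrieved passage so the model can cite [1], [2], ... and
    instruct it to say when the sources do not contain the answer."""
    context = "\n\n".join(
        f"[{i}] {p['title']} ({p['path']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the source number for every claim. If the sources do not "
        "contain the answer, say so explicitly.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```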

Where to start this week

If you're evaluating whether something like this would work in your firm, don't start with the model. Start with the data. Spend an afternoon listing your top five document repositories, estimating the total volume, and noting the formats, the access controls, and the number of people who would query it. That single inventory tells you more about the feasibility and value of an organisational intelligence layer than any vendor demo. If you'd like to talk through what that inventory implies for your specific environment, get in touch — that conversation is where every real deployment begins.

Book a 30-minute assessment

Direct with Michael. No charge. No pitch deck.
