Ryan de Melo

RAG Over Enterprise Records: The Boring Parts That Matter

The fastest way to lose an enterprise deal is to answer a question correctly for the wrong person. Not a hallucination. A perfectly accurate paragraph from a compensation review, summarized for a junior analyst who was never allowed to see it, with a tidy citation pointing right back at the source. The model did everything right. The system committed a data breach.

I have built RAG over corporate records more than once now: HR files, contracts, finance ledgers, support tickets, the kind of corpus where half the documents have an audience and the other half have a need-to-know. The demo takes the same afternoon it always does. Index the documents, embed the question, stuff the top chunks into a prompt. It looks finished. It is not within a mile of finished, because in the enterprise the interesting problems are not in the search. They are in everything around it.

Here is the thesis, and I will defend it for the rest of this post. Retrieval is an access-control problem wearing a search costume. The relevance ranking is the part tutorials obsess over and the part that was mostly solved years ago. The part that decides whether this thing can go live is whether the right person, and only the right person, can see the right record, at the moment they ask, with a citation they can actually open.

The trap is the embedding, not the answer

The first instinct everyone has, including me the first time, is to filter at the end. Retrieve broadly, generate the answer, then check whether the user was allowed to see the sources, and redact if not. It feels safe. It is not.

By the time you are redacting, the privileged text has already been placed in the prompt and sent to the model. It has shaped the answer. Even if you strip the citation, the summary in front of the user is built on content they had no right to. (You will see this the first time a model paraphrases a number from a document it then “declined” to cite. The number is still on the screen.)

So permission is not a post-filter. It is a retrieval predicate. The vector search itself has to be incapable of returning a chunk the asker cannot read. That is the whole ballgame, and it is unglamorous enough that most RAG content skips it.

Permission-aware retrieval, by hand

You attach the access control list to the chunk at index time, and you pass the user’s principals into the query so the store filters before similarity is computed. Not after. Before.

def retrieve(question, user, k=8, floor=0.74):
    # The user's full identity for ACL purposes: their own id plus every
    # group/role they inherit. Resolve this fresh per query. People get
    # removed from groups, and a stale principal set IS a security bug.
    principals = identity.principals_for(user)   # {"u:482", "grp:finance-ro", ...}

    q = embed(question)

    # The ACL filter runs INSIDE the index scan, not as a Python post-step.
    # If a chunk's allow-set shares nothing with the user's principals,
    # the store never even scores it. It cannot leak what it never returns.
    hits = index.search(
        q,
        k=k,
        where=acl_overlaps("allow_principals", principals),
    )

    keep = [h for h in hits if h.score >= floor and not h.doc.deleted]
    if not keep:
        # No readable, confident source. Say so. Do not widen the ACL
        # to find an answer. That is how you turn a search box into a leak.
        return None

    return [h.doc for h in keep]

The thing I want to underline is acl_overlaps living in the where clause. That is the difference between a system you can take to a security review and one that fails it. If your vector store cannot do a metadata filter as part of the search, that is not a minor inconvenience. It is a reason to pick a different store, or to shard your index by audience so the wrong rows are not physically present in the scan.

Two more things in that snippet earn their keep. The first is the not h.doc.deleted check, which I will come back to. The second is resolving principals per query instead of caching them on a session, because the gap between “person leaves the finance team” and “person stops retrieving finance documents” should be measured in seconds, not in whenever their token expires.
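To make "before, not after" concrete, here is a minimal in-memory sketch of what a predicate like acl_overlaps has to mean. It is not any particular vector store's API: Chunk, cosine, search and the sample data are all hypothetical names made up for illustration. A real store would push the same set-overlap test into its index scan.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical chunk record: an id, a toy 2-d vector, and the allow-set
    # stamped at index time.
    doc_id: str
    vector: tuple
    allow_principals: frozenset


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_vector, principals, k=8):
    principals = frozenset(principals)
    # Predicate first: a chunk whose allow-set shares nothing with the
    # asker's principals is dropped before any similarity is computed.
    readable = [c for c in chunks if c.allow_principals & principals]
    scored = [(cosine(query_vector, c.vector), c) for c in readable]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


CHUNKS = [
    Chunk("comp-review", (1.0, 0.0), frozenset({"grp:hr-comp"})),
    Chunk("handbook", (0.8, 0.6), frozenset({"grp:all-staff"})),
]

# The compensation review is the closest match to this query, but the
# analyst's principals do not overlap its allow-set, so it is never scored.
analyst_hits = search(CHUNKS, (1.0, 0.0), {"u:482", "grp:all-staff"})
print([c.doc_id for _, c in analyst_hits])  # ['handbook']
```

The design point is the order of the two lines inside search. Swap them and you have the post-filter from the previous section: the privileged chunk gets scored, ranked, and is one forgotten check away from the prompt.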

Here is the part nobody tells you

The hard part of enterprise permissions is not the allow-list. It is that the allow-list is computed somewhere else, by a system you do not own, and it changes constantly. A contract gets reassigned. A deal closes and the room shrinks. Someone goes on leave and their delegate inherits their inbox. Your static allow_principals field on the chunk is a photograph of a permission that is no longer true.

You need both halves. Push: when the source system changes an ACL, it emits an event and you re-stamp every affected chunk. Pull: at query time you re-resolve the user’s principals live, so a stale chunk-side ACL still gets intersected against a fresh user-side identity. The intersection is only as safe as the staler of the two. Treat any source that cannot emit permission-change events as one you re-index on a tight clock, because you have no other way to know when its truth moved.
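The two halves can be sketched in a few lines. This is an illustration under assumptions, not a real connector: ChunkMeta, AclChanged, restamp and readable are hypothetical names, and I am assuming acl_version is an integer that only ever increases, so an out-of-order event cannot roll a chunk back to an older ACL.

```python
from dataclasses import dataclass


@dataclass
class ChunkMeta:
    # Hypothetical chunk-side record of the ACL photograph.
    doc_id: str
    allow_principals: frozenset
    acl_version: int  # assumed monotonically increasing per document


@dataclass(frozen=True)
class AclChanged:
    # Hypothetical event emitted by the source system when an ACL moves.
    doc_id: str
    allow_principals: frozenset
    acl_version: int


def restamp(chunks, event):
    """Push half: rewrite the ACL on every chunk of the affected document."""
    touched = 0
    for chunk in chunks:
        # Ignore events that are not newer than what the chunk already holds.
        if chunk.doc_id == event.doc_id and chunk.acl_version < event.acl_version:
            chunk.allow_principals = event.allow_principals
            chunk.acl_version = event.acl_version
            touched += 1
    return touched


def readable(chunks, live_principals):
    """Pull half: intersect chunk-side ACLs with principals resolved right now."""
    live = frozenset(live_principals)
    return [c for c in chunks if c.allow_principals & live]


chunks = [ChunkMeta("deal-17", frozenset({"grp:deal-team", "grp:sales"}), 3)]

# The deal closes and the room shrinks: sales loses access.
restamp(chunks, AclChanged("deal-17", frozenset({"grp:deal-team"}), 4))
print(readable(chunks, {"u:9", "grp:sales"}))  # []
```

Neither function is safe alone. restamp without fresh principals still serves a person who left the group; fresh principals without restamp still serve a group that lost the document.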

Freshness, deletes, and the right to be forgotten

A record in an enterprise is not a document on a website. It changes. It gets superseded. Sometimes it gets deleted, and not as a nicety: a legal hold lifts, a customer exercises a deletion right, a retention window closes. When that happens, the embedding of that record has to be gone from your index, and “gone” has to mean gone, not flagged.

This is where “just re-embed everything nightly” quietly fails. A full nightly rebuild leaves a deleted record answerable for up to a day, which in a regulated shop is not an inconvenience, it is a finding. So you run incremental: the source emits a change, you re-embed that record, and a delete is a hard removal, not a deleted=true you remember to filter on. (Soft-delete-and-filter is how the deleted document slips through the one code path that forgot the filter. Make it physically impossible instead.)

def apply_change(event):
    doc_id = event.doc_id

    if event.op == "delete":
        # Hard delete. The vector and its source text both leave the store.
        # A tombstone we forget to filter is a breach with a timestamp.
        index.purge(doc_id)
        provenance.record_purge(doc_id, reason=event.reason, at=event.ts)
        return

    record = source.fetch(doc_id)          # current truth, not a cached copy
    if record.acl_version != event.acl_version:
        # The ACL moved between the event and our fetch. Re-fetch, don't guess.
        return apply_change(event.refresh())

    chunks = chunk(record.blocks)
    vectors = embed_all(chunks)
    index.upsert(
        doc_id,
        vectors,
        allow_principals=record.allow_principals,
        source_version=record.version,      # lineage starts here
        updated_at=record.ts,
    )

I keep source_version and updated_at on every chunk because the next demand from any serious buyer is not “is the answer good.” It is “where did this come from, and was it current.” Which brings me to the last boring thing.

Lineage a human can actually check

A citation the user cannot open is decoration. If your answer says “per the FY24 services agreement” and the link 404s, or worse, links to a document the user is not allowed to open, you have built a thing that looks trustworthy and is not. The citation has to resolve, in the user’s own permission context, to the exact version of the record retrieved. Same source_version you stamped at index time. Not “the latest copy,” which may have changed since the answer was generated, and not a deep link that bypasses the source system’s own access check.

The test I use is blunt. Pick any sentence in the answer, click its citation, and land on the specific paragraph of the specific version that produced it, gated by the same permissions the retrieval used. If a sentence has no citation that survives that click, it does not get to be in the answer. An enterprise user does not want a confident summary. They want to verify, in ten seconds, that the machine is not lying, then forward the source to their boss with their own name on it.
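That click test can be run by the machine before the answer ships. A rough sketch, with every name hypothetical: Citation, Sentence, citation_opens and verifiable are illustrative, and the versions dictionary stands in for asking the source system itself whether this exact version exists and who may open it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Citation:
    doc_id: str
    source_version: int  # the version stamped at index time
    paragraph: int       # where the click should land


@dataclass(frozen=True)
class Sentence:
    text: str
    citation: Optional[Citation]


def citation_opens(citation, principals, versions):
    """True only if this exact version exists and the user may open it.

    `versions` maps (doc_id, source_version) to an allow-set. It is a
    stand-in for a live check against the source system.
    """
    allow = versions.get((citation.doc_id, citation.source_version))
    return allow is not None and bool(allow & principals)


def verifiable(sentences, principals, versions):
    """Keep only sentences whose citation survives the click."""
    principals = frozenset(principals)
    return [
        s for s in sentences
        if s.citation is not None
        and citation_opens(s.citation, principals, versions)
    ]


versions = {("msa-fy24", 7): frozenset({"grp:legal"})}
draft = [
    Sentence("Renewal is automatic.", Citation("msa-fy24", 7, 12)),
    Sentence("Notice period is 60 days.", Citation("msa-fy24", 6, 4)),  # old version
    Sentence("This is standard practice.", None),                       # no source
]

print([s.text for s in verifiable(draft, {"grp:legal"}, versions)])
# ['Renewal is automatic.']
```

The lookup is keyed on the version, not just the document, which is what keeps "the latest copy" from quietly standing in for the one that was retrieved.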

None of this is the model. The model has been good enough for this job for a while. What separates a RAG demo from a system a regulated company will run is permissions enforced inside the retrieval, freshness measured in seconds, deletes that are deletes, and a citation a human can open and trust. It is plumbing. It is the boring part. It is the entire product.

So before you tune another reranker, ask the uncomfortable question: if I retrieve as the wrong user, what comes back? If you do not already know, you do not have a search problem. You have a security one.

