There are two ways to lose your job shipping GenAI into a regulated business, and they look like opposites.
The first is the committee. The risk people, the legal people, the model-governance people all agree the safest move is to study it further. The pilot becomes a pilot of a pilot. Eighteen months later a competitor has shipped, your best engineers have left for somewhere that lets them build, and the thing you were protecting (the franchise, the license, the trust) erodes anyway, just slower. Nobody gets fired for this on a Tuesday. They get fired across two budget cycles, when the board asks why the AI line item produced a deck and no product.
The second is the cowboy. You wire a frontier model into a system of record, give it a tool that can move money or change a patient record, and demo it to leadership. For six weeks you are a genius. Then a regulator asks you to explain a specific decision the system made on a specific date, and you cannot, because you logged the output and not the reasoning, the model version, the retrieved context, or who approved it. You are not explaining a feature. You are explaining a control failure to people who use the word “material.”
I have built this for a payments business handling billions in volume and for industrial operations where the wrong action has a physical consequence. The interesting work is the narrow path between those two failures, and most of what makes it walkable is dull.
Start where being wrong is cheap
The instinct in a regulated shop is to pick the highest-stakes workflow and prove GenAI on that, because that is where the ROI slide is loudest. This is backwards. You want your first production system read-only and low-stakes, where a wrong answer is an annoyance, not an incident.
Document search over internal policy. Drafting a first pass of a memo a human will rewrite anyway. Summarizing a case file for the person who will then go read the case file. None of these touch a system of record. None are irreversible. All of them let you build the boring machinery (logging, evals, access control, the audit trail) where a bad day costs an apology, not a filing.
Here is the part nobody tells you. The read-only system is not a stepping stone you throw away; it is the thing you keep. The plumbing that makes a search tool auditable is exactly the plumbing the high-stakes system needs, and you got to debug it where the blast radius was small. Teams that skip this and go straight for the money workflow build the audit machinery under incident pressure, the worst possible time to design a control.
Keep a human on anything you cannot take back
The line I draw is not “AI” versus “no AI.” It is reversible versus irreversible. An agent can read everything, draft, propose, score, flag, rank, narrate. The moment its action cannot be undone (moving funds, submitting to a regulator, altering a record someone else relies on), a human stands in the path with the authority to say no.
This sounds like a safety net. It is not, and treating it like one is how the cowboy sneaks back in. Bolt an approval step onto a system that hands the human forty decisions an hour with no explanation, and they become a rubber stamp with a pulse. You have built something worse than full automation, because now there is a person to blame. The human has to be given the material for judgment, not just a button: the system shows its work (what it retrieved, why it scored the way it did, what it is uncertain about). The loop is a design problem, and the design is most of the value.
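A minimal sketch of that line in code, under stated assumptions: every name here (ProposedAction, Decision, gate) is hypothetical, and how an action gets classified as irreversible is left to whoever owns the workflow. The point it illustrates is that the gate refuses to ask for approval when the human would be deciding blind.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    IRREVERSIBLE = "irreversible"


@dataclass(frozen=True)
class ProposedAction:
    # Hypothetical shape: what the agent wants to do, plus its evidence.
    name: str
    reversibility: Reversibility
    rationale: str                      # why the system proposes this
    retrieved_sources: tuple[str, ...]  # what it looked at
    uncertainty: str                    # what it is unsure about


@dataclass(frozen=True)
class Decision:
    approved: bool
    reviewer: Optional[str]  # None when no human was required
    note: str = ""


Reviewer = Callable[[ProposedAction], Decision]


def gate(action: ProposedAction, reviewer: Reviewer) -> Decision:
    """Let reversible actions through; route irreversible ones to a human."""
    if action.reversibility is Reversibility.REVERSIBLE:
        return Decision(approved=True, reviewer=None, note="auto: reversible")
    # An approval request without evidence is a rubber stamp by design.
    if not (action.rationale and action.retrieved_sources and action.uncertainty):
        raise ValueError("irreversible action submitted without supporting evidence")
    decision = reviewer(action)
    if decision.reviewer is None:
        raise ValueError("irreversible action needs a named human reviewer")
    return decision
```

In this sketch the reviewer is any callable that returns a Decision, so the same gate can sit in front of a review queue, a ticketing step, or a test double. A "no" comes back as an ordinary Decision with approved=False; only a missing explanation or a missing named reviewer is treated as an error.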
(The fastest way to tell whether a human-in-the-loop step is real or theater: ask how often the human says no. If the answer is “basically never,” a regulator will see it before you do.)
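That check is easy to run against your own review log before anyone else does. A rough sketch follows; the 1% floor and the 200-review minimum are illustrative assumptions, not regulatory thresholds, and the function names are hypothetical.

```python
def rejection_rate(decisions: list[bool]) -> float:
    """Fraction of human reviews that ended in 'no'. True means approved."""
    if not decisions:
        raise ValueError("no reviews recorded")
    return decisions.count(False) / len(decisions)


def looks_like_rubber_stamp(
    decisions: list[bool],
    floor: float = 0.01,     # assumed threshold, tune per workflow
    min_reviews: int = 200,  # assumed minimum sample before flagging
) -> bool:
    """Flag a review step whose rejection rate sits below the assumed floor."""
    if len(decisions) < min_reviews:
        return False  # too few reviews to say anything either way
    return rejection_rate(decisions) < floor
```

A flag here is a prompt to go look, not a verdict. A very low rejection rate can also mean the upstream system is proposing good actions, which is why the number belongs next to a sample of the cases the reviewer actually saw.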
Instrument for audit on day one, not after the first incident
You cannot reconstruct an audit trail after the fact. Either it was there when the decision was made, or it never existed.
So from the first commit, every model-touched decision writes down enough to be reconstructed cold, months later, by someone who was not in the room. That means the exact prompt and model version, the retrieved context and where it came from, the output, the score, and whether a human reviewed it and what they decided. This is not a feature for the regulator. It is the feature that lets you debug your own system, prove a customer complaint right or wrong, and sleep the night before an exam.
The cost is real and people underweight it. Full-fidelity logging of context and reasoning is expensive in storage and discipline, and there is a constant pull toward logging less to move faster. Resist it. The day you need the trail, no money buys it back. Model risk lives here too: the first time the model behind your feature changes and you cannot say which version produced last quarter’s decisions, you learn why the version stamp was not optional.
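One way to picture the record described above is a single immutable entry per decision, serialized with a checksum. This is a sketch under assumptions: the field names and the AuditRecord type are hypothetical, and a SHA-256 digest stored alongside the record only catches accidental corruption or careless edits. It is not a substitute for append-only storage or access control.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    # Hypothetical field set, taken from the list in the text above.
    decision_id: str
    timestamp_utc: str                              # ISO 8601, set by the caller
    model_version: str                              # the stamp that is not optional
    prompt: str                                     # exact prompt as sent
    retrieved_context: tuple[tuple[str, str], ...]  # (source, text) pairs
    output: str
    score: Optional[float]
    reviewer: Optional[str]                         # None if no human reviewed it
    reviewer_decision: Optional[str]                # e.g. "approved" / "rejected"


def to_log_line(record: AuditRecord) -> str:
    """Serialize the record as one JSON line with a digest of its contents."""
    payload = asdict(record)
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"record": payload, "sha256": digest}, sort_keys=True)


def verify(line: str) -> bool:
    """Recompute the digest from the stored record and compare."""
    entry = json.loads(line)
    body = json.dumps(entry["record"], sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest() == entry["sha256"]
```

Making model_version a required field with no default is the design choice that matters most here: a record cannot be constructed without it, so nobody gets to skip the stamp when they are in a hurry.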
Make compliance a design partner, not a gate at the end
This is the part most engineers get wrong, and I got it wrong for years. You treat compliance as the wall you throw the finished thing over and hope it clears. They treat you as the people who keep trying to do reckless things, because from where they sit, you are. The project dies in that gap.
The move that changed it was bringing the compliance and risk people in at design time, before there was anything to approve. Not to bless a finished system, but to shape which workflow to start with, where the human belongs, what the audit trail has to capture to satisfy an actual examiner. A good risk partner tells you the thing you were terrified of is fine and the thing you waved past is the one that ends careers.
When they are in the room early, the gate at the end mostly disappears, because there is nothing to gate. The controls were designed in, not bolted on, and when something goes wrong (it will), you have a shared decision history instead of a blame handoff.
Write down why
The last piece is the cheapest and the most skipped. Document your reasoning. Why this workflow first, why the human sits here and not there, why you accepted this risk and declined that one. The question that ends careers in a regulated business is rarely “did the system make a mistake.” Systems make mistakes. The question is “was this a reasonable decision by competent people who understood the risk, or was it negligence.” Those two have identical outcomes and opposite consequences, and the only thing separating them is the record of how you decided, written at the time.
The committee never ships and dies slowly. The cowboy ships and dies in one meeting. The path between is unglamorous all the way down, and it decides whether you are still there to build the next one.