Ryan de Melo

Sovereign AI: Running GPUs On-Prem When the Cloud Isn't an Option

The data cannot leave the building. Not the country, the building. That sentence ended a six-week cloud architecture effort and started a procurement cycle for racks, and it is the most common sentence I have heard in regulated AI work this year.

For a long time the answer to “we need GPUs” was a region selector and a credit card. That answer is still right for most teams. But there is a growing set of workloads where the data sits under a residency rule, a sovereignty clause, or a contract with a counterparty who will walk if their records touch a hyperscaler. When the legal boundary is a physical address, the cloud is not a cheaper option that you reluctantly skip. It is not an option at all. So inference comes home, onto a fixed fleet of GPUs you own, in a room you can point at.

I have built on both sides of this line. Here is what the on-prem version actually costs, and the part of it that has nothing to do with money.

The build-vs-rent math is not the math you think

Everyone opens this decision as a unit-cost comparison. Dollars per GPU-hour rented versus dollars per GPU-hour owned, amortize the hardware over three years, find the break-even utilization, done. That arithmetic is real and you should do it. It is also not the decision.

The decision is whether you can rent at all. If the data is under a hard residency rule, the cloud price is infinite, because the option does not exist for this workload. The math collapses to: what does it cost to do the only thing you are allowed to do. The comparison is not cloud versus on-prem. It is on-prem versus not shipping the product.

For the workloads where you genuinely have a choice, the honest version of the break-even is this. Rented GPUs win when your demand is spiky, when you are still discovering the workload, and when idle capacity would sit there mocking you. Owned GPUs win when demand is steady and predictable, when you will saturate the fleet, and when the data constraint or the long-run volume makes the rental meter unbearable. The hardware is the cheap part of owning. (The hardware is never the expensive part of anything.) The expensive parts are power, cooling, the supply chain, and the people who keep it alive at 3am.
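For the cases where the choice is real, the unit-cost arithmetic can be sketched in a few lines. This is a minimal sketch, not a pricing model: every number below is a made-up placeholder rather than a market price, and the names (OwnedGpuCost, yearly_opex, break_even_utilization) are hypothetical. It folds power, cooling, and staff into one per-GPU yearly figure, which is exactly the part you should expect to get wrong on the first pass.

```python
from dataclasses import dataclass

HOURS_PER_YEAR = 24 * 365


@dataclass
class OwnedGpuCost:
    """Per-GPU cost of ownership. All fields are illustrative assumptions."""
    purchase_price: float       # hardware price per GPU
    amortization_years: float   # e.g. three years, as in the text
    yearly_opex: float          # power, cooling, staff share per GPU per year

    def cost_per_wall_clock_hour(self) -> float:
        # An owned GPU costs this much every hour, busy or idle.
        yearly = self.purchase_price / self.amortization_years + self.yearly_opex
        return yearly / HOURS_PER_YEAR


def break_even_utilization(owned: OwnedGpuCost, rental_rate_per_hour: float) -> float:
    """Fraction of hours an owned GPU must be busy to match the rental bill.

    A result above 1.0 means renting is cheaper even at full utilization.
    """
    return owned.cost_per_wall_clock_hour() / rental_rate_per_hour


# Placeholder inputs chosen only to make the arithmetic easy to follow.
owned = OwnedGpuCost(purchase_price=30_000, amortization_years=3, yearly_opex=7_520)
u = break_even_utilization(owned, rental_rate_per_hour=4.0)
print(f"break-even utilization: {u:.0%}")
```

With these placeholder inputs the owned GPU costs 2.00 per wall-clock hour, so it has to be busy half the time to beat a rental rate of 4.00. Under a hard residency rule the rental rate is effectively unbounded and the ratio goes to zero, which is the point of the section.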

The constraints nobody puts in the slide

Here is the part nobody tells you. In the cloud, capacity is someone else’s problem and elasticity hides every planning mistake you make. On-prem, the fleet is fixed, and every constraint you used to externalize lands on your desk at once.

Power is the first wall. High-end inference GPUs are not casually power-hungry, they are aggressively so, and a rack of them can exceed what the building’s existing electrical feed was provisioned to deliver. You do not discover this in a design review. You discover it when facilities tells you the floor’s circuit cannot take another rack and the upgrade is a construction project with a permit.

Cooling is the second wall, and it arrives right behind the first, because every watt you push in comes back out as heat. Air cooling has a ceiling that a dense GPU rack blows straight through, which is how you end up in a conversation about liquid cooling that nobody on the team signed up for.

Supply is the third. You cannot autoscale a thing you have to wait months to take delivery of. Lead times on the good accelerators stretch, allocation is political, and the unit you spec today may not be the unit you can actually buy this quarter. Capacity planning without elastic scale means you are forecasting demand a year out and committing capital to that forecast, with no slider to nudge if you guessed low. Guess high and you own idle silicon. Guess low and you are rationing GPU time across angry teams while a purchase order crawls through procurement.

That is the trade you are signing. You traded a variable bill and infinite elasticity for a fixed asset and a hard ceiling. Nobody hands you that ceiling back.

Making a fixed fleet feel like a platform

A pile of GPUs is not a platform. It is a pile of GPUs, and if you hand it to teams raw, the strongest team grabs all of it and everyone else files tickets. The whole job of the software layer is to take a fixed, finite pool and make it behave like the elastic service it can never actually be.

Figure: An on-prem inference platform. A scheduler over a fixed GPU pool, fronted by a model-serving gateway, serving several tenants, all inside a data boundary that nothing crosses.

Everything lives inside the boundary. The scheduler is the thing that turns a fixed pool into something that feels shared and fair.

Three layers earn their place.

A scheduler that understands GPUs, not just CPUs and memory. It has to handle fractional allocation, because not every model needs a whole accelerator and stranding a full GPU on a small model is how you waste a fleet you cannot grow. It has to enforce quotas per tenant, so steady-state work and a sudden batch job do not starve each other. And it has to preempt, because the most expensive thing on the floor should not be tied up by a low-priority job while high-priority work waits.
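To make the three scheduler duties concrete, here is a toy sketch of fractional allocation, per-tenant quotas, and priority preemption over a fixed pool. It is an illustration under stated assumptions, not the scheduler of any real platform: the classes Job, Gpu, and Scheduler are hypothetical, a job is assumed to fit on a single GPU, and how a fraction is actually enforced on hardware (partitioning, time-slicing, or something else) is out of scope.

```python
from dataclasses import dataclass, field

EPS = 1e-9  # tolerance for floating-point fraction sums


@dataclass
class Job:
    name: str
    tenant: str
    gpu_fraction: float  # share of one accelerator, e.g. 0.25
    priority: int        # higher number wins


@dataclass
class Gpu:
    name: str
    jobs: list = field(default_factory=list)

    def free(self) -> float:
        return 1.0 - sum(j.gpu_fraction for j in self.jobs)


class Scheduler:
    def __init__(self, gpus, quotas):
        self.gpus = gpus
        self.quotas = quotas  # tenant -> max total GPU fraction across the pool

    def tenant_usage(self, tenant: str) -> float:
        return sum(j.gpu_fraction for g in self.gpus for j in g.jobs
                   if j.tenant == tenant)

    def submit(self, job: Job):
        """Place a job. Returns the list of preempted jobs, or None if rejected."""
        # Quota check: a tenant cannot exceed its share of the fixed pool.
        limit = self.quotas.get(job.tenant, 0.0)
        if self.tenant_usage(job.tenant) + job.gpu_fraction > limit + EPS:
            return None
        # Best fit: pick the GPU with the least free space that still fits,
        # so large contiguous fractions stay available for bigger models.
        fits = [g for g in self.gpus if g.free() + EPS >= job.gpu_fraction]
        if fits:
            min(fits, key=Gpu.free).jobs.append(job)
            return []
        # Preemption: evict strictly lower-priority jobs, lowest first,
        # from the first GPU where that frees enough room.
        for g in self.gpus:
            victims, freed = [], g.free()
            lower = sorted((j for j in g.jobs if j.priority < job.priority),
                           key=lambda j: j.priority)
            for v in lower:
                victims.append(v)
                freed += v.gpu_fraction
                if freed + EPS >= job.gpu_fraction:
                    break
            if freed + EPS >= job.gpu_fraction:
                for v in victims:
                    g.jobs.remove(v)
                g.jobs.append(job)
                return victims
        return None  # no capacity, and nothing eligible to preempt
```

A single-GPU walk-through: two half-GPU batch jobs from one tenant fill the card, a third is refused by quota, and a higher-priority serving job from another tenant evicts one batch job. What happens to the evicted job (requeue, checkpoint, drop) is a policy decision this sketch leaves open.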

A model-serving layer that batches aggressively and keeps hot models resident. Loading model weights onto a GPU is slow, so you do not want to pay that cost per request. Continuous batching, where the server packs requests from many callers into the same forward pass, is what turns a fixed throughput ceiling into respectable utilization. This is the layer that decides whether your owned GPUs feel fast or feel like a queue.
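The effect of continuous batching on a fixed throughput ceiling can be shown with a small simulation. This is a sketch under simplifying assumptions, not how any particular serving engine is implemented: each request is reduced to a positive count of tokens to generate, one forward pass produces one token for every active request, and prefill cost and KV-cache memory are ignored. The function names are hypothetical.

```python
from collections import deque


def static_batching_steps(token_counts, max_batch):
    """Forward passes when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(token_counts), max_batch):
        # Slots freed by short requests sit empty until the batch ends.
        steps += max(token_counts[i:i + max_batch])
    return steps


def continuous_batching_steps(token_counts, max_batch):
    """Forward passes when a finished request's slot is refilled immediately."""
    queue = deque(token_counts)
    active = []  # remaining tokens for each request currently in the batch
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next pass.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One forward pass: every active request emits one token.
        active = [remaining - 1 for remaining in active]
        # Drop finished requests so their slots open up for the next pass.
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# One long request followed by three short ones, two slots available.
requests = [8, 2, 2, 2]
print(static_batching_steps(requests, max_batch=2))      # 10
print(continuous_batching_steps(requests, max_batch=2))  # 8
```

The short requests ride along in the slot next to the long one instead of waiting for it, so the same work finishes in fewer passes on the same hardware. On a fleet that cannot grow, that difference is capacity you do not have to buy.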

Multi-tenancy with real isolation. Several teams, often several regulated lines of business, share one fleet, and one tenant’s traffic spike cannot become another tenant’s outage or, worse, another tenant’s data leaking through a shared cache. Isolation here is not a nice-to-have. It is frequently the reason you were allowed to build on-prem in the first place, and a sloppy version of it hands the auditor exactly the finding they came for.

Get those three right and a fixed fleet starts to feel like a service. Teams ask for capacity and get it, batches run, and nobody knows or cares that under the platform there is a hard number of cards in a rack that will not change until next year’s budget.

The downside you own forever

The honest cost of owning the fleet is that you own the idle time too.

In the cloud, an idle GPU costs you nothing the moment you release it. On-prem, a GPU costs you the same capital whether it is running flat out or sitting dark, because you already paid for it, and even when idle it is still drawing power and throwing heat just to exist. Utilization is no longer a nice efficiency metric. It is the entire economic case. A fleet running at a fraction of capacity is a capital decision that aged badly, in a rack, where everyone can see it.

So the work that never ends is squeezing utilization without breaking the isolation that justified the whole thing. Pack more tenants on. Batch harder. Schedule off-peak training into the gaps that inference leaves. Every one of those moves trades a little safety for a little efficiency, and you are the one deciding how much, on a fleet that cannot grow to bail you out.
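One of those moves, lending idle GPUs to training in the gaps inference leaves, reduces to simple arithmetic once you pick a safety margin. The sketch below assumes an hourly forecast of inference demand in whole GPUs and a reserved headroom for unexpected spikes; the function names and all numbers are hypothetical placeholders.

```python
def training_slots(inference_demand, fleet_size, headroom):
    """GPUs per hour that can be lent to training without touching the reserve."""
    return [max(0, fleet_size - demand - headroom) for demand in inference_demand]


def utilization(used_per_hour, fleet_size):
    """Fraction of available GPU-hours that did useful work."""
    return sum(used_per_hour) / (fleet_size * len(used_per_hour))


# Four forecast hours on an eight-GPU fleet, keeping one GPU in reserve.
demand = [6, 7, 3, 1]
slots = training_slots(demand, fleet_size=8, headroom=1)
combined = [d + s for d, s in zip(demand, slots)]

print(slots)                                # [1, 0, 4, 6]
print(utilization(demand, fleet_size=8))    # 0.53125
print(utilization(combined, fleet_size=8))  # 0.875
```

The headroom parameter is the safety-for-efficiency dial in numeric form. Set it to zero and utilization climbs, but the first inference spike has to preempt training before it can be served.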

That is sovereign AI when you strip the word “sovereign” off it. A fixed amount of compute, in a room, that you have to keep busy and keep separate at the same time, because the data was never allowed to leave. The teams that do this well are not the ones with the most GPUs. They are the ones who treat the ceiling as the design, not the disappointment.

Would you have signed for the racks if the rule had let you rent? Most days, the rule is the only honest reason to.

