Shipping ML to Twenty Teams: The Platform Bet That Paid Off

The day I knew the platform had worked was the day I found out about a model in production I had never heard of.

A fraud team had shipped a new scoring model. New features, new training run, live in the authorization path, taking real traffic. Nobody told my team. Nobody needed to. They pulled features from the store, trained on the standard pipeline, pushed to the serving layer behind the standard endpoint, and watched the dashboards we gave them. The whole thing went out while we were arguing about something else entirely. Two years earlier that same change would have meant a ticket, a planning meeting, three weeks in our queue, and a deploy one of my engineers babysat at night.

That is the bet. A self-service ML platform is a bet that you can take yourself out of the critical path and the org will go faster, not break. We made it for a marketplace where a couple dozen product teams all wanted models in their features, and where my central team was the bottleneck for every single one. This is the two-year retrospective. Some of it I would do again tomorrow. Some of it was vanity dressed up as architecture.

What we actually built

Four layers, and the order we built them in matters more than the list.

The ML platform as layers: feature store, pipelines, serving, and vector store underneath, with product teams self-serving on top

The platform is the bottom three layers. The teams live on top, and the win is how rarely the two have to talk.

A feature store, so teams stopped recomputing the same user and merchant features in slightly different, slightly wrong ways. Standardized training pipelines, so a model went from notebook to artifact the same way every time and we could actually reason about what was running. A model serving layer with one deploy path, one rollout pattern, one place to roll back. And later, a vector store, because retrieval and embedding workloads showed up across enough teams that running them ad hoc had become its own tax.

None of those layers is novel. The bet was never in the components. It was in the contract: a paved path good enough that a team’s fastest route to production runs straight through it, not around it.

What it got right

The thing that worked was treating the platform as roads, not rules. We did not mandate the feature store. We made it the path of least resistance and let the mandate be implicit. If pulling a precomputed feature was two lines and recomputing it yourself was a day of plumbing plus a backfill job plus a monitoring story you had to invent, people pulled the feature. The standard won because it was easier, not because a committee blessed it.

Standardized serving was the single highest-leverage thing we did. One deploy path meant one set of safety properties for everyone. Canary rollout, automatic rollback on error-rate regression, shadow traffic before anything took real load. We built that once, carefully, and every team inherited it whether they understood it or not. The fraud team that shipped without telling me got a safe rollout for free, because safe was the default and they would have had to work to opt out of it.

And the deploy friction came down hard. The cost of getting a model from trained to serving, measured honestly in engineer-days, fell by something close to a third over those two years. Not because we wrote faster code. Because we deleted steps. Every handoff between the modeling team and my platform team was a place a model could sit for a week, and we removed most of the handoffs.

What it got wrong

Here is the part nobody tells you about platform work. The failures are not crashes. They are things you built beautifully that nobody wanted, and you do not find out for a year.

We built the vector store too early. We saw embedding workloads coming, got excited (it was 2023, everyone was excited), and stood up a dedicated, well-operated vector serving layer ahead of real demand. For most of a year it served two teams who would have been completely fine with the vector extension on the Postgres they already had. We carried the operational weight of a system sized for a wave that had not arrived. The lesson was not “vector stores are bad.” It was that platform teams are supposed to pave roads people already walk, and we paved one into a field because we could see a town might be built there someday.

The deeper mistake was over-abstraction. We had a theory that every model was an instance of a general shape (features in, training run, artifact out, served behind an endpoint) and we built abstractions to match the theory. They were elegant. They were also a wall. A team with a model that did not fit the shape, a streaming feature, an odd serving requirement, a latency budget our generic path could not hit, had two options: contort their problem to fit our abstraction, or fork off the platform entirely. Some forked. Every fork was a small indictment. The abstraction that was meant to serve everyone had quietly started deciding which problems were allowed to exist.

The fix, when we finally admitted it, was to expose the layers underneath the abstraction. Let a team drop down to the raw serving primitive when the paved path did not fit, instead of forcing the choice between conformity and exile. A good platform is not one interface. It is a paved road with marked exits, and you have to design the exits as deliberately as the road.

The only metric that mattered

We tracked the usual things. Adoption, models in production, the deploy-cost number I am proud of. They were all real and they all missed the point.

The honest measure of a self-service platform is how often the platform team is in the loop on a launch. Early on we were in every loop, because nothing shipped without us. Success was watching that number fall toward zero while the number of models shipped climbed. The fraud model I never heard about was not a process failure. It was the entire thing working. The platform team becoming optional, by design, for the ordinary case, and present only for the genuinely hard ones, the new layer, the new safety property, the team whose problem did not fit yet.

That is the uncomfortable success condition. You build a thing whose victory looks like your own absence. The org ships more models and talks to you less. If your platform’s health depends on teams needing you, you did not build a platform. You built a service desk with better branding, and you are still the bottleneck, just a more expensive one.

Would I make the bet again? Yes, and earlier, and with fewer abstractions. The roads paid for themselves. The cages I built by accident I had to tear down one at a time, and the teams remembered every one. If you take nothing else from two years of this: build the road people are already walking, mark the exits, and measure your success by how rarely anyone has to come find you.