Everyone Wants ChatGPT in Their Product. Most Should Wait.

Three meetings last week opened with the same sentence. “We need to get ChatGPT into the product.” Not “we have a problem that this might solve.” The feature first, the problem to follow. That order is the tell.

I get it. The demo is the most convincing thing any of us have seen in a decade. You type a sentence, it writes a paragraph that sounds like a person, and your brain does the rest. The leap from there to “this is a product” feels like a formality. It is not a formality. It is the entire job, and almost nobody asking for it in January has done the part that turns a demo into something you can put in front of customers and sleep at night.

So let me do the math out loud, because the demo hides all of it.

Start with latency. The thing that feels instant on the website is streaming tokens to one person watching them appear. Inside your product, in a real request path, you are waiting on a model that takes seconds to answer, hosted by a company whose API throws rate limits and the occasional outage you do not control. A few seconds is fine for “write me a poem.” It is not fine sitting inside a checkout, a support reply that a human is waiting on, or anything a user expects to feel like software. You will spend real effort hiding that latency, and some of the time you will fail.
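One common way to hide that latency is to give the model a fixed time budget and fall back to something safe when it blows through it. Here is a minimal sketch of that idea. The `call_model` function is a hypothetical stand-in that sleeps instead of calling a real API, and the budget values and fallback text are assumptions chosen only for illustration.

```python
import asyncio

# Assumed fallback copy; a real product would pick its own wording.
FALLBACK = "We're still working on that. A teammate will follow up shortly."


async def call_model(prompt: str, simulated_delay_s: float) -> str:
    """Hypothetical stand-in for a hosted model call (sleeps, no network I/O)."""
    await asyncio.sleep(simulated_delay_s)
    return f"model answer to: {prompt}"


async def answer_within_budget(
    prompt: str, budget_s: float, simulated_delay_s: float
) -> str:
    """Return the model's answer if it arrives within budget_s, else the fallback."""
    try:
        return await asyncio.wait_for(
            call_model(prompt, simulated_delay_s), timeout=budget_s
        )
    except asyncio.TimeoutError:
        # The slow call is cancelled; the user gets a safe default instead.
        return FALLBACK


def demo() -> tuple[str, str]:
    """Run one call that beats its budget and one that misses it."""
    fast = asyncio.run(answer_within_budget("reset my password", 0.2, 0.01))
    slow = asyncio.run(answer_within_budget("reset my password", 0.05, 0.5))
    return fast, slow
```

Note what this does and does not buy you. The request path stays bounded, but every timeout is a user who got the fallback instead of the feature, which is the "some of the time you will fail" part.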

Then cost. Per call it looks like rounding error. Multiply it by every user, every retry, every time someone refreshes, and the long prompts you will inevitably stuff with context to make the answers any good. I have run infrastructure where a feature that cost a fraction of a cent per call still added up to a number that made finance walk over to my desk. Token pricing is a usage tax that scales with your success. The more people love the feature, the more it bleeds. Most teams pricing this out in their heads are off by an order of magnitude, in the wrong direction.
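To make that multiplication concrete, here is a rough sketch of the arithmetic. Every number in it is a placeholder made up for illustration: the user count, calls per day, token counts, retry rate, and the price per 1,000 tokens. None of them is a quote from any provider. Swap in your own.

```python
def monthly_cost(
    users: int,
    calls_per_user_per_day: float,
    prompt_tokens: int,
    completion_tokens: int,
    price_per_1k_tokens: float,
    retry_rate: float = 0.0,
    days: int = 30,
) -> float:
    """Estimate monthly spend in dollars for a per-token-priced model call.

    retry_rate is the assumed fraction of calls that are repeated
    (retries, refreshes), so 0.2 means 20% extra calls.
    """
    calls = users * calls_per_user_per_day * days * (1 + retry_rate)
    cost_per_call = (prompt_tokens + completion_tokens) / 1000 * price_per_1k_tokens
    return calls * cost_per_call


# Back-of-envelope version: short prompt, no retries (all values hypothetical).
naive = monthly_cost(
    users=100_000,
    calls_per_user_per_day=5,
    prompt_tokens=200,
    completion_tokens=200,
    price_per_1k_tokens=0.002,
)

# Same feature once the prompt is stuffed with context and 20% of calls repeat.
realistic = monthly_cost(
    users=100_000,
    calls_per_user_per_day=5,
    prompt_tokens=3_000,
    completion_tokens=400,
    price_per_1k_tokens=0.002,
    retry_rate=0.2,
)

print(f"naive: ${naive:,.0f}/month, realistic: ${realistic:,.0f}/month")
```

With these made-up inputs, each call still costs a fraction of a cent in both versions, yet the second estimate comes out roughly ten times the first. The price did not change. The prompt length and the retries did.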

Now the two that actually scare me.

The first is nondeterminism. Ask the same question twice, get two different answers. We have spent our whole careers building systems where the same input gives the same output, and our tests, our debugging, our entire idea of “correct” rests on that. An LLM breaks the assumption underneath all of it. A bug you cannot reproduce is not a bug you can fix. It is a thing that happens sometimes, to some users, and you find out from a screenshot on Twitter.

The second is that it makes things up, fluently. (We have started calling this hallucination, which is a generous word for “confidently wrong.”) It does not know that it does not know. It will invent a policy, a price, a citation, a refund rule, in the same calm tone it uses for the true ones. In a toy it is funny. In a regulated product, attached to your brand, it is a liability with a logo on it.

Here is the part nobody tells you. We do not yet have a clean way to measure whether any of this is working. With a normal model you have a test set and a number. With this, “is the answer good” is a judgment call, and right now most teams are making that call by reading a few outputs and going “yeah, feels right.” That is not an eval. That is a vibe. Until you can put a number on quality, you cannot tell whether a change helped, and you are shipping on faith.
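The smallest step from a vibe toward a number is a fixed set of questions with checks you can run on every change. Here is a minimal sketch, assuming crude substring checks as the pass criterion. The cases, the canned answers, and the two `variant_*` functions are all hypothetical stand-ins for real model calls, and substring matching is only a rough proxy for "is the answer good."

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    question: str
    must_contain: tuple[str, ...]           # phrases a correct answer includes
    must_not_contain: tuple[str, ...] = ()  # phrases that signal a wrong answer


def passes(case: Case, answer: str) -> bool:
    """Case-insensitive substring check: all required phrases, no forbidden ones."""
    text = answer.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_forbidden = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_forbidden


def pass_rate(generate: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose generated answer passes its checks."""
    return sum(passes(c, generate(c.question)) for c in cases) / len(cases)


# Hypothetical cases for an imaginary store's support assistant.
CASES = [
    Case("What is the refund window?", ("30 days",), ("60 days",)),
    Case("Do you ship to Canada?", ("yes",)),
    Case("Is there a student discount?", ("do not offer",)),
]

# Canned answers standing in for two prompt versions of the same feature.
_ANSWERS_A = {
    "What is the refund window?": "Refunds are accepted within 60 days.",
    "Do you ship to Canada?": "Yes, we ship to Canada.",
    "Is there a student discount?": "We do not offer a student discount.",
}
_ANSWERS_B = {**_ANSWERS_A, "What is the refund window?": "Refunds are accepted within 30 days of delivery."}


def variant_a(question: str) -> str:
    return _ANSWERS_A[question]


def variant_b(question: str) -> str:
    return _ANSWERS_B[question]


print(f"A: {pass_rate(variant_a, CASES):.2f}  B: {pass_rate(variant_b, CASES):.2f}")
```

A harness this crude will miss plenty, and because the model's output varies between runs, a real version would want to sample each question more than once. But it turns "did the change help" into a comparison between two numbers instead of two impressions.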

None of this means do not touch it. It means be honest about where the failure lands.

My rule is simple. Put it where the failure mode is survivable. A first draft a human will edit anyway. An internal tool your own staff use and can sanity-check. A summary nobody bets money on. A suggestion the user accepts or ignores, not a decision the system makes for them. In those places a wrong answer costs a shrug, and you get the upside while the technology and your understanding of it both grow up. Wait for use cases like those. They are everywhere if you stop trying to bolt the model onto the most visible, least forgiving surface in the product.

And who should not wait? The few for whom this is the business, not a feature. If your entire product is a writing assistant, a coding helper, a research tool, then the model is not a risk you are adding, it is the thing you are selling, and you should be in this now, learning the sharp edges before someone else does. If you are a content or search company whose users are already drowning in material and would forgive a rough first version, the upside is worth the mess. For that handful, waiting is the risk.

For everyone else, the honest move is the boring one. Find the place where being wrong is cheap, ship there, and learn. The version of this that ages well is not the team that shipped it into checkout in January. It is the team that waited for the right surface and shipped something that held.