A team came to me last month wanting GPU budget to fine-tune a model on their support knowledge base. Forty thousand articles, updated daily, and they wanted the model to “know” all of it. I asked one question: what happens when an article changes? Silence. Then someone said, well, we would retrain. How often, I asked. Every night, they said, and I watched the room slowly realize they had just described a retrieval problem with a training bill attached.
This is the most common mistake I see, and it is expensive three times over: you pay for the training runs, you pay in stale answers, and you usually end up paying for the retrieval system anyway. People reach for fine-tuning because it sounds like the serious option. Retrieval feels like plumbing. Fine-tuning feels like machine learning. So the instinct, especially from people who came up doing ML the old way, is to assume the real answer involves gradients. It usually does not.
The framework is three questions, not a vibe
Most advice on this collapses into “it depends,” which is true and useless. Here is what it actually depends on. Three questions, asked in order, the first one settling most cases.
Is it knowledge or behavior? This is the whole ballgame. Fine-tuning changes how a model behaves. Retrieval changes what a model knows. If you want the model to have facts it does not currently have (your support articles, your product catalog, your internal policies), that is knowledge, and you retrieve knowledge. You do not bake it into weights. Fine-tuning teaches a model to act a certain way, adopt a tone, follow a structure. It is genuinely bad at teaching a model that the return policy changed on Tuesday. People conflate these constantly because both feel like “making the model better at our thing,” but only the behavior half is a job for fine-tuning.
How often does the knowledge change? Even on the knowledge side, the rate of change tells you how much you would regret freezing it into a model. A taxonomy that updates twice a year is one thing. A catalog that changes hourly is another. Anything that moves faster than your retraining cadence cannot be fine-tuned without being permanently stale, and almost everything interesting in a real business moves faster than you want to retrain.
Do you need provenance? Nobody asks this until an auditor or an angry customer forces it. If you have to show where an answer came from, cite the source, prove the model did not invent it, you need retrieval, because a fine-tuned model cannot tell you which training example produced a given sentence. The knowledge is smeared across billions of parameters with no return address. In regulated work, payments, healthcare, anything with a compliance officer, this question alone ends the debate. I have sat in those reviews. “The model just knows” is not an answer you can give a regulator.
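To make the “return address” point concrete, here is a minimal sketch of why retrieval can cite and weights cannot. Everything in it is an assumption for illustration: the `Article` record, the `kb-101` style IDs, the dates, and the word-overlap ranking, which stands in for a real vector search. The only thing it is meant to show is that the source IDs travel with the context.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    doc_id: str   # hypothetical knowledge-base ID
    updated: str  # date of last edit, made up for this example
    text: str


def tokens(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, articles: list[Article], k: int = 2) -> list[Article]:
    """Toy ranking by shared words; a real system would use vector search."""
    q = tokens(query)
    scored = [(len(q & tokens(a.text)), a) for a in articles]
    hits = [(score, a) for score, a in scored if score > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [a for _, a in hits[:k]]


def build_context(query: str, articles: list[Article]) -> dict:
    """Return the prompt context together with the sources it was built from."""
    sources = retrieve(query, articles)
    return {
        "context": "\n".join(f"[{a.doc_id}] {a.text}" for a in sources),
        "citations": [(a.doc_id, a.updated) for a in sources],
    }


kb = [
    Article("kb-101", "2024-05-07", "Return policy: items can be returned within 30 days."),
    Article("kb-202", "2024-01-15", "Standard shipping takes five business days."),
]

result = build_context("What is the return policy?", kb)
print(result["citations"])
```

When the auditor asks where the answer came from, `citations` is the reply. A fine-tuned model has no equivalent field to hand over.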
Run those three in order and you find that the thing most teams want to fine-tune is knowledge that changes often and that they will eventually need to cite. Three for three on retrieval.
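If it helps to see the three questions as a checklist, here is one way to write them down. This is a sketch of my reading of the framework, not a tool: the `UseCase` fields, the hour-based intervals, and the example values are all assumptions you would replace with your own.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    needs_new_facts: bool          # Q1: knowledge (True) or behavior (False)?
    change_interval_hours: float   # Q2: how often the underlying content changes
    retrain_interval_hours: float  # how often you could realistically retrain
    needs_citations: bool          # Q3: must answers point at a source?


def recommend(case: UseCase) -> tuple[str, list[str]]:
    """Ask the three questions in order and collect every reason to retrieve."""
    reasons = []
    if case.needs_new_facts:
        reasons.append("it is knowledge, not behavior")
        if case.change_interval_hours < case.retrain_interval_hours:
            reasons.append("it changes faster than you can retrain")
    if case.needs_citations:
        reasons.append("answers must cite a source")
    choice = "retrieve" if reasons else "fine-tune candidate"
    return choice, reasons


# Hypothetical cases: a support knowledge base edited daily with a weekly
# retraining budget, and a ticket router with a fixed label set.
support_kb = UseCase(True, change_interval_hours=24,
                     retrain_interval_hours=168, needs_citations=True)
ticket_router = UseCase(False, change_interval_hours=float("inf"),
                        retrain_interval_hours=168, needs_citations=False)

print(recommend(support_kb))
print(recommend(ticket_router))
```

Note that an empty list of reasons only makes a use case a candidate for fine-tuning. Whether it actually earns it is a separate question.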
Where fine-tuning actually earns it
Now the honest part, because if I only told you to retrieve I would be selling the same one-size answer I am complaining about. There are real cases where fine-tuning wins.
Format, style, and tone. The cleanest win. If you need every output to come back as a particular JSON shape, or in a specific house voice, or following a rigid template, that is behavior, and behavior is what fine-tuning is for. You can fight this with prompting and few-shot examples for a while, but past a certain consistency bar, a small fine-tuned model holds the format more reliably than a large prompted one, at a fraction of the token cost because you stop paying for the instructions on every call.
Narrow classification. If the job is to sort inputs into a fixed set of buckets (route this ticket, flag this transaction, label this document), a fine-tuned small model is often better and dramatically cheaper than a frontier model with a clever prompt. The task is closed. The labels do not change daily. There is nothing to retrieve, just a decision boundary you want the model to learn. This is fine-tuning doing exactly what it is good at.
Latency and cost on a fixed task. When you have one well-defined job running at volume, the economics flip. A retrieval call adds a query embedding, a vector search, and a fat context window on every request. A fine-tuned small model with the behavior baked in answers in one short hop. At low volume the difference is noise. At high volume on a task that does not change, that difference is the whole budget.
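The arithmetic behind that flip is simple enough to sketch. Every number below is invented for illustration: the token counts, the per-1k-token prices, and the request volumes are placeholders, not quotes from any provider. The sketch also covers cost only and ignores the latency of the extra retrieval hops.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Token cost of a single call in USD."""
    return (input_tokens / 1000 * usd_per_1k_in
            + output_tokens / 1000 * usd_per_1k_out)


def monthly_cost(requests_per_day: int, per_request_usd: float, days: int = 30) -> float:
    """Scale a per-request cost up to a month of traffic."""
    return requests_per_day * days * per_request_usd


# Assumed: a large prompted model carrying instructions plus retrieved chunks.
retrieval_call = cost_per_request(input_tokens=3000, output_tokens=50,
                                  usd_per_1k_in=0.003, usd_per_1k_out=0.015)

# Assumed: a small fine-tuned model that only sees the input itself.
finetuned_call = cost_per_request(input_tokens=200, output_tokens=50,
                                  usd_per_1k_in=0.0005, usd_per_1k_out=0.0015)

for volume in (1_000, 1_000_000):
    retrieval = monthly_cost(volume, retrieval_call)
    finetuned = monthly_cost(volume, finetuned_call)
    print(f"{volume:>9} req/day: retrieval ${retrieval:,.2f}/mo, "
          f"fine-tuned ${finetuned:,.2f}/mo, gap ${retrieval - finetuned:,.2f}/mo")
```

The ratio between the two paths is the same at every volume. What changes is whether the absolute gap is a rounding error or a line item, which is why this argument only works for a task that holds still long enough to run at scale.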
Here is the part nobody tells you. These cases share a property. They are all about behavior on a stable task, never about supplying fresh, citable knowledge. The moment your classification task needs to reference a document that changed yesterday, you are back in retrieval territory and the fine-tune was a detour.
Why teams fine-tune anyway
So why does the GPU-budget request keep landing on my desk? Because fine-tuning lets you avoid building good retrieval, and good retrieval is genuinely hard. Chunking, embedding, the eval loop, the freshness pipeline, the permission filtering, none of it glamorous and all of it fiddly. Fine-tuning feels like you can hand the problem to the training run and walk away. Throw the documents at the model, let it learn them, done.
It is not done. It is deferred, and it comes back worse. The model is stale the day after you train it. It hallucinates facts in the gap between training examples, confidently, because confident wrong answers are the one thing these models reliably produce. You cannot cite anything. And when the knowledge updates, you are back at the GPU queue retraining a model that will be stale again by the weekend. (The same team usually fine-tunes a second time, certain the first run just needed more epochs. It did not.)
I have watched smart teams spend a quarter learning this the slow way. The fine-tune that was supposed to save them retrieval work ends up next to a retrieval system they had to build anyway, because they finally needed citations or freshness. Now they maintain both.
The reframe that helps: fine-tuning and retrieval are not competitors. Retrieval handles what the model needs to know right now. Fine-tuning handles how the model should behave, persistently, on a task that holds still. The best systems I have built use both, retrieval for the knowledge, a light fine-tune for the format, each for the thing it is actually good at.
The mistake is not choosing fine-tuning. It is choosing it to dodge the retrieval work, then doing the retrieval work anyway, six months late, after the first answer with a fabricated citation goes out to a customer.
So before you ask for the GPUs, ask the three questions. Knowledge or behavior. How fast does it change. Do you need to prove where it came from. If you still land on fine-tuning, good, you have a real reason. Most teams do not get there, and the ones who skip the questions are the ones who pay twice.