Small Fine-Tuned Models Are Beating Frontier on My Workloads

We were spending more on one classifier than on the rest of the platform’s inference combined. It was a transaction categorizer, the least glamorous thing we run: it sorts line items into a fixed set of buckets, millions of calls a day. Every single one of those calls was hitting a frontier model over the network, because that is what the prototype did and nobody had gone back to fix it.

The task did not need a frontier model. It never did. It needed a model that knew our forty-odd categories cold and answered in single-digit milliseconds. So we built one. A small open-weight model, fine-tuned on examples the frontier model itself generated, and it is now beating that frontier model on this workload: same accuracy on our eval set, a fraction of the cost, and latency low enough that I stopped getting paged about it.

This post is about where that trade works, where it absolutely does not, and the pipeline in between, including the part of the bill that does not show up until month three.

Where the small model wins

The pattern holds in one specific shape: a narrow task, a fixed output space, and high volume. Three things at once. Drop any one of them and the math changes.

Classification is the cleanest case. So is structured extraction, pulling the same fifteen fields off a document type you see a thousand times a day. So is narration in a fixed house style, turning a row of numbers into a sentence that always reads the same way. These tasks have a small, knowable output distribution. You can write down what “right” looks like, which means you can fine-tune toward it and measure when you get there.

Here is the part nobody tells you. The reason fine-tuning works on these tasks is not that the small model gets smarter. It is that the task is narrower than the model’s general capability, and you are paying frontier prices for capability you throw away on every call. A frontier model categorizing a coffee purchase is a concert pianist playing “Chopsticks.” It will nail it. You are just renting the wrong instrument.

Where it does not work: anything that needs broad reasoning, open-ended knowledge, or a long tail you cannot enumerate. The moment a task starts requiring the model to know things you did not put in front of it, or to reason across a problem you have not seen the shape of, the small model falls off a cliff and the frontier model is worth every cent. I have watched a team try to fine-tune a small model into a general support agent. It is a graveyard. The output space is the whole of human conversation, the eval bar is “vibes,” and you spend six months building something a frontier model does on day one.

So the question I ask first, before any of the pipeline below: can I write the eval? If I can write down what correct output looks like for this task, it is probably a fine-tune candidate. If “correct” is a feeling, it is not.

The pipeline: distill, fine-tune, hold a bar

The whole thing is three moves. Use the frontier model to build training data. Fine-tune the small model on it. Hold an eval bar so you know when you are done and, more importantly, when you have regressed.

The distillation step is where most of the value is, and it is almost embarrassingly direct. You already have a frontier model doing the task in production. So you log its inputs and outputs, clean them, and that becomes your training set. The expensive model teaches the cheap one, once, and then you stop paying the expensive one.

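# Build a fine-tune set by distilling our prod frontier model.
# Real rule we learned the hard way: do NOT trust the teacher blindly.
# A frontier model is right ~95% of the time on this task, which means
# ~1 in 20 of your training labels is wrong unless you filter.

# Illustrative placeholders so the snippet runs on its own.
VALID_CATEGORIES = {"groceries", "fuel", "dining"}   # stand-in for the forty-odd real buckets
CATEGORIZER_PROMPT = "Classify the line item into exactly one category."

def dedupe_on_input(rows):
    # Crude stand-in: drop rows whose user text matches after case/whitespace folding.
    seen, kept = set(), []
    for row in rows:
        key = " ".join(row["messages"][1]["content"].lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

def build_training_rows(logged_calls, min_confidence=0.85):
    rows = []
    for call in logged_calls:
        label = call.frontier_output          # the "teacher" answer
        if label.category not in VALID_CATEGORIES:
            continue                           # teacher hallucinated a bucket
        if label.confidence < min_confidence:  # teacher itself was unsure
            continue                           # send these to human review instead
        rows.append({
            "messages": [
                {"role": "system", "content": CATEGORIZER_PROMPT},
                {"role": "user", "content": call.line_item_text},
                {"role": "assistant", "content": label.category},
            ]
        })
    return dedupe_on_input(rows)               # near-dupes inflate your eval lift and lie to you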

That confidence filter and the dedupe are not optional polish. The first version of this skipped both. We fine-tuned a model that was confidently wrong on exactly the cases the teacher had gotten wrong, and we got an eval score that looked great because half the test set had been duplicated into the training set. (We caught it. It was not a fun week.)
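The post does not show how the duplicated test examples were caught. One way to guard against both problems is sketched below, assuming chat-format rows like the ones above. The names `normalize`, `dedupe_near_duplicates`, and `drop_gold_overlap` are hypothetical, and collapsing digit runs is only one guess at what a near-duplicate looks like for merchant strings.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, collapse digit runs to a single 0, turn punctuation into spaces."""
    text = text.lower()
    text = re.sub(r"\d+", "0", text)          # "#1234" and "#9876" now collide
    text = re.sub(r"[^a-z0-9]+", " ", text)   # punctuation and whitespace runs
    return text.strip()


def user_text(row: dict) -> str:
    """Pull the user turn out of a chat-format training row."""
    return next(m["content"] for m in row["messages"] if m["role"] == "user")


def dedupe_near_duplicates(rows: list[dict]) -> list[dict]:
    """Keep only the first row seen for each normalized input."""
    seen, kept = set(), []
    for row in rows:
        key = normalize(user_text(row))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def drop_gold_overlap(rows: list[dict], gold_texts: list[str]) -> list[dict]:
    """Remove training rows whose normalized input also appears in the gold set."""
    gold_keys = {normalize(t) for t in gold_texts}
    return [r for r in rows if normalize(user_text(r)) not in gold_keys]
```

Running `drop_gold_overlap` as the last step before training means a leaked example fails loudly as a smaller training set, not quietly as an inflated eval score. The normalization is deliberately aggressive; if amounts or store numbers carry signal for a category, it would need to be loosened.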

Then you fine-tune. In 2025 this is the boring part, which is a good thing. A few thousand to a few tens of thousands of clean examples, a parameter-efficient fine-tune so you are training adapters and not the whole model, and a small open-weight base. The training run is cheap and short. If your training run is long and expensive, you are probably either using too big a base or trying to teach the model knowledge it should be retrieving, not memorizing.
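To make "training adapters and not the whole model" concrete, here is a back-of-envelope count. The post does not name the adapter method; this sketch assumes LoRA-style low-rank adapters, where a rank-r adapter on a weight matrix of shape d_out by d_in adds r * (d_in + d_out) trainable weights. The model dimensions below are hypothetical, not the ones used on this workload.

```python
def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r adapter adds two small matrices: (d_out x r) and (r x d_in)."""
    return rank * (d_in + d_out)


def adapter_param_count(d_model: int, n_layers: int,
                        matrices_per_layer: int, rank: int) -> int:
    """Total trainable weights when adapting square d_model x d_model projections."""
    per_matrix = lora_params_per_matrix(d_model, d_model, rank)
    return n_layers * matrices_per_layer * per_matrix


if __name__ == "__main__":
    # Hypothetical base model: 32 layers, hidden size 4096, ~7B weights total,
    # with adapters on two projection matrices per layer at rank 8.
    trainable = adapter_param_count(d_model=4096, n_layers=32,
                                    matrices_per_layer=2, rank=8)
    base = 7_000_000_000
    print(f"{trainable:,} trainable weights, {trainable / base:.4%} of the base")
```

Under those assumed dimensions the adapters come to about 4.2 million weights, roughly 0.06% of the base model, which is why the run is short and cheap.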

The part that actually decides whether you ship is the eval. Not the training. The eval.

# The bar. A change ships only if it clears the frontier baseline
# on OUR data, not on a public benchmark we don't run on.

from collections import defaultdict

def evaluate(model, gold_set):
    correct, frontier_correct = 0, 0
    by_category = defaultdict(lambda: [0, 0])  # [right, total] per bucket
    for ex in gold_set:
        pred = model.classify(ex.text)
        if pred == ex.label:
            correct += 1
            by_category[ex.label][0] += 1
        by_category[ex.label][1] += 1
        if ex.frontier_label == ex.label:      # baseline, scored once, frozen
            frontier_correct += 1

    # Aggregate accuracy hides the failure that gets you fired:
    # a small category that the fine-tune quietly forgot.
    worst = min(by_category.items(), key=lambda kv: kv[1][0] / max(kv[1][1], 1))
    return {
        "accuracy": correct / len(gold_set),
        "frontier_accuracy": frontier_correct / len(gold_set),
        "worst_category": worst[0],
        "worst_category_accuracy": worst[1][0] / max(worst[1][1], 1),
    }

That per-category breakdown is the line I care about most. Aggregate accuracy will tell you the model is fine while it has silently gone blind to your three rarest categories, which are usually the high-value ones (fraud-adjacent buckets, in our case). A fine-tune that improves the average by getting better at the common case and worse at the rare one is a regression dressed as a win.

The gold set is a few hundred hand-checked examples that never touch training. Frozen. Every candidate model runs against it, and the frontier baseline runs against it once and stays pinned as the line to beat. If the small model does not clear that line on our data, it does not ship, full stop. We are not chasing a leaderboard. We are beating one specific model on one specific job.
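The ship rule can be written down as a gate over the report that `evaluate` returns. This is a minimal sketch, not the actual release check: `should_ship` is a hypothetical name, matching the baseline counts as clearing it (the post ships at the same accuracy), and the per-category floor of 0.90 is an assumed value that the post does not state.

```python
def should_ship(report: dict, category_floor: float = 0.90) -> tuple[bool, list[str]]:
    """Return (ok, reasons). ok is True only if every check passes.

    `report` is expected to carry the keys produced by evaluate():
    accuracy, frontier_accuracy, worst_category, worst_category_accuracy.
    """
    reasons = []

    # Check 1: the candidate must at least match the frozen frontier baseline.
    if report["accuracy"] < report["frontier_accuracy"]:
        reasons.append(
            f"accuracy {report['accuracy']:.3f} is below the frontier baseline "
            f"{report['frontier_accuracy']:.3f}"
        )

    # Check 2: no single category may fall under the floor (assumed threshold).
    if report["worst_category_accuracy"] < category_floor:
        reasons.append(
            f"category {report['worst_category']!r} is at "
            f"{report['worst_category_accuracy']:.3f}, under the floor {category_floor:.2f}"
        )

    return (not reasons, reasons)
```

Returning the reasons alongside the verdict keeps a blocked release debuggable: the failing category is named in the output instead of being buried in an aggregate.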

The maintenance bill

Now the honest part, because every pitch for fine-tuning skips it.

When you call a frontier API, somebody else owns the model. They retrain it, they fix it, they keep it current, and you get the upgrades for free. The moment you fine-tune your own, you own a model. You own it the way you own a service: it can rot, and it will, just silently.

The rot is drift. Your input distribution moves. New merchant types appear, new document formats, new phrasings, a whole category of transactions that did not exist when you built the training set. The frontier model would shrug and handle it because it generalizes. Your narrow little specialist has never seen it and guesses, confidently, wrong. Nothing crashes. Accuracy just bleeds out a basis point at a time until someone downstream notices the numbers are off.
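Because nothing crashes, drift has to be looked for. One cheap signal, sketched here under assumptions, is to compare the mix of predicted categories in a reference window against a recent live window using total variation distance. The function names and the 0.10 alert threshold are hypothetical, and this is a proxy only: it sees shifts in what the model outputs, not whether those outputs are right.

```python
from collections import Counter


def category_shares(predictions: list[str]) -> dict[str, float]:
    """Turn a list of predicted labels into a {label: share} distribution."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(ref: dict[str, float], live: dict[str, float]) -> float:
    """Half the L1 distance between two distributions: 0 = identical, 1 = disjoint."""
    labels = set(ref) | set(live)
    return 0.5 * sum(abs(ref.get(l, 0.0) - live.get(l, 0.0)) for l in labels)


def drift_alert(reference_preds: list[str], live_preds: list[str],
                threshold: float = 0.10) -> tuple[float, bool]:
    """Return (distance, alert) for a live window against a reference window."""
    distance = total_variation(category_shares(reference_preds),
                               category_shares(live_preds))
    return distance, distance > threshold
```

An alert here is a prompt to sample the live window and send it back through the teacher or a human reviewer, not a verdict on accuracy. A new transaction type that the model confidently files under an existing bucket in the usual proportions would slip past this check.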

So owning the model means owning the loop that keeps it honest: production monitoring on the live distribution, a steady trickle of fresh examples back through the teacher, periodic re-evaluation against a gold set you also have to keep current, and a retraining cadence. That is real engineering time, forever, and it is the cost you weigh against the inference savings. For a low-volume task it never pays back. The whole case rests on volume: at millions of calls a day the inference savings dwarf the maintenance, and at a few thousand a day they do not come close.
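The volume argument is plain arithmetic, and it is worth running before committing to the work. Every number in this sketch is a made-up placeholder, not a figure from this workload: substitute your own per-call prices and your own estimate of monthly maintenance cost.

```python
def monthly_net_savings(calls_per_day: float, frontier_cost_per_call: float,
                        small_cost_per_call: float, maintenance_per_month: float,
                        days: int = 30) -> float:
    """Inference savings for the month minus the cost of keeping the model healthy."""
    saved_per_call = frontier_cost_per_call - small_cost_per_call
    return calls_per_day * days * saved_per_call - maintenance_per_month


def break_even_calls_per_day(frontier_cost_per_call: float, small_cost_per_call: float,
                             maintenance_per_month: float, days: int = 30) -> float:
    """Daily volume at which inference savings exactly cover maintenance."""
    saved_per_call = frontier_cost_per_call - small_cost_per_call
    if saved_per_call <= 0:
        raise ValueError("the small model must be cheaper per call to ever break even")
    return maintenance_per_month / (days * saved_per_call)


if __name__ == "__main__":
    # Placeholder inputs: $0.002 vs $0.0001 per call, $15,000/month of engineering time.
    prices = dict(frontier_cost_per_call=0.002, small_cost_per_call=0.0001,
                  maintenance_per_month=15_000)
    print(f"break-even: {break_even_calls_per_day(**prices):,.0f} calls/day")
    for volume in (5_000, 5_000_000):
        print(f"{volume:>9,} calls/day -> {monthly_net_savings(volume, **prices):>12,.0f} $/month")
```

With those placeholder inputs, break-even lands around 263,000 calls a day: a few thousand calls a day loses nearly the whole maintenance budget every month, while millions of calls a day clears it many times over.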

I would still make the same call on this workload tomorrow. But I make it now with the maintenance line written into the proposal, next to the savings, in the same font. The teams that get burned are the ones who saw the cost-per-call drop, declared victory, and walked away from a model that needed feeding.

The small model is not winning because it is better. It is winning because the task was always smaller than the tool, and we finally bought a tool the right size. Do you actually know which of your workloads are that shape?