Llama 2 Is Here. Should You Self-Host?

The license is the news, not the weights. Llama 2 came out a few weeks ago with terms that let a company actually run it in production, and within days three different teams asked me the same question: should we stop paying OpenAI by the token and host our own model instead. It is a fair question. It also has a wrong answer most of the time, and the spreadsheet people build to decide it is usually missing the line that matters.

I have run inference both ways. On a marketplace I led we served recommendation and ranking models on our own fleet because the volume justified owning the silicon. We also paid hosted vendors for things that ran a few thousand times a day, because owning a GPU to do that would have been vanity. The decision is not philosophical. It is arithmetic, and then it is the part of the arithmetic that does not fit in a cell.

The naive spreadsheet

Here is the calculation everyone does first. Hosted models bill per token. Self-hosting means you rent or buy a GPU, run the model on it, and the cost is fixed per hour whether you send one request or a million. So you find the crossover.

# rough back-of-envelope, August 2023 numbers, rounded hard on purpose

# hosted: you pay per 1k tokens, in + out, no idle cost
hosted_price_per_1k = 0.002          # cheap hosted tier, blended in+out
tokens_per_request  = 1200           # prompt + completion, a real RAG-ish call

hosted_per_request  = (tokens_per_request / 1000) * hosted_price_per_1k

# self-host: one cloud GPU big enough for a 13B model, on demand
gpu_dollars_per_hour = 1.20          # a single mid-tier accelerator, on-demand
gpu_seconds_per_hour = 3600

# throughput is the number people guess and get wrong.
# a 13B model on one GPU, batched, lands you SOME requests per second.
# call it this, and notice how much the whole decision rides on it:
requests_per_second  = 4.0

self_host_per_request = gpu_dollars_per_hour / (gpu_seconds_per_hour * requests_per_second)

print(f"hosted    : ${hosted_per_request:.5f} / request")
print(f"self-host : ${self_host_per_request:.5f} / request")

# break-even: how many requests per hour before owning the GPU is cheaper
breakeven_rph = gpu_dollars_per_hour / hosted_per_request
print(f"break-even: ~{breakeven_rph:,.0f} requests / hour to beat hosted")

Run that and self-hosting looks like a steal. Per request it is a fraction of a cent, and the break-even comes out to a few thousand requests an hour. If your product does that volume, the math says host. Sign the GPU order.

Do not sign the GPU order yet.

The number the spreadsheet guesses

Look at requests_per_second. I set it to four. I have no idea what yours is, and neither do you until you have load-tested your actual prompts on your actual model on your actual GPU. That one variable swings the entire decision, and it is the one everybody fills in with hope.

Here is the part nobody tells you. Real LLM traffic is bursty and the prompts are uneven. A 200-token request and a 2,000-token request do not cost the same, and they do not batch the same. Your throughput at peak, when the queue is full and latency matters, is a different and worse number than the throughput you measured at 2am with one request in flight. The hosted vendor eats that variance for you. They run a fleet large enough that your burst is noise on their floor. When you self-host, your burst is your problem, and the way you solve it is by buying a GPU you only need for two hours a day and paying for it for twenty-four.

That last point is the one that wrecks the back-of-envelope. Utilization. The spreadsheet assumes the GPU runs flat out. Your traffic has a shape, mornings and evenings and dead nights, and a single GPU sized for your peak sits mostly idle the rest of the day. Idle silicon still bills.

# the honest version: pay for the GPU all hours, serve at real utilization

hours_per_day        = 24
gpu_cost_per_day     = gpu_dollars_per_hour * hours_per_day

# you size for peak but average utilization is much lower.
# 30% is generous for a single-model, single-GPU setup with bursty traffic.
avg_utilization      = 0.30
effective_rps        = requests_per_second * avg_utilization
requests_per_day     = effective_rps * 3600 * hours_per_day

self_host_real_cost  = gpu_cost_per_day / requests_per_day
hosted_real_cost     = hosted_per_request

print(f"self-host (real util): ${self_host_real_cost:.5f} / request")
print(f"hosted               : ${hosted_real_cost:.5f} / request")
# the gap closes fast, and that is before a single human has been paid

Same GPU, same model, one honest assumption changed, and the cliff-edge advantage shrinks to a slope. At low utilization, hosted can be cheaper than your own hardware even though your per-token cost on paper is lower. (I have watched a team discover this in month three, after the hardware was bought.)

The line that never makes it into the cell

Now the cost that does not appear in any spreadsheet, because it is a salary, not a unit price. Someone has to run this thing.

When the hosted API has an incident, you read their status page and wait. It is annoying and it is not your pager. When your self-hosted model falls over at midnight because a long prompt OOMed the GPU, or the driver wedged after a kernel update, or the batching server deadlocked under load, that is your engineer awake at midnight. Self-hosting an LLM is not a capex decision. It is a decision to staff an on-call rotation for a new production system that touches every request your product makes.

Price that honestly and the math moves again:

# the tax nobody puts in the model

# you need real coverage. one person is not a rotation, it is a hostage.
# call it a slice of a few engineers' time to keep one model fleet healthy.
ops_loaded_cost_per_month = 12000     # a fraction of headcount, loaded, rounded
days_per_month            = 30
ops_cost_per_day          = ops_loaded_cost_per_month / days_per_month

total_self_host_per_day   = gpu_cost_per_day + ops_cost_per_day
all_in_per_request        = total_self_host_per_day / requests_per_day

print(f"GPU only      : ${gpu_cost_per_day:.2f} / day")
print(f"GPU + ops      : ${total_self_host_per_day:.2f} / day")
print(f"all-in cost    : ${all_in_per_request:.5f} / request")
# now compare THAT to hosted. the volume you need to win just went way up.

The ops line does not scale with traffic, which cuts both ways. At enormous volume it disappears into the per-request cost and self-hosting wins clean. At modest volume it dominates, and you are paying a few engineers to save less than the hosted bill would have been. The break-even did not move a little when I added the human. It moved by an order of magnitude.

And the model is not the same model

I have been pretending the only difference is price. It is not. Llama 2 is a genuinely good open model, the best open weights anyone has had to work with. It is not GPT-4. On hard reasoning, on long instructions, on the messy edge cases that show up in production but never in the demo, the gap is real and you will feel it. For some workloads that gap is fine. Classification, extraction, summarizing structured text, the narrow well-scoped jobs, Llama 2 holds its own and the privacy of keeping data on your own hardware is worth real money on its own. For open-ended reasoning where quality is the product, you are trading down, and you should know that you are doing it on purpose.

There is also the scarcity tax. Right now, the week I am writing this, the good accelerators are hard to get at any sane price. On-demand capacity blinks in and out, reserved capacity wants a year of commitment, and the queue for the newest chips is measured in quarters. Owning your inference means owning that procurement problem too.

The rule I actually use

So here is the decision, and it is not “self-host because open weights are exciting.”

Self-host when three things are true at once: your volume is high enough that the per-request math wins even after you price the on-call rotation honestly, your workload is narrow enough that Llama 2’s quality clears the bar, and you have a real reason to keep the data in-house. Two of three is not enough. High volume on a task that needs GPT-4’s reasoning means you self-host and ship a worse product to save money, which is a way to lose slowly. A privacy requirement at low volume means you pay a small fortune in idle GPUs and pager duty to avoid a hosted contract you could have negotiated a data-handling clause into.

Pay the hosted API until your own numbers, measured not guessed, tell you to stop. The week a model drops is the worst week to make this call, because the excitement is loudest and the data is thinnest. Build on the API, instrument every call, and let your real token volume and your real latency requirements decide it for you a quarter from now.

The weights being free does not make the GPUs free, and it definitely does not make the 2am page free. Whose pager is it going to be?