The LLM Observability Stack I Wish I'd Built Sooner

The incident lasted two days and the dashboards stayed green the entire time.

Our APM was perfect. CPU flat, memory flat, p99 latency on the API a healthy couple of hundred milliseconds, error rate at zero. Meanwhile an agent in a financial-ops product was quietly handing users the wrong account totals, because a tool call upstream had started returning stale data and the model wrapped that staleness in a fluent, confident paragraph. No exception. No 500. Nothing a load balancer or a tracing span on the HTTP layer would ever flag. The system was healthy and wrong at the same time, and traditional monitoring has no word for that.

I built the observability stack I’m about to describe after that incident. Every line of it exists because something broke and I had no way to see why. Build yours before the incident, not after.

Why APM is necessary but not sufficient

Application performance monitoring answers “is the service up and fast.” That question still matters for an LLM app. You still have a web tier, a queue, a database, and all the ordinary ways those fall over. Keep your APM.

But APM treats the model call as one opaque box that takes some time and returns some bytes. It cannot tell you that the prompt was missing a retrieved document, that the model spent four hundred tokens apologizing before answering, that a tool returned an error the agent decided to ignore, or that the answer was confidently false. The interesting failures in these systems are semantic, not operational. A 200 with a wrong answer is the default failure mode now, and your green dashboard is blind to it.

Here is the part nobody tells you. The thing you most need to capture is the full request and response, verbatim, including every intermediate tool call and the arguments the model chose. Not a sample. Not a redacted summary. The whole conversation, because when something goes wrong the only way to understand it is to read exactly what the model saw and exactly what it did. Most teams skip this for cost or privacy reasons and then spend an incident trying to reconstruct a trace from log fragments. (I did, for two days.)

The five things to instrument

There are five signals. Miss any one and you get a class of incident you cannot diagnose.

Full request and response, with tool calls. The verbatim prompt, the system message, the retrieved context, the model’s output, and every tool the agent invoked along the way with its inputs and outputs. This is the trace. It’s the difference between “the agent gave a wrong number” and “the agent called get_balance which returned data from a cache that was nine hours stale.”

Tokens and cost per request. Input tokens, output tokens, and the dollar figure, attached to the trace and aggregatable by feature, user, and tenant. Without this, agent cost is a single line on a monthly bill that you cannot attribute to anything. A looping agent can quietly cost ten times a chat turn, and you want to know which feature is doing it before finance does.

Latency, broken down by stage. Not the end-to-end number. The breakdown: retrieval, each model call, each tool call, any reranking or post-processing. A slow agent is usually slow in one specific place, and the aggregate hides it.

Eval scores, in production. This is the one people treat as a pre-deploy gate and then forget. You want graded judgments on a sample of live traffic, continuously. Faithfulness to the retrieved context, format correctness, refusal rate, whatever quality means for your task. Offline evals tell you the system was good last Tuesday on your test set. Production evals tell you it’s good right now on real inputs, which drift.

Captured failures as new eval cases. Every real failure is a gift. When a user thumbs-down an answer, or a guardrail trips, or you find a bad trace, that exact input becomes a permanent eval case. This is the loop that compounds. Your eval set grows from real failures instead of your imagination, and a regression you fixed once can never silently come back.
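
As a rough illustration of that last loop, the sketch below appends a failing input to a JSONL eval set and skips inputs it has already captured. The file layout, the field names, and promote_to_eval_case itself are assumptions for illustration, not part of the stack described here.

```python
import hashlib
import json
from pathlib import Path


def promote_to_eval_case(path: Path, prompt: str, bad_output: str, reason: str) -> bool:
    """Append a failing input to a JSONL eval set. Returns False if it is already there."""
    # Key on the prompt so the same failure reported twice yields one case.
    case_id = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]
    if path.exists():
        with path.open(encoding="utf-8") as f:
            if any(json.loads(line)["id"] == case_id for line in f if line.strip()):
                return False
    case = {"id": case_id, "prompt": prompt, "bad_output": bad_output, "reason": reason}
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
    return True
```

The reason field (thumbs-down, guardrail, manual review) is there so you can later see which capture channel is actually feeding the eval set.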

The shape of it

The model and tools sit in the request path. The observability sinks sit alongside, fed from a trace context that wraps the whole turn. Nothing in the sink path is allowed to slow down or break the user’s request.

[Figure: a request flowing through app, model, and tools, with trace, eval, and cost sinks fed from a trace context alongside the path.]

The request path is synchronous and on the critical path. The trace, eval, and cost sinks hang off it and run async, so observability never becomes the thing that takes the system down.

The important design choice is that line between the path and the sinks. Emitting a trace, scoring it, and recording its cost must never block the response or fail the request. You buffer and ship asynchronously. If your observability backend has an outage, users should never notice. I have seen a logging call inside a request path take a service down, and an observability layer that can take down the thing it’s observing is worse than none.
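
A minimal sketch of that boundary, assuming an in-process bounded queue drained by a background thread. AsyncSink and its send callable are hypothetical names; a production version would also batch, retry, and export the drop counter as a metric.

```python
import queue
import threading


class AsyncSink:
    """Bounded buffer plus a background thread, so the request path never waits."""

    def __init__(self, send, maxsize=10_000):
        self._send = send                      # hypothetical: callable that ships one record
        self._q = queue.Queue(maxsize=maxsize)
        self.dropped = 0                       # approximate count; not lock-protected
        threading.Thread(target=self._run, daemon=True).start()

    def enqueue(self, record) -> bool:
        try:
            self._q.put_nowait(record)         # returns immediately, even when the buffer is full
            return True
        except queue.Full:
            self.dropped += 1                  # shed telemetry rather than slow the user
            return False

    def _run(self):
        while True:
            record = self._q.get()
            try:
                self._send(record)
            except Exception:
                pass                           # a backend outage must not kill the worker
            finally:
                self._q.task_done()

    def flush(self):
        self._q.join()                         # blocks; for shutdown or tests only
```

The trade-off is explicit: when the backend is slow or down, the buffer fills and new records are dropped. Losing some telemetry during an outage is the price of never letting the sink stall a user request.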

What the instrumentation actually looks like

I wrap the whole turn in one trace context, then let each step attach a span. The wrapper owns timing, token math, and shipping. Real code, lightly trimmed.

import time
from contextlib import contextmanager
from dataclasses import dataclass, field

# price(), sink, and Trace.to_record() are among the trimmed parts.

@dataclass
class Span:
    name: str
    started: float
    ended: float | None = None
    tokens_in: int = 0
    tokens_out: int = 0
    meta: dict = field(default_factory=dict)

class Trace:
    def __init__(self, request_id, user_id, feature):
        self.request_id = request_id
        self.user_id = user_id
        self.feature = feature
        self.spans: list[Span] = []
        self.t0 = time.monotonic()

    @contextmanager
    def step(self, name, **meta):
        s = Span(name=name, started=time.monotonic(), meta=meta)
        try:
            yield s
        finally:
            # runs even if the step raises, so failed steps still get a span
            s.ended = time.monotonic()
            self.spans.append(s)

    def cost_usd(self):
        # priced per-model; the rates live in one table, not scattered
        # across the codebase where they rot the day a price changes
        return sum(price(s.meta.get("model"), s.tokens_in, s.tokens_out)
                   for s in self.spans)

    def ship(self):
        # fire-and-forget onto a queue. NEVER await this in the request
        # path. if the sink is down the user must not feel it.
        sink.enqueue(self.to_record())
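
The price lookup that cost_usd leans on is not shown in the excerpt. A minimal sketch of what such a table could look like, assuming rates quoted per million tokens. The model names and dollar figures here are placeholders, not real prices.

```python
# Hypothetical rates in USD per million tokens; substitute your provider's real prices.
RATES_PER_MTOK = {
    "primary": {"in": 3.00, "out": 15.00},
    "judge": {"in": 0.25, "out": 1.25},
}


def price(model, tokens_in, tokens_out):
    """Dollar cost of one call. Spans with no model (retrieval, tools) cost 0."""
    if model is None:
        return 0.0
    rate = RATES_PER_MTOK[model]   # KeyError on an unknown model is deliberate
    return (tokens_in * rate["in"] + tokens_out * rate["out"]) / 1_000_000
```

Failing loudly on an unknown model is a choice: a silent zero would make a newly added model look free until the bill arrives.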

Using the trace reads like the request it describes. Every model and tool call is a step, so latency falls out per stage for free, and the full prompt and response live in the span meta, where whoever is debugging an incident can find them.

def answer(trace: Trace, question: str):
    try:
        with trace.step("retrieve") as s:
            docs = retriever.search(question, k=6)
            s.meta["doc_ids"] = [d.id for d in docs]   # not the bodies; ids are enough to replay

        with trace.step("generate", model="primary") as s:
            prompt = build_prompt(question, docs)
            s.meta["prompt"] = prompt          # verbatim. the whole point. set before the call so a failed call still has it.
            out = llm.complete(prompt)
            s.tokens_in, s.tokens_out = out.usage.prompt, out.usage.completion
            s.meta["completion"] = out.text

        # grade a slice of live traffic. async, sampled, off the hot path.
        if sample(rate=0.05):
            eval_queue.enqueue(grade_faithfulness, trace.request_id, docs, out.text)

        return out.text
    finally:
        # ship on every exit, including exceptions. the failed turns are
        # the traces you most need to read.
        trace.ship()
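
Since every step records its own start and end, the per-stage breakdown is a small fold over the spans. A sketch, assuming span objects shaped like the Span dataclass (name, started, ended); stage_latency_ms is a hypothetical helper, not part of the trimmed code.

```python
from collections import defaultdict


def stage_latency_ms(spans):
    """Sum wall-clock milliseconds per step name for one trace."""
    totals = defaultdict(float)
    for s in spans:
        if s.ended is not None:          # skip spans that never closed
            totals[s.name] += (s.ended - s.started) * 1000.0
    return dict(totals)
```

Repeated step names are summed, so an agent that calls the same tool five times shows up as one large number for that tool, which is usually the thing you are hunting for.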

The eval that answer() enqueues runs in a separate consumer. It pulls a sampled trace, runs a judge (a cheaper model scoring faithfulness against the retrieved docs, or a deterministic check for format), and writes the score back keyed on the request ID. When a score comes back low, the trace is already sitting in the sink, fully reconstructable, with the prompt, the doc IDs, and any tool calls right there. And the bad ones get promoted into the eval set, which is the only part of this that makes tomorrow’s system better than today’s.
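
For the deterministic end of that spectrum, one check that fits a numbers-heavy product is to flag any number in the answer that appears in none of the retrieved documents. This is a sketch under stated assumptions, not the grader from the system described here: the function names are hypothetical, doc_texts assumes the consumer has already loaded the document bodies, and the scores dict stands in for whatever store is keyed on request ID. Matching is exact after stripping thousands separators, so "1204.5" and "1204.50" count as different.

```python
import re

NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _numbers(text: str) -> list[str]:
    # Normalize "1,204.50" to "1204.50" so separators do not cause false alarms.
    return [n.replace(",", "") for n in NUMBER.findall(text)]


def unsupported_numbers(answer: str, doc_texts: list[str]) -> list[str]:
    """Numbers stated in the answer that appear in none of the retrieved docs."""
    seen = set()
    for text in doc_texts:
        seen.update(_numbers(text))
    return [n for n in _numbers(answer) if n not in seen]


def grade_numeric_faithfulness(request_id, answer, doc_texts, scores) -> float:
    """Fraction of stated numbers found in the docs; written back by request ID."""
    stated = _numbers(answer)
    missing = unsupported_numbers(answer, doc_texts)
    score = 1.0 if not stated else 1.0 - len(missing) / len(stated)
    scores[request_id] = score
    return score
```

A check like this only tests the answer against what was retrieved. It says nothing about whether the retrieved data was itself stale, which is why the tool inputs and outputs still need to be in the trace.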

The regret, stated plainly

Everything above is a few days of work. It is not a research project and it is not a vendor you need to buy, though several good ones exist now. It is a trace context, three async sinks, and the discipline to capture the whole conversation instead of a tidy summary of it.

I built it the week after a two-day incident that a single readable trace would have closed in twenty minutes. The cost of building it before would have been a few days. The cost of building it after was the incident, plus the days, plus the trust we spent explaining to a customer why a system that looked perfectly healthy had been wrong the whole time.

What’s the failure you can’t currently see in your own system, and would you know it was happening?