Skip to content
WhySoGeek.
AI

AI Agent Observability in 2026: Tracing, OpenTelemetry, and What to Monitor

Agents fail silently in ways traditional APM never sees. Here is how tracing, OpenTelemetry GenAI conventions, and the 2026 tooling landscape fit together.

Sam Carter 8 min read
Cover image for AI Agent Observability in 2026: Tracing, OpenTelemetry, and What to Monitor
Photo: Ars Electronica / flickr (BY-NC-ND 2.0)

When a web service breaks, your APM tool tells you: latency spiked, the database timed out, here is the stack trace. When an AI agent breaks, none of that fires. The HTTP requests all return 200, the latency looks fine, and yet the agent looped three times, called the wrong tool, retrieved an irrelevant document, and confidently returned a wrong answer. Traditional observability is blind to all of it.

Quick answer

AI agent observability means instrumenting every step of an agent run (each LLM call, tool call, retrieval, and planning decision) as a span inside one trace, then scoring the output for quality and safety. The 2026 standard is OpenTelemetry's GenAI semantic conventions, which attach gen_ai.* attributes like model name and token counts to each span. Start with one OTEL-compatible tool such as Langfuse or MLflow, instrument every LLM and tool call first, and watch cost-per-task and loop counts before you add quality scoring.

Key takeaways

  • Agent observability tracks output quality, faithfulness, safety, and behavioral drift, not just latency and errors, which is what separates it from traditional APM.
  • OpenTelemetry GenAI semantic conventions are the emerging standard: spans carry gen_ai.* attributes like model name, token counts, and finish reason.
  • A single agent run becomes a trace tree of spans: each LLM call, tool invocation, retrieval step, and planning decision is a span you can inspect.
  • Tools cluster into open-source (Langfuse, MLflow, Langtrace) and commercial (Braintrust, Confident AI, AgentOps); MLflow auto-instruments 60+ frameworks.
  • The highest-value signals are cost per task, tool-call accuracy, retrieval relevance, and loop/retry counts, not raw token throughput.

Why agents need their own observability

A web request is mostly linear: in, process, out. An agent run is a tree. The model plans, calls a tool, reads the result, decides to call another tool, retrieves a document, reasons again, and maybe loops. Each of those steps can fail in a way that produces no error code. The retrieval can surface the wrong passage. The tool can return stale data the model trusts. The plan can spiral into a loop that burns tokens without progress.

Agent observability instruments every one of those steps as a span inside one trace. You get end-to-end visibility into the agent's actual decision path: which tools it called, in what order, what each returned, how many tokens each step cost, and where it went off the rails. Crucially, it also tracks things APM never measured, output quality, faithfulness to sources, safety, and drift, because for an agent those are the real failure modes.

A trace tree showing nested spans for LLM calls, tool calls, and retrieval steps
Photo: James St. John / flickr (BY 2.0)

OpenTelemetry is the standard now

The thing that made 2026 different is convergence on a standard. OpenTelemetry's GenAI semantic conventions define how to instrument LLM and agent calls: wrap each API call in a span and attach standardized gen_ai.* attributes, model name, input and output token counts, finish reason, and tool details. Frameworks like LangChain emit these natively, and MLflow auto-instruments more than 60 frameworks (OpenAI Agents SDK, LangGraph, LlamaIndex, CrewAI, Pydantic AI, Anthropic, Bedrock, Google ADK) over OpenTelemetry.

The practical payoff is portability. Instrument once with OTEL and you can send the same traces to whichever backend you choose, Langfuse, Braintrust, Uptrace, or a generic OTEL collector, instead of locking into one vendor's SDK.

# Minimal OpenTelemetry GenAI span around an LLM call
from opentelemetry import trace

tracer = trace.get_tracer("agent")

with tracer.start_as_current_span("llm.chat") as span:
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude")
    response = client.messages.create(...)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)

The tooling landscape in 2026

The market splits roughly into open-source and commercial, with heavy overlap because most commercial tools ingest OTEL.

  • Langfuse, open source, clean trace UI, prompt playground, basic LLM-as-judge scoring, and cost analytics. The common default for self-hosting.
  • MLflow, open source, strongest auto-instrumentation breadth (60+ frameworks).
  • Langtrace, OTEL-native tracing of token counts, duration, and cost.
  • Braintrust, commercial, strong OTEL support and evaluation tooling.
  • Confident AI, commercial, 10+ framework integrations and built-in evals.
  • AgentOps, commercial, agent-run-centric monitoring.

Here is how the main options compare when you are choosing where to send your traces:

ToolTypeBest forCost
LangfuseOpen sourceSelf-hosting with a clean trace UIFree self-hosted, paid cloud tier
MLflowOpen sourceBroadest auto-instrumentation (60+ frameworks)Free
LangtraceOpen sourceOTEL-native token, latency, and cost tracingFree self-hosted
BraintrustCommercialEvaluation workflows alongside tracingPaid, free tier available
Confident AICommercialBuilt-in evals and framework integrationsPaid, free tier available
AgentOpsCommercialAgent-run-centric dashboardsPaid, free tier available

Tip

Start with one OTEL-compatible tool and instrument the boring stuff first: every LLM call and every tool call as spans, with token counts. You can layer quality scoring on later. The trace tree alone solves most "why did the agent do that" mysteries.

What to actually monitor

Token throughput is a vanity metric. The signals that catch real problems are:

    1. Cost per task, total tokens and dollars for a complete agent run, not per call.
    2. Tool-call accuracy, did the agent call the right tool with the right arguments?
    3. Retrieval relevance, did the retrieved documents actually contain the answer?
    4. Loop and retry counts, runaway loops are the classic silent cost and latency killer.
    5. Output quality and faithfulness, scored with evals, often LLM-as-judge.

That last one connects observability to evaluation. Traces tell you what the agent did; evals tell you whether the result was good. The mature setup uses an LLM-as-a-judge evals harness to score faithfulness and relevance on the traces you capture, closing the loop between monitoring and quality. And if your agent's failures trace back to retrieval, the fix usually lives upstream in RAG chunking strategy rather than in the model itself.

Each signal maps to a specific failure mode, which is why infrastructure dashboards miss them:

SignalWhat it catchesWhy APM misses it
Cost per completed taskRunaway token spend across a full runAPM measures per-request latency, not per-task cost
Tool-call accuracyWrong tool or wrong argumentsTool calls return 200 even when the choice was wrong
Retrieval relevanceRAG surfacing the wrong passageRetrieval succeeds at the HTTP level regardless of relevance
Loop and retry countAgent spinning without progressEach loop iteration looks like a healthy request
Output faithfulnessConfident but wrong answersNo error code fires for a fluent hallucination

What to do right now

If you are standing up agent observability for the first time, work through this in order:

  • Pick one OTEL-compatible backend (Langfuse or MLflow if you want free and self-hosted).
  • Wrap every LLM call in a span with gen_ai.* attributes for model and token counts.
  • Instrument every tool call and retrieval step as a child span so you get the full trace tree.
  • Add cost-per-task and loop-count dashboards before anything fancier.
  • Once traces flow, layer in LLM-as-judge scoring for faithfulness and relevance.
  • Set an alert on loop count and cost-per-task so silent token burn pages you, not your invoice.

Frequently asked questions

How is agent observability different from regular APM?

APM tracks infrastructure signals, latency, errors, throughput, which all look healthy even when an agent returns a wrong answer. Agent observability adds output quality, faithfulness, tool-call accuracy, and behavioral drift, the dimensions where agents actually fail.

Do I need OpenTelemetry specifically?

You do not strictly need it, but the OpenTelemetry GenAI conventions are the 2026 standard and give you vendor portability. Instrument once with OTEL and you can switch backends without rewriting instrumentation.

Open source or commercial tool?

Langfuse and MLflow cover most self-hosted needs for free. Commercial tools like Braintrust and Confident AI add deeper eval workflows and managed infrastructure. Many teams start open source and add a commercial layer for evaluation at scale.

What is the single most useful metric?

Cost per completed task plus loop/retry count. Together they catch the most common production problem, an agent silently looping and burning tokens without making progress, which no infrastructure metric reveals.

The takeaway

Agents fail in ways that return HTTP 200, so you cannot monitor them like web services. Instrument every LLM call, tool call, and retrieval as OpenTelemetry spans, watch cost-per-task and loop counts, and pair traces with evals to judge quality. The trace tree alone will answer most "why did it do that" questions you currently cannot.

#ai#observability#agents

Sources & further reading

Keep reading