The instrumentation pattern: an agent you cannot see is an agent you cannot operate, The Lab

Most agents are instrumented for the failures that announce themselves and blind to the ones that do not. Here is the pattern: the three streams every production agent must emit, the question each one answers, and the failures you stay blind to when one is missing.

§ 01 · The blind spot

Most agents are instrumented the way a web service is: log the errors, watch latency, alert on a crash. That catches the failures that announce themselves. The failures that hurt most in production do the opposite. They are confident, well-formed, and wrong. The agent that silently misrouted tickets after a model swap threw no error and kept its latency flat the entire time it was getting worse. The agent that double-charged customers logged a clean retry. The agent that followed a hidden instruction made one more ordinary-looking API call. None of these tripped a guard built for crashes, because none of them was a crash.

This week a cloud platform shipped agent observability as a primitive and, with it, a score for how instrumented your agent is. The market has decided that observability is a measurable property of an agent. The instrumentation pattern is how you earn that score: decide, up front, what an agent must emit so that you can debug it, operate it, and account for it.

§ 02 · Three streams, three questions

A production agent emits three distinct streams. They are not the same data at different verbosity. They answer three different questions, for three different people, at three different times.

◆ Decision traces answer "why did it do that," for the engineer, during a debug.

◆ Behavioral metrics answer "is it still doing the right thing," for the operator, continuously.

◆ The action log answers "what did it actually do, and on whose authority," for the auditor, after the fact.

Miss any one and you go blind to a whole class of failure. Most teams ship the first as raw logs, approximate the second with latency and cost, and never build the third. The pattern is to build all three deliberately.

── Stream one: decision traces ──

§ 03 · Stream one: decision traces

A decision trace is the record of a single agent run as a causal chain: the input it received, the context it assembled, each tool it called with the arguments it passed and the result it got back, the intermediate reasoning that led from one step to the next, and the action it finally took. The unit is the run, and the trace is hierarchical, because in any multi-step or multi-agent system one run spawns others and you need to follow the call down and back.

The question it answers is "why did it do that," and you only ask that question when something has already gone wrong. So the test of a good trace is reconstruction: can an engineer who was not there replay the decision from the trace alone, without rerunning the agent? If reproducing a bad decision requires running the agent again and hoping it misbehaves the same way, your trace is too thin. Capture the inputs and tool results verbatim, not a summary of them, because the summary throws away the one detail the bug turns on.
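The trace shape described above can be sketched as a small record type. This is a minimal illustration, not a standard schema: the names `DecisionTrace`, `ToolCall`, and `spawn` are hypothetical, and a real system would more likely emit these as spans through its tracing library.

```python
from __future__ import annotations

import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """One tool invocation, stored verbatim (hypothetical shape)."""
    name: str
    arguments: dict[str, Any]  # exactly what the agent passed
    result: Any                # exactly what came back, not a summary


@dataclass
class DecisionTrace:
    """One agent run recorded as a causal chain. The unit is the run."""
    run_input: Any
    context: list[str]
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_run_id: Optional[str] = None
    tool_calls: list[ToolCall] = field(default_factory=list)
    reasoning: list[str] = field(default_factory=list)
    final_action: Optional[dict[str, Any]] = None
    children: list[DecisionTrace] = field(default_factory=list)

    def spawn(self, run_input: Any, context: list[str]) -> DecisionTrace:
        """Record a sub-run so the call can be followed down and back."""
        child = DecisionTrace(run_input=run_input, context=context,
                              parent_run_id=self.run_id)
        self.children.append(child)
        return child

    def to_json(self) -> str:
        """Serialize the whole tree, children included."""
        return json.dumps(asdict(self), sort_keys=True)
```

Storing `arguments` and `result` as-is is the point of the sketch: the reconstruction test only passes if the serialized trace holds the same bytes the agent saw.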

The cost is volume, so sample in the healthy case and capture in full on anything that escalated, errored, or crossed a confidence threshold. You want every trace you will actually need, and you will need the ones around the failures.
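One way to express that sampling rule is as a decision made after the run finishes, once its outcome is known. The sketch below is illustrative: `RunOutcome`, the 5% healthy sample rate, and the 0.7 confidence floor are assumptions, and "crossed a confidence threshold" is read here as "fell below a floor."

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunOutcome:
    """Hypothetical summary of how a run ended."""
    escalated: bool
    errored: bool
    confidence: float


def should_capture_full(
    outcome: RunOutcome,
    *,
    healthy_sample_rate: float = 0.05,   # assumed value, tune per agent
    confidence_floor: float = 0.7,       # assumed value, tune per agent
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep every trace near a failure; sample the healthy rest."""
    if outcome.escalated or outcome.errored:
        return True
    if outcome.confidence < confidence_floor:
        return True
    # Healthy run: keep a small random share for baseline comparison.
    return rng() < healthy_sample_rate
```

Because the decision depends on how the run ended, the trace has to be buffered for the duration of the run and either persisted or dropped at the end.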

── Stream two: behavioral metrics ──

§ 04 · Stream two: behavioral metrics

Behavioral metrics are the aggregate shape of what the agent does, over time. They are not latency and cost: those are operational metrics, and they stay green during the failures that matter. Behavioral metrics measure the decisions: the distribution of outputs across categories, the escalation rate, the retry rate, the rate at which a human overturns the agent in the approval queue, the task success rate where you can measure it, and the share of runs that hit each tool.

The question they answer is "is it still doing the right thing," asked continuously, by whoever is on call. The power of this stream is that it catches the silent regression. When a model changes under the agent, the code is untouched and no error fires, but the distribution of decisions moves, and a moving distribution is an alarm. The triage agent's collapse would have been a Saturday morning alert instead of a four-day incident if one number, the share of tickets routed to urgent, had been on a dashboard with a band around it.

Set baselines and alert on deviation, not on absolute thresholds. You rarely know the right absolute value. You always know that a sharp move from last week's value is worth a human look. The overturn rate in your approval queue deserves special attention: a rising overturn rate means the agent and the humans are drifting apart, and that is an early warning that the agent is degrading before any customer feels it.
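A baseline-and-band check of this kind can be sketched in a few lines. The band here is the historical mean plus or minus `k` standard deviations, with a minimum width so a very stable history does not alert on noise. The function names, `k = 3.0`, and the 0.02 minimum width are assumptions for illustration, not recommended values.

```python
from collections import Counter
from statistics import mean, pstdev


def category_share(decisions: list[str], category: str) -> float:
    """Share of decisions in one window that landed in `category`."""
    if not decisions:
        return 0.0
    return Counter(decisions)[category] / len(decisions)


def outside_band(
    current: float,
    history: list[float],
    *,
    k: float = 3.0,           # assumed band width, in standard deviations
    min_width: float = 0.02,  # assumed floor so flat histories do not flap
) -> bool:
    """True when `current` has moved sharply from the recent baseline."""
    baseline = mean(history)
    width = max(k * pstdev(history), min_width)
    return abs(current - baseline) > width


# Usage: the share of tickets routed to "urgent", window over window.
history = [0.10, 0.11, 0.09, 0.10]            # previous windows
today = ["urgent"] * 4 + ["normal"] * 6       # current window
alert = outside_band(category_share(today, "urgent"), history)
```

The same check applies unchanged to the escalation rate, the retry rate, or the overturn rate: only the number fed into `outside_band` differs.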

── Stream three: the action log ──

§ 05 · Stream three: the action log

The action log is the durable, append-only record of every consequential, externally visible thing the agent did: every message sent, record changed, transaction posted, job kicked off. Each entry carries what the action was, when it happened, which agent and version took it, on whose behalf, and under what authority. These are the same fields the identity note argued every action must carry. It is append-only because its job is to be trustworthy after the fact, and a log you can edit is not evidence.
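The entry fields listed above map directly onto a record type. The sketch below also chains each entry to the previous one with a SHA-256 hash, so that an edit to any earlier entry becomes detectable. The hash chain is an added technique, not something this pattern prescribes, and `ActionEntry` and `ActionLog` are hypothetical names. The list is in memory only; durable, write-once storage is a separate concern.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ActionEntry:
    action: str         # what the action was
    timestamp: str      # when it happened (ISO 8601, supplied by caller)
    agent: str          # which agent took it
    agent_version: str  # which version of that agent
    on_behalf_of: str   # on whose behalf
    authority: str      # under what authority


class ActionLog:
    """Append-only log; each record commits to the one before it."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self._records: list[dict] = []

    @staticmethod
    def _digest(prev: str, body: dict) -> str:
        payload = prev + json.dumps(body, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, entry: ActionEntry) -> str:
        prev = self._records[-1]["hash"] if self._records else self.GENESIS
        body = asdict(entry)
        digest = self._digest(prev, body)
        self._records.append({"entry": body, "prev": prev, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks it."""
        prev = self.GENESIS
        for rec in self._records:
            if rec["prev"] != prev or rec["hash"] != self._digest(prev, rec["entry"]):
                return False
            prev = rec["hash"]
        return True
```

A hash chain does not prevent edits; it only makes them visible on verification. Storage-level controls are still what stops someone from rewriting the log wholesale.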

The question it answers is "what did it actually do, and on whose authority," asked by an auditor, a compliance team, or you on the worst day, after something has already happened. This is the stream teams skip, because on a good day it does nothing. It exists for the bad day. When an agent does something it should not have, the action log is the difference between "we can show exactly what happened, to whom, and stop it" and "we think it was one of our agents, we are reconstructing." Governance, the line this week's data drew between agents that scale and the majority that get rolled back, lives or dies on this stream.

── How they work together ──

§ 06 · Why you need all three

The three streams cover the three time horizons of operating an agent, and each is blind where the others see.

Behavioral metrics tell you something is wrong, now, in aggregate. They do not tell you why. Decision traces tell you why this specific run went wrong, but only once you know to look, and you know to look because a metric moved. The action log tells you what reached the outside world and who answers for it, which neither of the others is built to prove and an auditor will not accept on the strength of a debug trace. Detection, diagnosis, accountability. Three questions, three streams, no substitutions.

A useful way to see the dependency: the metric is the smoke alarm, the trace is the fire investigator, the action log is the insurance claim. You do not get to pick one.

── The file ──

§ 07 · The OBSERVABILITY.md

Create an OBSERVABILITY.md in the agent repository. It records, for decision traces, what is captured, the sampling rule, and the retention period. For behavioral metrics, the list of metrics, the baseline for each, and the alert band. For the action log, which actions are logged, the fields each entry carries, where it is stored append only, and who can read it. It names the dashboards, the alert targets, and the owner.
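A file like this tends to go stale or stay half-filled, so one option is a small check, run in CI, that every required item is at least mentioned. The sketch below is deliberately crude and entirely hypothetical in its details: it assumes the file uses `## ` headings named as in `REQUIRED` and does only a case-insensitive substring match for each item.

```python
# Required sections and items, taken from the list above. The exact
# heading and item wording is an assumption made for this sketch.
REQUIRED = {
    "Decision traces": ["captured", "sampling rule", "retention"],
    "Behavioral metrics": ["metrics", "baseline", "alert band"],
    "Action log": ["actions logged", "entry fields", "storage", "read access"],
    "Ownership": ["dashboards", "alert targets", "owner"],
}


def missing_items(markdown: str) -> list[str]:
    """Return the sections and items an OBSERVABILITY.md does not mention."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line.lower())

    missing: list[str] = []
    for heading, items in REQUIRED.items():
        body = sections.get(heading.lower())
        if body is None:
            missing.append(heading)  # whole section absent
            continue
        text = "\n".join(body)
        missing.extend(f"{heading}: {item}" for item in items if item not in text)
    return missing
```

An empty return value means only that every item is mentioned, not that the values are right. Reviewing the baselines and bands is still a job for the named owner.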

This is the document that answers, in one place, the question the platforms now score and buyers now ask: can you see your agent. An agent without it is not unobservable by accident. It is unobservable by omission, and that omission is invisible right up until the day you need to see, and cannot.

── End of pattern ──

◆ Three streams, three questions. Decision traces for why, behavioral metrics for whether it is still right, the action log for what it did and on whose authority. Miss one and you go blind to a class of failure.

◆ Operational health (latency, cost, errors) stays green during the failures that matter. Behavioral metrics, the distribution of decisions over time, are what catch the silent regression.

◆ The action log does nothing on a good day and is the only thing that helps on the worst one. Append only, or it is not evidence.

ORBIRESEARCH