← Back to The Lab
§ Failure ReportJune 18, 20269 min

The agent that quietly got worse over a weekend, and the model swap nobody logged

A triage agent started misrouting urgent tickets. No error fired. The outputs were well formed and plausible. The cause was a model version that changed under a floating alias, and nothing was watching the one thing that would have caught it. Here is the trace and the three controls.

ShareXLinkedInFacebook

> ../failures/silent_model_regression.md

§ 01 · What happened

A support triage agent read inbound tickets and routed each one to a queue: billing, technical, urgent, or general. It had run for three months at a routing accuracy the team was happy with. It produced clean, well formed output every time: a queue name and a one line reason. Nothing about it ever threw an error.

Over a weekend, the agent began sending a specific class of urgent tickets, the ones where a customer described a service outage in calm, non urgent language, to the general queue instead. The tickets were not malformed. The routing decisions were not crazy. Each one, read on its own, looked like a defensible call. They were simply wrong, and they were wrong in a pattern.

By the time a team lead noticed that several outage reports had sat in the general queue for two days, the SLA breaches had already happened. The customers had already waited. The agent had been quietly worse for four days, and every log said it was healthy.

── The thing that changed ──

§ 02 · What actually changed: the model under the alias

The agent's code did not change that weekend. No one deployed. The prompt was untouched. What changed was the model.

The agent called the provider through a floating alias, the kind that always points at the current recommended version. That weekend the provider rolled the alias from one model version to the next. The new model was, on every public benchmark, better than the old one. It was also, on this team's specific instruction, subtly different. The old model had read "route by stated urgency and by impact" and weighted impact heavily. The new model weighted the customer's stated tone more. A calm description of an outage now read, to the new model, as low urgency.

Nobody chose this. Nobody logged it. The team did not know the model had changed until they went looking for what was different, and the only thing that was different was a version string they never controlled and never recorded.

── Why nothing caught it ──

§ 03 · Why every guard the team had stayed silent

This team was not careless. They had guards. The guards were the wrong shape for this failure.

They validated the output schema. The agent always returned a valid queue name and reason, so schema validation passed every time. A valid answer is not a correct answer, and schema validation cannot tell the difference.

They alerted on errors. There were no errors. A misroute is not an exception. It is a confident, well formed, wrong decision, and confident wrong decisions do not raise errors.

They watched latency and cost. Both were normal. The new model was about as fast and about as cheap. Operational health was green the entire time the agent was making worse decisions.

What they did not have was a single measurement of behavior over time. Nothing tracked the distribution of routing decisions week to week. If anything had been watching the share of tickets going to the urgent queue, it would have seen that share drop sharply on Saturday and stay down. That drop was the failure, visible from the first hour, and nobody was looking at the number that showed it.

── The mechanism ──

§ 04 · The mechanism: a moving dependency and a blind spot

Two things combined. First, a production agent depended on a model through a floating alias, which means the most important dependency in the system could change without a deploy, without a version record, and without anyone's consent. Second, the agent was instrumented for the wrong failures. It watched for malformed output and crashes, the failures that announce themselves, and it watched nothing that would reveal a silent shift in the quality of its decisions.

This is the same shape as the lost confirmation in the double charge report and the trusted tool output in the hijack report: a wrong assumption sitting between correct steps. Here the assumption is "the model is a fixed thing." It is not. In production a model is a dependency like any other, and an unpinned dependency that can change behavior under you is a liability, not a convenience.

── The fix ──

§ 05 · The fix, in three controls

Control one: pin the model version. Never run production on a floating alias. Production agents reference an exact model version, recorded in config and in version control. A model upgrade becomes a deliberate change you make, review, and can roll back, not something that happens to you on a Saturday. The floating alias is fine in a notebook. It has no place in a system that makes decisions customers feel.

Control two: gate every model change behind the eval set. When you do upgrade, run your frozen eval set against the new version before it touches production, and compare the results to the current version, not just to a pass threshold. The eval set is the spec. A new model that scores worse on your specific instructions is a regression for you even when it scores better on the public leaderboard. Promote on your evidence, not the provider's.

Control three: instrument behavior, not just health. Track the distribution of the agent's decisions over time. For this agent that is the share of tickets going to each queue, week to week. A sharp move in that distribution is an event, whether or not a deploy happened, whether or not an error fired. This is the measurement that would have turned four silent days into a Saturday morning alert. The next pattern note builds this into a general instrumentation discipline, because this blind spot is the most common one in production agents.

── The control that closes it ──

§ 06 · The canary that now runs

A scheduled job runs the frozen eval set against the live model once a day and on any detected version change, and it diffs the score against the last known good run. A second monitor watches the routing distribution and alerts when any queue's share moves beyond a set band. The day the alias rolls again, the canary catches the behavior change before a customer does. The cost of both is small. The cost of not having them was measured in SLA breaches and trust.

── End of report ──

A model reached through a floating alias is an unpinned dependency that can change behavior with no deploy and no record. Pin the version in production.

Schema validation, error alerts, latency, and cost all stay green during a silent quality regression. None of them measure whether the agent is still making the right decisions.

Track the distribution of the agent's decisions over time. A shift in that distribution is the alarm that fires when everything else says healthy.

ORBIRESEARCH

ShareXLinkedInFacebook