§ Research Note · May 17, 2026 · 13 min

Pass@1 lies. The reliability framework production agents actually need

Capability and reliability diverge as task duration grows. Benchmarks measure capability. Customers experience reliability. Here is the framework, the math, and the four metrics to ship instead.


> ../research/reliability_framework_v1.md

Every agent vendor publishes the same number. "Our agent achieves 87% on SWE-bench Verified." The number is accurate. The number is also misleading, in a specific and measurable way, and the gap between the number and your customer's experience is the gap that turns pilots into post-mortems.

This note is the long form of the third signal in yesterday's weekly post. The Princeton paper "Towards a Science of AI Agent Reliability" (Rabanser, Kapoor, Narayanan et al., February 2026) is the most useful academic contribution to production agent work this year. The argument is simple, and once stated, hard to unsee.

—— The argument in one sentence ——

Machine learning benchmarks evaluate capability, the ability to succeed on a single attempt. Production deployments require reliability, the property of succeeding consistently across repeated invocations on tasks of varying duration. These two properties diverge as task duration grows. Existing benchmarks are structurally blind to the divergence because they report only pass@1 on short, atomic tasks.

—— The number that explains the gap ——

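τ-bench published the data point that every product manager working on agent products should learn. GPT-4o achieves 61% pass@1 on retail agent tasks. The same agent on the same tasks achieves 25% pass@8, where pass@8 means all eight of eight attempts succeed (τ-bench writes this pass^8), not the at-least-one-success meaning that pass@k carries in code-generation benchmarks.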

Translation. If the agent's job is to handle a single retail support ticket once, it succeeds 61% of the time. If the agent's job is to handle eight retail support tickets in a row without a single failure, the probability of clean completion is 25%. The probability of at least one failure across the eight is approximately 75%.

The model is not "61% accurate" in any sense that matters to a customer support workflow. The model is "75% likely to fail somewhere in an eight-step workflow."

This is not a problem of better models. It is a problem of the wrong metric. The compounding is structural. If pass@1 is p, then the probability that all k attempts succeed, which is what pass@k means throughout this note, is p to the k under independence. The math is unfriendly. A 90% pass@1 agent has 43% pass@8 and 19% pass@16. A 95% pass@1 agent has 66% pass@8 and 44% pass@16. A 99% pass@1 agent has 92% pass@8 and 85% pass@16. The difference between 95% and 99% on a single step is the difference between an agent you can deploy at 16 steps and an agent you cannot.
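The compounding can be checked directly. This is a minimal sketch, assuming attempts are independent with the same success rate p and that a workflow counts as clean only when all k attempts succeed, so the all-k rate is p to the k. The function names are illustrative and do not come from any benchmark's codebase.

```python
def all_k_success(p: float, k: int) -> float:
    """Probability that k independent attempts all succeed, given per-attempt rate p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be >= 1")
    return p ** k


def compounding_table(rates=(0.90, 0.95, 0.99), ks=(8, 16)):
    """Map each single-attempt rate to its all-k success rate for each k."""
    return {p: {k: all_k_success(p, k) for k in ks} for p in rates}


if __name__ == "__main__":
    for p, row in compounding_table().items():
        cells = ", ".join(f"k={k}: {v:.0%}" for k, v in row.items())
        print(f"pass@1 {p:.0%} -> {cells}")
```

Independence is a baseline, not a forecast. When difficulty varies by task, the all-k rate measured over a suite tends to sit above p to the k, because easy tasks pass every time and hard tasks account for most of the failures. That is why the rate has to be measured rather than derived from pass@1.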

—— Why benchmarks miss this ——

Benchmarks evolved from a research culture that measures capability. The protocol is "give the model the task once, score the output, move on." That is fine for measuring whether a new architecture beats the prior one. It is wrong for measuring whether the model is ready to handle real workflows.

Three structural blind spots.

◆ Atomic tasks. SWE-bench, GAIA, and OSWorld all measure tasks a human could complete in minutes. Real workflows take tens of minutes to hours. Long horizons compound errors. Short benchmarks cannot measure long-horizon failure modes.

◆ Single attempts. Pass@1 captures one trajectory. Production runs the same agent on similar tasks thousands of times per day. Variance across runs is the reliability signal. Pass@1 throws it away.

◆ No perturbation testing. Production data is messy, rephrased, partially malformed. Benchmark inputs are clean. The same agent that hits 87% on benchmark inputs may hit 41% on the 12% of production inputs that have a typo or a missing field.

—— The four metrics to ship instead ——

We use these four metrics in client engagements. None of them require new tooling. All of them require the discipline to run the agent more than once per task.

Metric 1: Pass@k for production k. Choose k to match your real workflow. If your customer expects the agent to handle eight tickets in a row without intervention, your benchmark is pass@8, not pass@1. The number is harder to look at. It is also the number that predicts customer experience.
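One way to estimate this from logged runs, sketched under assumptions: each task is run n times with n at least k, and the per-task estimate is the chance that k runs drawn without replacement from those n were all successes, which is C(c, k) / C(n, k) for c successes. The names `all_k_estimate` and `suite_all_k` and the shape of the `results` mapping are hypothetical.

```python
from math import comb


def all_k_estimate(successes: int, trials: int, k: int) -> float:
    """Chance that k runs drawn without replacement from the recorded trials all succeeded."""
    if not 1 <= k <= trials:
        raise ValueError("need 1 <= k <= trials")
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    # comb(successes, k) is 0 when there are fewer than k successes.
    return comb(successes, k) / comb(trials, k)


def suite_all_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over a suite; results maps task id to run outcomes."""
    if not results:
        raise ValueError("results is empty")
    per_task = [all_k_estimate(sum(runs), len(runs), k) for runs in results.values()]
    return sum(per_task) / len(per_task)


if __name__ == "__main__":
    # Made-up outcomes for two tasks, eight runs each.
    runs = {"refund": [True] * 8, "exchange": [True] * 7 + [False]}
    print(suite_all_k(runs, 1))  # 0.9375
    print(suite_all_k(runs, 8))  # 0.5
```

In the toy data a single failed run out of sixteen leaves pass@1 at about 94% and cuts the all-8 rate to 50%, which is the gap this metric is meant to expose.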

Metric 2: Variance across runs. Run the agent on the same task 16 times. Report median, p10, p90 cost and latency. An agent that completes a task in 4 seconds at p50 and 90 seconds at p90 is not the same agent at scale as one that completes in 4 seconds at p50 and 5 seconds at p90. The p90 is where your billing and your customer's patience meet.
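A minimal sketch of the spread report, assuming you have one latency (or cost) figure per run of the same task. It uses a nearest-rank percentile so the result is always an observed value; the helper names and the `timeout_s` parameter are illustrative.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("values is empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def spread_report(values: list[float]) -> dict[str, float]:
    """Summarize repeated runs of one task; assumes strictly positive values."""
    p10, p50, p90 = (percentile(values, q) for q in (10, 50, 90))
    return {"p10": p10, "p50": p50, "p90": p90, "p90_over_p50": p90 / p50}


def timeout_rate(latencies: list[float], timeout_s: float) -> float:
    """Fraction of runs that took longer than the timeout."""
    if not latencies:
        raise ValueError("latencies is empty")
    return sum(t > timeout_s for t in latencies) / len(latencies)


if __name__ == "__main__":
    # Made-up latencies in seconds for 16 runs of one task.
    latencies = [4.0] * 13 + [5.0, 90.0, 95.0]
    print(spread_report(latencies))
    print(timeout_rate(latencies, timeout_s=8.0))
```

With only 16 runs the p10 and p90 are coarse, so treat the ratio as a screening signal and rerun with more samples before setting a timeout from it.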

Metric 3: Perturbation robustness. Take 100 production tasks. Generate 4 perturbations of each: typo, missing field, reordered, paraphrased. Run all 500, the 100 clean originals plus the 400 perturbed variants. The drop in pass rate from clean to perturbed is the production gap. Anything more than a 15-point drop is a brittleness signal, and the fix is in context engineering, not the model.
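Three of the four perturbations can be generated mechanically. The sketch below assumes each task input is either a string or a flat dict of fields. Paraphrasing is left out because it needs a model or a human writer. Every function name here is hypothetical.

```python
import random


def add_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def drop_field(record: dict, rng: random.Random) -> dict:
    """Return a copy of the record with one randomly chosen field removed."""
    if not record:
        return {}
    victim = rng.choice(sorted(record))  # sorted so a seeded rng is reproducible
    return {k: v for k, v in record.items() if k != victim}


def reorder_fields(record: dict, rng: random.Random) -> dict:
    """Return a copy of the record with its fields in shuffled order."""
    keys = list(record)
    rng.shuffle(keys)
    return {k: record[k] for k in keys}


def perturbation_drop(clean: list[bool], perturbed: list[bool]) -> float:
    """Clean pass rate minus perturbed pass rate, in percentage points."""
    def rate(outcomes: list[bool]) -> float:
        return 100.0 * sum(outcomes) / len(outcomes)
    return rate(clean) - rate(perturbed)
```

Reporting the drop per perturbation type, rather than one pooled figure, shows which kind of mess the agent is brittle to.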

Metric 4: Bounded error severity. When the agent fails, how bad is the failure? Did it return "I cannot help with that"? Bounded. Did it silently return wrong information? Unbounded. Did it execute an irreversible action with wrong inputs? Catastrophic. A 95% pass@1 agent that fails catastrophically 5% of the time is unshippable. A 92% pass@1 agent that fails gracefully 8% of the time is shippable.
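The three severity classes can be turned into a ship gate. This sketch assumes each failed run has already been labeled by a reviewer or a checker. The zero-tolerance default thresholds are an assumption for illustration, not a recommendation from the note, and all names are hypothetical.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    BOUNDED = "bounded"            # explicit refusal or handoff
    UNBOUNDED = "unbounded"        # silent wrong answer
    CATASTROPHIC = "catastrophic"  # irreversible action on wrong inputs


def severity_rates(failures: list[Severity], total_runs: int) -> dict[Severity, float]:
    """Share of all runs that ended in each failure class."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    counts = Counter(failures)
    return {s: counts[s] / total_runs for s in Severity}


def shippable(
    rates: dict[Severity, float],
    max_unbounded: float = 0.0,     # assumed threshold, tune per product
    max_catastrophic: float = 0.0,  # assumed threshold, tune per product
) -> bool:
    """Gate on failure severity; bounded failures do not block shipping here."""
    return (
        rates[Severity.UNBOUNDED] <= max_unbounded
        and rates[Severity.CATASTROPHIC] <= max_catastrophic
    )
```

Under these defaults, 100 runs with 8 bounded failures pass the gate and 100 runs with 5 catastrophic failures do not, which matches the two agents compared above.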

—— What this looks like in a real evaluation ——

We ran an agent for a legal-tech client last month. The bare benchmark number, on their internal task suite, was 89% pass@1. They were ready to ship.

The reliability evaluation found:

Pass@10 was 47%. Their real workflow required 10 sequential clean completions. The model was effectively 53% likely to fail somewhere in a workflow.

p90 latency was 4.2x p50 latency. Their UI timeout was 2x p50. At least one in ten user sessions was hitting a timeout.

Perturbation drop on missing-field inputs was 23 points. Their real intake forms had missing fields 18% of the time. The model was 23 points worse on nearly a fifth of their traffic.

Three failure modes were unbounded. For example, the model would generate a confident citation to a non-existent statute when its retrieval returned empty. That is the failure that makes the news.
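A runtime check can turn the empty-retrieval failure into a bounded one. The sketch below is an assumption about what such a guard could look like, not the client's implementation: `Draft`, `guard_citations`, and the refusal text are hypothetical, and it assumes the agent exposes both the citations it used and the ids its retrieval step returned.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    text: str
    citations: list[str] = field(default_factory=list)


# Hypothetical refusal message; a real product would route to a human instead.
REFUSAL = "I could not find a supporting source for this question."


def guard_citations(draft: Draft, retrieved_ids: set[str]) -> Draft:
    """Replace the draft with a bounded refusal unless every citation was actually retrieved."""
    unsupported = [c for c in draft.citations if c not in retrieved_ids]
    # Assumption: an answer with no retrieved sources at all is never acceptable.
    if not retrieved_ids or unsupported:
        return Draft(text=REFUSAL)
    return draft
```

The guard does not make the model more accurate. It converts a confident wrong answer into an explicit refusal, which moves the failure from unbounded to bounded.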

They did not ship. They spent six weeks on context engineering and runtime guards. After the work, pass@10 was 78%, p90 was 1.4x p50, perturbation drop was 6 points, and the unbounded failure modes were caught by runtime checks. Then they shipped.

The model did not change.

—— The framework ——

If you want one figure in this note, it is the four-axis chart we use with every client at evaluation:

[Figure: four-axis reliability chart]

— ORBIRESEARCH
