§ Research Files
Research notes from production.
Three signals from this week: a new dataset quantifies the pilot-to-production gap for the first time. Five intelligence agencies release joint guidance on agentic AI in critical infrastructure. And the agent communication protocol that every major vendor now implements has a permanent home.
Every production agent system has a trust boundary. Most teams never defined theirs explicitly. Here is the framework for deciding where it sits, the four most common placements, and the failure modes that emerge when the boundary drifts.
A four-agent pipeline where every individual agent behaved correctly, but the system produced a confident, completely false output. The failure mode is structural, not model-level. Here is the trace, the mechanism, and the three controls that prevent it.
Google and Anthropic shipped identical product category names on the same day with opposite infrastructure postures. Cursor Composer 2.5 landed at one-tenth the frontier price. The Big Four accounting firms completed their AI stack alignment. Three signals from the most consequential week in agent engineering so far.
Capability and reliability diverge as task duration grows. Benchmarks measure capability. Customers experience reliability. Here is the framework, the math, and the four metrics to ship instead.
Three signals from this week. The protocol everyone bet on has its first real security incident. The standards body finally moves from model governance to agent identity. And an academic group puts a name on what every production team has been measuring quietly.
Prompt engineering optimizes a single question. Context engineering decides what the agent can see, when, and at what resolution. Here is the framework for choosing altitude, the four most common failure modes, and the file structure that holds up under audit.
A real postmortem from a framework user who left for a wedding on a Friday. The agent looped from Friday night until Monday morning. The fix is two lines of code. The lesson is older than agents.
Three signals worth your attention this week: model-native agent harnesses are arriving, an academic consortium quantifies the security gap, and a new standard for agent threats is now public.
A production agent without a kill switch is not a production agent. The architecture pattern, the team protocol, and the failure modes a kill switch must handle.
A real postmortem from a vibe-coding session that went wrong. Why "freeze the code" is not a permission boundary, and the three controls that would have prevented this entirely.
A G2 report reveals orchestration as the top scaling bottleneck. NASA researchers publish a methodology that mirrors our architecture. And MCP crosses 10,000 servers. Here's what that means for your tool contracts.
Least privilege is not a security buzzword. It's the difference between an agent that fails safely and one that deletes your production database at 3 AM.
Datadog's State of AI Engineering report dropped hard data. Stanford measured real agent performance. And the most common production error isn't what you think.
Anthropic's tool-use update, a quiet retrieval paper worth your weekend, and three production postmortems from the field.
The ICP scoring looked perfect on paper. But it scored engagement, not authenticity. Here's how fake profiles gamed the scoring model and what I changed.
Not seven categories. Seven specific, concrete test scenarios, with expected outcomes defined before you run them. No exceptions.
Most agents remember everything or nothing. Both are wrong. Here's the framework for deciding what to store, what to forget, and how long information stays relevant.
A naive exponential backoff was hiding a deeper concurrency leak. Here's the trace, the fix, and the regression test that now catches it.
One typed schema, one error taxonomy, one idempotency convention. The discipline that turns demo agents into shippable systems.
Line by line through the SOUL.md powering a legal-tech agent in production. The decisions, the trade-offs, the things we'd change.