What happened in agent engineering this week, The Lab

Anthropic's tool-use update, a quiet retrieval paper worth your weekend, and three production postmortems from the field.

This week brought three things worth your attention: a meaningful update to how Claude handles tool calls, a research paper that reframes how agents should retrieve information, and a pattern emerging from production failures that keeps showing up in different systems.

1. Anthropic tightened tool-use behavior in Claude

Anthropic shipped an update to how Claude handles tool calls in multi-step workflows. The change is subtle but important: the model now validates tool output schemas more strictly before proceeding to the next step.

What this means in practice: if your tool returns a slightly malformed response, say a missing field or an unexpected null, Claude will now surface that mismatch rather than silently working around it. Previously, it would often "fill in the blanks" with assumptions. That felt smooth in demos but caused cascading errors in production.

If you have agents running on Claude with tool chains longer than three steps, audit your tool output schemas this week. Anything loose will now get flagged instead of hidden.

Implication for your architecture: This reinforces why tool contracts matter. If you already define explicit output types and required fields for every tool, this update helps you. If you don't, you'll start seeing failures you weren't seeing before, which is actually a good thing. Visible failures are fixable failures.
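To make "explicit output types and required fields" concrete, here is a minimal sketch of a tool contract check you could run on your own side before handing a tool result back to the model. The names ToolContract and check_tool_output are hypothetical, invented for this example; they are not part of any Anthropic API, and the check is deliberately simpler than a full JSON Schema validator.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract: required field names mapped to expected types."""
    name: str
    required: dict[str, type]


def check_tool_output(contract: ToolContract, output: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the output conforms."""
    problems: list[str] = []
    for field, expected in contract.required.items():
        if field not in output:
            problems.append(f"{contract.name}: missing required field '{field}'")
        elif output[field] is None:
            problems.append(f"{contract.name}: field '{field}' is null")
        elif not isinstance(output[field], expected):
            actual = type(output[field]).__name__
            problems.append(
                f"{contract.name}: field '{field}' is {actual}, expected {expected.__name__}"
            )
    return problems


# Illustrative contract for an assumed "search" tool.
search_contract = ToolContract("search", {"url": str, "score": float})

for problem in check_tool_output(search_contract, {"url": None}):
    print(problem)
```

Running this kind of check in your own tool wrapper means a missing field or unexpected null shows up in your logs first, with the tool name attached, rather than surfacing mid-chain.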

2. A retrieval paper worth reading this weekend

A team from Stanford published work on what they call "evidence-gated retrieval": the idea that an agent should not just retrieve information but should explicitly decide whether the evidence it found meets a minimum threshold before acting on it.

This isn't new in principle. But the paper formalizes something most production systems do poorly: the gap between "I found something" and "what I found is good enough to act on." Most agents treat any search result as valid input. This paper proposes a structured evidence gate, a checkpoint where the agent evaluates source quality, contradiction risk, and confidence level before proceeding.

If you're building research agents or any agent that makes decisions based on retrieved information, this is directly applicable. The framework maps cleanly onto a retrieval skill with built-in confidence scoring.

Implication for your architecture: This is exactly what we cover in Miro Frame 06 (Retrieval & Evidence Flow): minimum evidence thresholds, signal vs noise classification, and confidence tagging. If your agent doesn't have these, retrieved information flows straight into decisions without any quality gate.
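For a feel of what a quality gate can look like in code, here is a rough sketch. It is not the paper's method: the Evidence fields, the 0-to-1 scoring scale, and every threshold value below are assumptions chosen for illustration, and how you produce the scores in the first place is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One retrieved item with hypothetical scores, each in the range 0.0-1.0."""
    source_quality: float      # higher means a more trustworthy source
    contradiction_risk: float  # higher means more likely to conflict with other evidence
    confidence: float          # how well the item supports the claim at hand


def evidence_gate(
    items: list[Evidence],
    min_items: int = 2,
    min_quality: float = 0.6,
    max_contradiction: float = 0.3,
    min_confidence: float = 0.7,
) -> tuple[bool, list[Evidence]]:
    """Keep only items that clear every threshold, then require enough of them.

    Returns (passed, accepted_items). The agent should act only when passed is True.
    """
    accepted = [
        e for e in items
        if e.source_quality >= min_quality
        and e.contradiction_risk <= max_contradiction
        and e.confidence >= min_confidence
    ]
    return len(accepted) >= min_items, accepted


strong = Evidence(source_quality=0.9, contradiction_risk=0.1, confidence=0.8)
weak = Evidence(source_quality=0.4, contradiction_risk=0.1, confidence=0.9)

passed, kept = evidence_gate([strong, weak])
print(passed, len(kept))
```

The point of returning both the verdict and the accepted items is that a failed gate is still useful: the agent can log what it found, retrieve more, or escalate, instead of acting on a single weak result.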

3. Three production postmortems, same root cause

This week I reviewed three unrelated agent failures from different systems. A content agent that published duplicate posts. A research agent that wrote contradictory results to staging. A scheduling agent that double-booked the same time slot.

Different agents, different companies, different stacks. Same root cause: no idempotency check before the write operation.

All three agents validated their input correctly. All three processed information correctly. All three failed at the last step, writing the result, because they didn't check whether that exact result already existed.

The fix in all three cases: Add a deduplication check as the final validation step before any write. Compare the proposed write against existing records by key identifiers (URL, timestamp, entity name, whatever makes a result unique in your domain). If a match exists, skip the write and log it as a duplicate, not an error.

This is why we require a deduplication step in every workflow that includes a write operation. It's not glamorous work, but it prevents the most common production failure I see.
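The deduplication check described above can be sketched in a few lines. This is a minimal illustration, not the code from any of the three systems: write_once, the in-memory dict standing in for a datastore, and the room/start key fields are all hypothetical.

```python
import logging

logger = logging.getLogger("agent.writes")


def write_once(store: dict, record: dict, key_fields: tuple[str, ...]) -> bool:
    """Write the record unless one with the same key identifiers already exists.

    Returns True if the record was written, False if it was skipped as a duplicate.
    """
    key = tuple(record[field] for field in key_fields)
    if key in store:
        # A duplicate is an expected outcome, so log at info level, not as an error.
        logger.info("duplicate skipped: %s", key)
        return False
    store[key] = record
    return True


# Illustrative scheduling example: a slot is unique by room and start time.
bookings: dict = {}
slot = {"room": "A", "start": "10:00", "owner": "first-agent-run"}

print(write_once(bookings, slot, ("room", "start")))
print(write_once(bookings, dict(slot, owner="retry"), ("room", "start")))
```

One caveat on the design: a check followed by a write is not atomic. If several agent runs can write concurrently, back the same keys with a uniqueness constraint in the datastore itself and treat this check as the first line of defense.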

This week's signal summary

· Tool output validation is getting stricter. Audit your tool contracts.

· Evidence-gated retrieval is worth implementing. Don't let search results flow into decisions unchecked.

· Deduplication before writes prevents the most common production failure. Three systems proved it again this week.

See you next Friday.