Tool contracts: the single file that prevents 80% of production bugs, The Lab

One typed schema, one error taxonomy, one idempotency convention. The discipline that turns demo agents into shippable systems.

Every production agent bug I've investigated in the past six months traces back to one of three causes: the agent sent the wrong input to a tool, the agent misinterpreted the tool's output, or the agent didn't handle the tool's failure correctly.

All three are preventable with a single document: a tool contract.

A tool contract is not documentation. Documentation tells you how a tool works. A tool contract tells the agent how to use the tool without making mistakes. The difference is the audience, documentation is for humans, contracts are for the system that governs agent behavior.

What a tool contract contains

Every tool contract in our system has 15 fields. Not because we like bureaucracy, but because every field prevents a specific category of bug.

Tool Name and Category. This seems trivial until you have 12 tools and two of them have similar names. "Search" is not a tool name. "Tavily Web Search" is. Category (Search, Extract, DB Write, Delivery, Memory) lets you group tools by function and apply category-level rules.

Purpose. One sentence that says what this tool does. Not what it could do, not what it might do. What it does. "Searches relevant web sources by a text query and returns a list of results with title, URL, and snippet." If you can't write this sentence clearly, you don't understand the tool well enough to let an agent use it.

Input Type and Required Fields. This is where most bugs originate. If you don't explicitly state that the search tool requires a non-empty text query, your agent will eventually send it an empty string and treat the empty result as "no matches found", which is a valid business outcome, not an error. The agent will report "no results" when the real situation is "I asked a broken question."

Required fields are the minimum viable input. If any required field is missing or invalid, the tool call should not happen. Period.

Optional Fields. These are fields that improve results but aren't required. Language, region, result limit, date range. The distinction matters because your agent should never fail because an optional field is missing. If it does, either the field isn't really optional or your validation logic is wrong.

Output Type. What the tool gives back. Not "results", that means nothing. "A list of objects, each containing title (string), url (string), and snippet (string, max 200 chars)." When you define output this precisely, your agent can validate what it received before using it. Without this definition, your agent trusts whatever comes back.

Validation Rule. How the agent verifies input before sending it. "Query must be a non-empty string under 500 characters." Simple, mechanical, testable. Your agent checks this before every tool call. If validation fails, the tool is never called, and the failure is logged as an input error, not a tool error.

Failure Modes. How this specific tool breaks. Not how tools in general break. This specific tool. Tavily Search fails with: timeout (network), rate limit (429), API error (500), empty result (valid but no matches). Each failure mode gets a different response. Timeout gets a retry. Rate limit gets a backoff. API error gets logged and escalated. Empty result gets passed through as a valid outcome.

Retry Allowed. Not "yes" or "no", "conditional." Retry makes sense for timeout and rate limit. Retry does not make sense for invalid input or permission errors. If you retry an invalid input, you get the same invalid response faster.

Fallback Exists. If Tavily is down, can you use Exa instead? If your primary DB is unreachable, can you queue the write? The existence of a fallback changes how you handle failure. With a fallback, a tool failure is a detour. Without one, it's a dead end.

Criticality. Is this tool critical (workflow cannot complete without it), high (important but has a fallback), medium (useful but skippable), or low (nice to have)? This determines whether a tool failure means the workflow fails, or the workflow continues without that tool's contribution.

Read or Write. This single field determines more about your security posture than any other. A read tool cannot damage your system. A write tool can. Every write tool needs stricter validation, approval considerations, and failure handling than a read tool. If you don't mark this distinction, you're treating "search the web" and "write to the production database" with the same level of caution.

Security Notes. Tool-specific safety constraints. "Results must not be accepted as validated without cross-reference." "Write operations must target staging tables only." "API key must not be logged." These are the rules that prevent the tool from being used in ways that seem correct but cause damage.

Used By. Which agents use this tool. When you update a tool's contract, you know exactly which agents are affected. Without this field, a tool change is a blind update that might break agents you forgot about.

Why this prevents 80% of bugs

Take the three failure categories I mentioned at the start:

Wrong input to tool: Prevented by Required Fields + Validation Rule. The agent checks before calling.

Misinterpreted output: Prevented by Output Type. The agent knows exactly what to expect and can validate what it received.

Unhandled failure: Prevented by Failure Modes + Retry + Fallback + Criticality. Every failure path is predefined.

The remaining 20% of bugs come from logic errors in the workflow itself, the agent does the right things in the wrong order, or makes a wrong decision at a branch point. Tool contracts don't fix that. But they eliminate the noise so you can see the real logic problems clearly.

How to start

You don't need to write contracts for every tool on day one. Start with your write tools, the ones that change state in your system. A search tool with a bad contract gives you bad results. A database write tool with a bad contract gives you corrupted data. Prioritize by damage potential.

Then do your critical tools, the ones without which your workflow cannot complete. If your workflow depends on Tavily Search and Tavily goes down, what happens? If you don't have an answer written down, you're going to find out at 3 AM.

Finally, do the rest. Once you've done the dangerous and critical tools, the others are straightforward because you already have the pattern.

One file per tool. That's all it takes.

Fifteen fields. One page. Written once, used by every agent that touches that tool. Updated when the tool changes, and every agent that uses it automatically gets the update because they reference the same contract.

This is not documentation for humans. This is a protocol for systems. The discipline that turns demo agents into shippable systems.

— ORBIRESEARCH