§ Pattern · April 21, 2026 · 10 min

The seven tests your agent must pass before it touches production

Not seven categories. Seven specific, concrete test scenarios, with expected outcomes defined before you run them. No exceptions.


Every agent in our system must pass seven test scenarios before it gets a "build-ready" status. Not seven categories of tests. Seven specific scenarios, each with an expected outcome written down before the test is run.

This is not optional. It's a gate. If the tests don't exist, the agent doesn't deploy.

Here's why these seven, and exactly how to write each one.

Why seven and not more

You could write fifty test scenarios for any agent. You'd be right to do so eventually. But fifty tests before first deployment means the agent never deploys. The first deployment gets delayed, then delayed again, then abandoned.

Seven is the minimum viable set that covers the critical paths: one happy path, four failure and edge cases (invalid input, no results, tool failure, write failure), and two control flow tests (approval and fallback). If your agent passes these seven, it won't be perfect, but it won't be dangerous. Everything after seven is refinement.
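As a rough illustration of the gate, here is a minimal sketch in Python. The scenario identifiers and the `is_build_ready` helper are hypothetical names chosen for this example, not part of any real system; the only rule taken from the text is that a missing test blocks deployment just like a failed one.

```python
# Scenario names follow the seven tests in this article; identifiers are illustrative.
REQUIRED_SCENARIOS = (
    "happy_path", "invalid_input", "no_results", "tool_failure",
    "write_failure", "approval_path", "fallback_path",
)


def is_build_ready(results):
    """Return True only if every required scenario exists and has passed.

    `results` maps scenario name -> "Passed" or "Failed". A missing
    scenario counts as a failure: untested means not deployable.
    """
    return all(results.get(name) == "Passed" for name in REQUIRED_SCENARIOS)


all_passed = {name: "Passed" for name in REQUIRED_SCENARIOS}
missing_one = {k: v for k, v in all_passed.items() if k != "fallback_path"}

print(is_build_ready(all_passed))   # True
print(is_build_ready(missing_one))  # False
```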

Test 1: Happy path

What you test: The ideal scenario where everything works perfectly.

Expected outcome: Agent receives valid input, executes all workflow steps in order, produces correct output, writes to the correct location, and closes with status "completed."

Why it matters: If the happy path doesn't work, nothing else matters. This is your baseline. Every other test is a deviation from this.

How to write it: Take your workflow definition from Miro Frame 03 and trace through every step with valid inputs and functioning tools. The expected outcome is literally your workflow definition executed without errors.

Common mistake: Testing the happy path with a trivially simple input. Your happy path test should use a realistic, complex input, the kind of input the agent will see in production.
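A happy path test might look something like the sketch below. The three steps and the `run_workflow` helper are stand-ins invented for illustration; a real test would trace the steps from your own workflow definition and would also check the write location, which this sketch leaves out.

```python
def run_workflow(payload, steps):
    """Run each (name, function) step in order and record the trace."""
    trace = []
    data = payload
    for name, step in steps:
        data = step(data)
        trace.append(name)
    return {"status": "completed", "trace": trace, "output": data}


# Hypothetical three-step workflow; real steps come from the workflow definition.
steps = [
    ("normalize", lambda p: {**p, "query": p["query"].strip().lower()}),
    ("search", lambda p: {**p, "hits": ["doc-1", "doc-2"]}),
    ("summarize", lambda p: {"summary": f"{len(p['hits'])} hits for {p['query']}"}),
]

# Expected outcome, written down before the run.
expected = {
    "status": "completed",
    "trace": ["normalize", "search", "summarize"],
    "output": {"summary": "2 hits for q3 pricing changes, emea region"},
}

# A realistic input rather than a trivial one: padding, mixed case, punctuation.
actual = run_workflow({"query": "  Q3 Pricing Changes, EMEA region "}, steps)
assert actual == expected
```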

Test 2: Invalid input

What you test: What happens when the agent receives input that doesn't meet requirements.

Expected outcome: Agent detects the invalid input, does NOT proceed to the main workflow, logs the input failure, and closes with status "failed_input_validation." No writes to any table except workflow_runs.

Why it matters: This tests your input validation layer, Miro Frame 02. If invalid input gets through, every downstream step is working with garbage.

How to write it: Look at your required fields from the trigger definition. Remove one. Corrupt another. Send an empty payload. Each variation should produce the same result: immediate rejection, clean status, no side effects.
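The three variations above can be sketched as one parametrised check. The field names (`query`, `requester_id`) and the `run_agent` stub are assumptions made for this example; only the status string and the `workflow_runs` table name come from the text.

```python
from dataclasses import dataclass, field

REQUIRED_FIELDS = ("query", "requester_id")  # hypothetical required fields


@dataclass
class RunResult:
    status: str
    writes: list = field(default_factory=list)  # tables written during the run


def validate(payload):
    """Return a list of validation problems; an empty list means valid."""
    if not isinstance(payload, dict) or not payload:
        return ["empty or non-dict payload"]
    problems = []
    for name in REQUIRED_FIELDS:
        value = payload.get(name)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or invalid field: {name}")
    return problems


def run_agent(payload):
    """Validate first; only a valid payload reaches the main workflow."""
    if validate(payload):
        # The only permitted write on rejection is the run record itself.
        return RunResult("failed_input_validation", writes=["workflow_runs"])
    return RunResult("completed", writes=["workflow_runs", "staging"])


# Remove a field, corrupt a field, send an empty payload.
bad_payloads = [
    {"query": "pricing changes"},             # requester_id removed
    {"query": 12345, "requester_id": "u-1"},  # query corrupted
    {},                                       # empty payload
]
for payload in bad_payloads:
    result = run_agent(payload)
    assert result.status == "failed_input_validation"
    assert result.writes == ["workflow_runs"]
```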

Test 3: No results

What you test: What happens when the agent works correctly but the search/retrieval finds nothing useful.

Expected outcome: Agent runs the workflow, searches correctly, finds nothing that meets the relevance threshold, and closes with status "no_results." This is NOT "failed". It is a valid outcome that means "I looked properly and there's nothing there."

Why it matters: The most common production confusion is treating "no results" as an error. If your agent reports "failed" when it simply found nothing, you'll waste hours investigating non-problems. Worse, you might add retry logic for a situation that doesn't improve with retrying.

How to write it: Use a valid input that targets a deliberately narrow or empty search space. The expected output is: workflow executes, search returns empty, agent reports no_results, no writes to staging.
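One way to picture this test, assuming a numeric relevance score and a threshold of 0.7 (both invented for the example), is a workflow that filters hits and reports `no_results` when nothing survives. The `run_search_workflow` function and the two stub search tools are hypothetical.

```python
RELEVANCE_THRESHOLD = 0.7  # hypothetical threshold; use the one from your own spec


def run_search_workflow(query, search):
    """Search, keep only relevant hits, and report no_results if none remain."""
    hits = search(query)
    relevant = [h for h in hits if h["score"] >= RELEVANCE_THRESHOLD]
    if not relevant:
        # A valid outcome, not an error: nothing is written to staging.
        return {"status": "no_results", "staging_writes": []}
    return {"status": "completed", "staging_writes": relevant}


def empty_search(query):
    """Stub tool: the search space is empty."""
    return []


def weak_search(query):
    """Stub tool: hits exist, but none meet the relevance threshold."""
    return [{"id": "doc-1", "score": 0.2}]


for search in (empty_search, weak_search):
    outcome = run_search_workflow("obscure topic", search)
    assert outcome["status"] == "no_results"
    assert outcome["staging_writes"] == []
```

Both stubs should produce the same status, since "found nothing" and "found nothing good enough" are the same outcome from the caller's point of view.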

Test 4: Tool failure

What you test: What happens when a critical tool is unavailable or returns an error.

Expected outcome: Depends on your tool contract. If the tool has retry logic: agent retries N times. If retry fails and a fallback exists: agent uses fallback. If no fallback: agent closes with "failed" and a clear error log indicating which tool failed and why.

Why it matters: This tests your reliability layer, Miro Frame 07. In production, tools fail. APIs go down. Rate limits hit. Network blips happen. Your agent's behavior during tool failure is the difference between a robust system and a fragile one.

How to write it: Mock the tool to return a 500 error or timeout. Verify that the agent follows the exact retry/fallback/failure path defined in the tool contract. Check that the log contains: which tool, what error, how many retries, whether fallback was attempted, and final status.
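A minimal sketch of such a test follows, assuming a tool contract of "retry twice, no fallback." The `ToolError` exception, the `call_tool` wrapper, and the log field names are all hypothetical; the point is that the log carries the five facts listed above.

```python
class ToolError(Exception):
    """Raised by a tool call that failed (for example HTTP 500 or a timeout)."""


def call_tool(tool_name, tool, max_retries):
    """Call a tool, retrying on ToolError, and return (result, log)."""
    log = {"tool": tool_name, "error": None, "retries": 0,
           "fallback_attempted": False, "status": None}
    for attempt in range(max_retries + 1):  # first call plus N retries
        log["retries"] = attempt
        try:
            result = tool()
        except ToolError as exc:
            log["error"] = str(exc)
            continue
        log["status"] = "completed"
        return result, log
    log["status"] = "failed"  # assumed contract: no fallback defined
    return None, log


calls = []


def broken_search():
    """Mock tool that always fails, counting how often it was called."""
    calls.append("call")
    raise ToolError("HTTP 500 from search API")


result, log = call_tool("search_api", broken_search, max_retries=2)

assert result is None
assert len(calls) == 3  # one initial attempt plus two retries
assert log == {"tool": "search_api", "error": "HTTP 500 from search API",
               "retries": 2, "fallback_attempted": False, "status": "failed"}
```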

Test 5: Write failure

What you test: What happens when the agent's work is done but the write to storage fails.

Expected outcome: Agent detects the write failure, does not treat the run as successful, logs the write error, and closes with either "partial_success" (if delivery still worked) or "failed" (if both write and delivery failed). The agent must NOT re-attempt the entire workflow just because the write failed.

Why it matters: Write failures are the most dangerous because the agent has already done the work. The temptation is to retry the whole workflow, which can produce duplicate results. The correct behavior is: retry only the write, not the workflow.

How to write it: Mock the database to reject the insert (permission error, connection timeout, or payload rejection). Verify: no duplicate workflow execution, clean error log, correct status, delivery still attempted if possible.
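The "retry only the write" rule can be sketched as follows. Everything here is illustrative: `run_once`, `WriteError`, and the retry count are assumptions, and the status for "write succeeded but delivery failed" is not covered by the text, so the sketch labels its choice as an assumption.

```python
class WriteError(Exception):
    """Raised when the storage layer rejects or drops a write."""


def run_once(workflow, write, deliver, write_retries=2):
    """Run the workflow once, retry only the write, then attempt delivery."""
    output = workflow()  # the expensive work runs exactly once
    errors = []
    written = False
    for _ in range(write_retries + 1):
        try:
            write(output)
            written = True
            break
        except WriteError as exc:
            errors.append(f"write: {exc}")
    try:
        deliver(output)
        delivered = True
    except Exception as exc:
        delivered = False
        errors.append(f"delivery: {exc}")
    if written and delivered:
        status = "completed"
    elif written or delivered:
        # Write failed but delivery worked: partial_success, per the text.
        # Written but not delivered is mapped here too, as an assumption.
        status = "partial_success"
    else:
        status = "failed"
    return {"status": status, "errors": errors}


workflow_calls = []
outbox = []


def workflow():
    workflow_calls.append("run")
    return {"summary": "done"}


def rejecting_write(output):
    """Mock database that rejects every insert."""
    raise WriteError("permission denied")


result = run_once(workflow, rejecting_write, outbox.append)

assert len(workflow_calls) == 1          # no duplicate workflow execution
assert result["status"] == "partial_success"
assert result["errors"] == ["write: permission denied"] * 3
assert outbox == [{"summary": "done"}]   # delivery was still attempted
```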

Test 6: Approval path

What you test: What happens when the workflow reaches an approval gate.

Expected outcome: Agent prepares the proposal, sends it for approval, pauses execution, and waits. On approval: continues and completes. On rejection: stops cleanly with "rejected" status. On timeout: stops with "pending_review" status.

Why it matters: If your agent has approval gates (and most production agents should for write operations), you need to verify that the gate actually stops execution. An approval gate that logs "waiting for approval" but continues executing is worse than no gate at all.

How to write it: Three sub-tests. Submit and approve: verify execution continues and completes. Submit and reject: verify execution stops cleanly with "rejected" status. Submit and wait: verify the agent does not proceed after the timeout period and closes with "pending_review" status.
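The three sub-tests could be sketched like this. The `run_with_approval` function and its `get_decision` callback are hypothetical; in a real agent the decision call would block or poll until a reviewer responds or the timeout expires, which this sketch models by returning `None`.

```python
def run_with_approval(proposal, get_decision, execute):
    """Gate execution on a decision: "approved", "rejected", or None (timeout)."""
    decision = get_decision(proposal)
    if decision == "approved":
        execute(proposal)
        return "completed"
    if decision == "rejected":
        return "rejected"
    return "pending_review"  # timed out: nothing was executed


def check(decision):
    """Run one sub-test and return (status, list of executed proposals)."""
    executed = []
    status = run_with_approval({"action": "update_record"},
                               lambda proposal: decision,
                               executed.append)
    return status, executed


# Sub-test 1: submit and approve, so execution continues and completes.
assert check("approved") == ("completed", [{"action": "update_record"}])

# Sub-test 2: submit and reject, so execution stops cleanly.
assert check("rejected") == ("rejected", [])

# Sub-test 3: submit and wait, so nothing runs after the timeout.
assert check(None) == ("pending_review", [])
```

The assertion on the `executed` list is the important one: it checks that the gate actually prevented the write, not just that the right status was reported.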

Test 7: Fallback path

What you test: What happens when the primary tool fails and the fallback activates.

Expected outcome: Primary tool fails, agent activates fallback tool, fallback produces results (possibly lower quality), agent completes with "completed" or "partial_success" depending on fallback quality, and the log clearly shows which path was taken.

Why it matters: Fallbacks are your safety net. If they're defined in your tool contracts but never tested, you don't have a safety net; you have a wish.

How to write it: Disable the primary tool. Verify the fallback activates, produces usable results, and the agent handles the transition cleanly. Check that the log distinguishes between primary and fallback tool usage.
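A sketch of this test, under the assumption that "fallback quality" is judged by a minimum result count (an invented rule, as are `MIN_RESULTS`, `ToolError`, and `search_with_fallback`):

```python
MIN_RESULTS = 3  # hypothetical quality bar for treating fallback output as complete


class ToolError(Exception):
    """Raised when a tool is unavailable or returns an error."""


def search_with_fallback(primary, fallback):
    """Try the primary tool; on ToolError use the fallback and log the path taken."""
    log = []
    try:
        results = primary()
        log.append("primary: ok")
        return {"status": "completed", "tool_used": "primary",
                "results": results, "log": log}
    except ToolError as exc:
        log.append(f"primary: failed ({exc})")
    results = fallback()
    log.append("fallback: ok")
    status = "completed" if len(results) >= MIN_RESULTS else "partial_success"
    return {"status": status, "tool_used": "fallback",
            "results": results, "log": log}


def disabled_primary():
    """Mock of the primary tool being switched off."""
    raise ToolError("primary disabled for test")


def thin_fallback():
    """Mock fallback that works but returns fewer results."""
    return ["doc-9"]


outcome = search_with_fallback(disabled_primary, thin_fallback)

assert outcome["tool_used"] == "fallback"
assert outcome["status"] == "partial_success"
assert outcome["results"] == ["doc-9"]
assert outcome["log"] == ["primary: failed (primary disabled for test)",
                          "fallback: ok"]
```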

The rule that ties them together

Every test has an Expected Outcome column that is written BEFORE the test runs. If you write the expected outcome after seeing the actual result, you're not testing; you're documenting.

Our Test & Incident Database has these columns for exactly this reason: Expected Outcome (filled before test), Actual Outcome (filled after test), Status (Passed/Failed). If Expected matches Actual, it passes. If not, you have a bug to fix.
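The three columns can be modelled in a few lines. This is a hedged sketch, not the actual database schema: the `ScenarioRecord` class and the "Not run" status are assumptions added for illustration, and the ordering rule is enforced by refusing to record an actual outcome while the expected outcome is blank.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScenarioRecord:
    scenario: str
    expected_outcome: str                 # filled before the test
    actual_outcome: Optional[str] = None  # filled after the test

    def record_actual(self, actual):
        """Store the actual outcome, but only if an expectation already exists."""
        if not self.expected_outcome.strip():
            raise ValueError("expected outcome must be written before the test runs")
        self.actual_outcome = actual

    @property
    def status(self):
        if self.actual_outcome is None:
            return "Not run"  # assumption: the text names only Passed/Failed
        return "Passed" if self.actual_outcome == self.expected_outcome else "Failed"


record = ScenarioRecord("no_results", expected_outcome="no_results")
record.record_actual("failed")
print(record.status)  # Failed: the agent treated an empty search as an error
```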

Seven tests. Seven expected outcomes. Written before deployment. No exceptions.

— ORBIRESEARCH
