The agent that charged 312 customers twice, and the one missing field that caused it, The Lab

A billing agent retried a payment it had already completed and charged 312 customers a second time. Every individual step was correct. The defect was a missing idempotency key. Here is the trace, the mechanism, and the two controls that prevent it.

§ 01 · What happened

A subscription billing agent ran a routine job: charge the accounts whose renewal fell on the first of the month. It read the due list, called the payment provider for each account, recorded the result, and moved on. It had run cleanly for four months.

On the run in question, the payment provider had a slow night. A batch of charge requests took longer than the agent's network timeout. The agent's HTTP client raised a timeout, the agent caught it, logged "charge failed, retrying," and charged again. For 312 accounts, the first charge had in fact succeeded on the provider's side. The timeout fired before the success response came back. The retry charged a card that had already been charged.

The customers saw two identical charges. Support saw the tickets ninety minutes later. The total exposure was real money against real cards, plus the cost in trust, which is larger and cannot be refunded.

—— Why every step was correct ——

§ 02 · Why no single step was wrong

This is the uncomfortable part. Read the agent's behavior line by line and every decision is defensible.

The timeout was correct: you do not want an agent hanging forever on a slow provider. Catching the timeout was correct: an uncaught exception would have halted the whole batch. Retrying on a transient network error was correct: retry on transient failure is the textbook pattern. Logging and continuing was correct.

The failure is not in any step. It is in an assumption that lives between the steps: that a timeout means the action did not happen. A timeout means you did not receive a confirmation. Those are not the same thing. The request may have completed on the far side while your confirmation was still in flight. The agent treated "I did not hear back" as "it did not happen," and acted on that as if it were fact.
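One way to make that assumption visible is to give "I did not hear back" its own state instead of folding it into "failed." The following is a minimal sketch, not the agent's actual code; `Outcome`, `ProviderRejected`, and `classify` are hypothetical names introduced for illustration.

```python
from enum import Enum


class Outcome(Enum):
    CONFIRMED = "confirmed"  # the provider answered: the charge went through
    REJECTED = "rejected"    # the provider answered: the charge did not go through
    UNKNOWN = "unknown"      # no answer arrived; the charge may or may not exist


class ProviderRejected(Exception):
    """Hypothetical error: the provider responded and declined the charge."""


def classify(call):
    """Run a provider call and report only what the caller actually knows."""
    try:
        call()
    except TimeoutError:
        # No confirmation is not the same as no charge.
        return Outcome.UNKNOWN
    except ProviderRejected:
        return Outcome.REJECTED
    return Outcome.CONFIRMED
```

With three states, retry logic has to decide explicitly what to do with `UNKNOWN`, rather than inheriting whatever the "failed" branch does.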

This is a common shape of production agent failure. The model reasoned correctly. The tools worked. The system produced a wrong outcome because of an assumption no one wrote down and therefore no one tested.

—— The mechanism ——

§ 03 · The mechanism: retries without identity

A retry is safe only when the second attempt can be recognized as the same intent as the first. The mechanism that provides this is an idempotency key: a unique identifier the caller attaches to a request, so that the provider can detect a repeat and return the original result instead of performing the action twice.

The agent sent no idempotency key. From the provider's view, the retry was a brand new, unrelated charge request for the same amount on the same card. The provider did exactly what it was asked. It charged the card. There was no defect in the provider. There was no defect in the model. There was a missing field in the request.
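The provider's view can be sketched with a toy in-memory stand-in. `ToyProvider`, its `charge` signature, the card names, and the key string are all assumptions made for illustration; real providers expose idempotency in their own ways.

```python
import itertools


class ToyProvider:
    """In-memory stand-in for a payment provider (hypothetical, not a real API)."""

    def __init__(self):
        self.charges = []   # every charge actually performed
        self._by_key = {}   # idempotency key -> original receipt
        self._ids = itertools.count(1)

    def charge(self, card, amount_cents, idempotency_key=None):
        # A repeated key returns the original receipt; nothing new is charged.
        if idempotency_key is not None and idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        receipt = {"id": next(self._ids), "card": card, "amount_cents": amount_cents}
        self.charges.append(receipt)
        if idempotency_key is not None:
            self._by_key[idempotency_key] = receipt
        return receipt


provider = ToyProvider()

# Retry without a key: the provider sees two unrelated requests.
provider.charge("card_1", 1900)
provider.charge("card_1", 1900)
unkeyed = len(provider.charges)

# Retry with a key: the second call returns the first receipt.
first = provider.charge("card_2", 1900, idempotency_key="acct_2:2024-06")
second = provider.charge("card_2", 1900, idempotency_key="acct_2:2024-06")
keyed = len(provider.charges) - unkeyed
```

Here `unkeyed` counts two charges for the same card and amount, while `keyed` counts one, and `second` is the same receipt object as `first`.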

The same class of bug appears anywhere an agent takes an action that should happen at most once: sending a message, creating a record, posting a transaction, kicking off a downstream job. If the action has no stable identity, a retry is a duplicate, and any production system retries.

—— The fix ——

§ 04 · The fix, in two layers

Layer one: an idempotency key on every action that must happen at most once. Derive the key from the intent, not from the attempt. For this agent the key was the account id joined with the billing period: one renewal per account per month has exactly one identity, no matter how many times the agent retries. With that key attached, the provider recognizes the retry and returns the original receipt. The card is charged once. The agent's "retry" becomes a safe read of a result that already exists.
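The difference between a key derived from the intent and one derived from the attempt can be shown in a few lines. This is a sketch under assumptions: the function names, the `:` separator, and the `YYYY-MM` period format are illustrative choices, not the format the agent actually used.

```python
import uuid
from datetime import date


def renewal_key(account_id: str, period_start: date) -> str:
    """Key from the intent: one renewal per account per billing month."""
    return f"{account_id}:{period_start:%Y-%m}"


def attempt_key() -> str:
    """Anti-pattern: a fresh key per attempt, so every retry looks new."""
    return uuid.uuid4().hex


# Two attempts at the same renewal share one identity.
first_try = renewal_key("acct_42", date(2024, 6, 1))
retry = renewal_key("acct_42", date(2024, 6, 1))
same_intent = first_try == retry

# Per-attempt keys never match, so the provider cannot detect the repeat.
fresh_each_time = attempt_key() != attempt_key()
```

Anything that varies between attempts (timestamps, random values, retry counters) does not belong in the key, because it would give each retry a new identity.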

Layer two: an append-only ledger the agent writes to before it calls out, not after. Before the charge, the agent writes an intent row keyed by the same idempotency key, in state "pending." After the provider responds, it records the outcome, "charged" or "failed," as a new row under the same key rather than overwriting the intent. On any retry, the agent reads the ledger first. If the key already has a "charged" row, there is nothing to do. The ledger is the agent's own memory of what it has attempted, independent of whether the provider's confirmation arrived. Append-only on its own does not settle which of two concurrent runs is first; paired with a uniqueness constraint on the key's intent row it does, because the second run's insert is rejected and it knows to stand down.
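A ledger along these lines can be sketched with SQLite from the Python standard library. The table layout, the function names, and the choice to record outcomes as new rows guarded by a `UNIQUE (key, state)` constraint are assumptions for illustration, not the team's actual schema.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Create the ledger table if needed and return the connection."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS ledger (
               seq   INTEGER PRIMARY KEY AUTOINCREMENT,
               key   TEXT NOT NULL,
               state TEXT NOT NULL CHECK (state IN ('pending', 'charged', 'failed')),
               UNIQUE (key, state)
           )"""
    )
    return db


def record(db, key, state):
    """Append one row; return False if this (key, state) already exists."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute("INSERT INTO ledger (key, state) VALUES (?, ?)", (key, state))
        return True
    except sqlite3.IntegrityError:
        return False


def is_charged(db, key):
    """True if the ledger already holds a 'charged' row for this key."""
    row = db.execute(
        "SELECT 1 FROM ledger WHERE key = ? AND state = 'charged'", (key,)
    ).fetchone()
    return row is not None


def charge_once(db, key, do_charge):
    """Read the ledger, write the intent, then call out."""
    if is_charged(db, key):
        return "already charged"
    record(db, key, "pending")  # the intent is written before the call
    do_charge(key)              # may raise TimeoutError; the pending row survives
    record(db, key, "charged")
    return "charged"
```

If `do_charge` times out, the key is left with a "pending" row and no outcome, which is exactly the "attempted, result unknown" state that the provider's logs alone would not show the agent.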

The two layers cover different failures. The idempotency key protects you when the provider supports it. The ledger protects you when it does not, when there are concurrent runs, and when you need an audit trail of intent versus outcome that does not depend on a third party's logs.

—— The test that now catches it ——

§ 05 · The regression test

The test does not check that a charge succeeds. It checks that a charge is not repeated when the confirmation is lost. The harness calls the agent against a stub provider that completes the action and then drops the response so the client times out. The agent retries. The assertion: the provider recorded exactly one charge, and the ledger holds exactly one "charged" row for the key.
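The lost-confirmation test can be sketched as follows. `LossyProvider`, `charge_with_retry`, and the dictionary used as a ledger are hypothetical stand-ins for the real harness, and the stub assumes a provider that honors idempotency keys.

```python
class LossyProvider:
    """Stub provider: performs the charge, then drops the first response(s)."""

    def __init__(self, drop_first=1):
        self.charges = {}     # idempotency key -> receipt
        self.performed = 0    # number of charges actually made
        self._drops_left = drop_first

    def charge(self, key, amount_cents):
        if key not in self.charges:
            self.performed += 1
            self.charges[key] = {"key": key, "amount_cents": amount_cents}
        if self._drops_left > 0:
            self._drops_left -= 1
            # The charge exists, but the caller never hears about it.
            raise TimeoutError("response lost after the charge completed")
        return self.charges[key]


def charge_with_retry(provider, ledger, key, amount_cents, attempts=3):
    """Hypothetical agent step: same key on every attempt, ledger read first."""
    for _ in range(attempts):
        if "charged" in ledger.get(key, []):
            return
        if key not in ledger:
            ledger[key] = ["pending"]  # intent recorded before calling out
        try:
            provider.charge(key, amount_cents)
        except TimeoutError:
            continue  # outcome unknown; retry under the same identity
        ledger[key].append("charged")
        return
    raise RuntimeError(f"no confirmation for {key} after {attempts} attempts")


def test_lost_confirmation_charges_once():
    provider = LossyProvider(drop_first=1)
    ledger = {}
    charge_with_retry(provider, ledger, "acct_1:2024-06", 1900)
    assert provider.performed == 1
    assert ledger["acct_1:2024-06"].count("charged") == 1
```

Swapping the stable key for a per-attempt one makes `provider.performed` reach 2, which is the original incident reproduced in miniature.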

This test would have failed on day one. It did not exist because the team tested the happy path and the obvious failures, and "the action succeeded but the confirmation was lost" did not feel like a failure worth simulating. It is the failure that costs the most, because the system looks like it is doing the right thing the entire time.

—— End of report ——

◆ A timeout means you did not get a confirmation. It does not mean the action did not happen. Never wire a retry to the opposite assumption.

◆ Any action that must happen at most once needs a stable idempotency key derived from the intent, plus an append-only ledger the agent checks before it acts.

◆ Test the lost-confirmation case explicitly. It is the failure that looks most like success.

— ORBIRESEARCH