Retry, Backoff, and Circuit Breakers for LLM API Calls, The Lab

Retry LLM API calls on 429, 5xx, and timeouts with exponential backoff plus jitter, honor the Retry-After header when the provider sends one, cap total retry time with a deadline, and wrap the whole thing in a circuit breaker so a degraded provider cannot take your system down with it. Never retry non-idempotent actions blindly, and never retry validation errors at all. That is the summary. The rest of this article is why each rule exists, what the naive version costs, and the code shape we ship.

§ 01 · LLM APIs fail differently

Every API fails. LLM APIs fail with personality. Four modes dominate:

Rate limits (429). The most common and the most misunderstood. A 429 is not an error in your code; it is the provider telling you to slow down. We once treated it as fatal and it cost us a night of production downtime. The full story is in our 429 postmortem.

Overload (529 and friends). Provider-side congestion. Distinct from 429 because the bottleneck is the provider's capacity rather than your quota: slowing down still helps, but the recovery timeline is not yours to control.

Timeouts. Long generations on loaded infrastructure can exceed any sane client timeout. The dangerous part: a timeout does not mean the request failed. The provider may have completed it after you hung up, which matters enormously if the call had side effects downstream.

Degraded output. The sneakiest mode: the API returns 200 and the content is wrong, truncated, or malformed. No retry logic sees this unless your validation layer feeds it. Silent regressions are their own topic, and we cover them in a separate article.
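
One way a validation layer can feed the retry logic is to turn a bad 200 into an error the retry loop understands. The following is a minimal sketch, not a provider API: the response shape (`text`, `stopReason`), the `"max_tokens"` value, and the `DegradedOutputError` class are all assumptions made for illustration.

```javascript
// Hypothetical error type: marks a 200 response whose content failed validation.
class DegradedOutputError extends Error {
  constructor(reason) {
    super(`degraded output: ${reason}`);
    this.name = "DegradedOutputError";
    this.retryable = true; // lets a retry loop treat it like a transient failure
  }
}

// Assumed response shape: { text: string, stopReason: string }.
// Returns the parsed object, or throws DegradedOutputError.
function checkOutput(response, requiredKeys) {
  if (response.stopReason === "max_tokens") {
    throw new DegradedOutputError("truncated");
  }
  let parsed;
  try {
    parsed = JSON.parse(response.text);
  } catch {
    throw new DegradedOutputError("not valid JSON");
  }
  if (parsed === null || typeof parsed !== "object") {
    throw new DegradedOutputError("not a JSON object");
  }
  for (const key of requiredKeys) {
    if (!(key in parsed)) throw new DegradedOutputError(`missing field "${key}"`);
  }
  return parsed;
}
```

Running a check like this inside the same `try` block as the API call means a truncated or malformed answer follows the same retry path as a 5xx, and is subject to the same attempt and time limits.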

§ 02 · The retry ladder

The first decision is not how to retry, it is whether.

Retry freely: 429, 5xx, network errors, timeouts on read-only calls. These are transient by definition.

Retry carefully: timeouts on calls whose results trigger side effects. The request may have succeeded invisibly. Before retrying, the system needs an idempotency story, a way to guarantee the side effect happens once even if the call happens twice. We learned that one expensively.

Never retry: 4xx client errors other than 429 (bad requests, validation errors, authentication failures) and content policy refusals. The request is wrong, and resending a wrong request is paying twice for the same no. Retrying a 401 in a loop is how you turn an expired key into a locked account.
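
The three rungs can be written down as a single predicate. This is a sketch under stated assumptions: the error shape (`status` for HTTP errors, `kind` for timeouts and network failures) and the `call` descriptor (`hasSideEffects`, `idempotencyKey`) are hypothetical and would need mapping onto whatever your client library actually throws.

```javascript
// Hypothetical error shape: { status?: number, kind?: "timeout" | "network" }.
// Hypothetical call descriptor: { hasSideEffects?: boolean, idempotencyKey?: string }.
function isRetryable(err, call = {}) {
  const { hasSideEffects = false, idempotencyKey = null } = call;

  // Retry carefully: the provider may have finished the work after we hung up,
  // so a call with side effects is only retried if it carries an idempotency key.
  if (err.kind === "timeout") {
    return !hasSideEffects || idempotencyKey !== null;
  }

  // Retry freely: network errors, rate limits, and server-side failures.
  if (err.kind === "network") return true;
  if (err.status === 429) return true;
  if (err.status >= 500 && err.status <= 599) return true;

  // Never retry: everything else, including 400, 401, 403, and refusals.
  return false;
}
```

Defaulting to `false` is deliberate: an error nobody has classified yet is treated as non-retryable until someone decides otherwise.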

§ 03 · Backoff, jitter, and the Retry-After header

Immediate retries against a rate-limited endpoint are a self-inflicted denial of service: every client that failed at second zero retries at second one, together, and the stampede keeps the endpoint saturated. Two mechanisms break the stampede.

Exponential backoff spreads retries over time: wait 1s, then 2s, then 4s, then 8s. Jitter spreads them across clients: randomize each wait so a thousand failed requests do not become a thousand synchronized retries. And when the provider sends a Retry-After header, that number wins over your formula: it is the provider telling you exactly when to come back.

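// Assumed to exist elsewhere: llm.call(request), which throws an error carrying
// the response headers, and isRetryable(err), which classifies that error.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithRetry(request, opts = {}) {
  const maxAttempts = opts.maxAttempts ?? 5;
  const deadline = Date.now() + (opts.maxTotalMs ?? 60_000);

  for (let attempt = 1; ; attempt++) {
    try {
      return await llm.call(request);
    } catch (err) {
      if (!isRetryable(err) || attempt >= maxAttempts) throw err;

      // Retry-After in seconds; a missing or non-numeric value yields NaN
      // and falls through to the local formula.
      const retryAfterMs = Number(err.headers?.["retry-after"]) * 1000;
      const backoff = Math.min(1000 * 2 ** (attempt - 1), 30_000); // capped at 30s
      const wait = retryAfterMs > 0
        ? retryAfterMs + Math.random() * 1000 // never earlier than the provider asked
        : backoff / 2 + Math.random() * (backoff / 2); // jitter

      if (Date.now() + wait > deadline) throw err; // budget exhausted
      await sleep(wait);
    }
  }
}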

Two details in that snippet carry most of the value. The deadline: retries consume time your user or your queue is waiting through, and unbounded patience is not resilience, it is a hung system with good intentions. And the cap on backoff: waiting 512 seconds because the formula says so helps nobody.
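
One detail worth handling separately: per the HTTP specification (RFC 9110), Retry-After can carry either a number of seconds or an HTTP date. A small parser covers both forms. The function name and the `null` return convention are illustrative choices here, not part of any SDK.

```javascript
// Returns the wait in milliseconds, or null if the header is absent or unparseable.
// nowMs is injectable so the date form can be tested deterministically.
function parseRetryAfter(value, nowMs = Date.now()) {
  if (value === undefined || value === null) return null;
  const text = String(value).trim();

  // Form 1: delay-seconds, a non-negative integer such as "120".
  if (/^\d+$/.test(text)) return Number(text) * 1000;

  // Form 2: HTTP-date, such as "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(text);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, dateMs - nowMs); // a date in the past means "retry now"
}
```

A `null` result is the signal to fall back to the local backoff formula. Any non-null result is the provider's instruction and should act as a floor on the wait.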

§ 04 · Circuit breakers, or knowing when to stop knocking

Retries handle a request that failed. Circuit breakers handle a provider that is failing. The difference matters: when an endpoint is down for ten minutes, ten thousand well-behaved retrying requests are still ten thousand requests achieving nothing, burning your rate limits, your latency budgets, and your queue depth.

A circuit breaker watches the failure rate. When it crosses a threshold, the circuit opens: calls fail fast without touching the API. After a cooldown, the circuit goes half-open and a few probe requests test the water. Success closes the circuit; failure reopens it and restarts the cooldown. Three states (closed, open, half-open), one job: stop paying for calls that cannot succeed.

In agent systems the breaker earns its keep twice over, because agents retry at the task level too. An agent that cannot reach its model will often rephrase, replan, and try again, multiplying the underlying API calls. Without a breaker, one degraded provider turns a patient agent into a very expensive metronome. Breaker state changes belong on your dashboard, and if they surprise you there, your instrumentation has gaps.
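
The state machine can be sketched in a few dozen lines. This is an illustration rather than a production breaker: the class, option names, and defaults are assumptions, and the failure rate is measured over fixed batches of `windowSize` calls instead of a true rolling time window. The clock is injectable so the cooldown can be tested without waiting.

```javascript
class CircuitBreaker {
  constructor({ windowSize = 20, failureThreshold = 0.5, cooldownMs = 30_000, now = Date.now } = {}) {
    this.windowSize = windowSize;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
    this.state = "closed";
    this.openedAt = 0;
    this.successes = 0;
    this.failures = 0;
  }

  // True if a call may go out; moves open -> half-open once the cooldown ends.
  allow() {
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open";
    }
    return this.state !== "open";
  }

  // Report the outcome of a call that allow() let through.
  record(ok) {
    if (this.state === "open") return; // late result from before the trip
    if (this.state === "half-open") {
      if (ok) this.close(); else this.open();
      return;
    }
    if (ok) this.successes++; else this.failures++;
    const total = this.successes + this.failures;
    if (total < this.windowSize) return;
    if (this.failures / total >= this.failureThreshold) this.open();
    else this.close(); // healthy batch: start counting afresh
  }

  open() { this.state = "open"; this.openedAt = this.now(); this.clear(); }
  close() { this.state = "closed"; this.clear(); }
  clear() { this.successes = 0; this.failures = 0; }

  // Convenience wrapper: fail fast when open, otherwise run fn and record the result.
  async call(fn) {
    if (!this.allow()) throw new Error("circuit open: failing fast");
    try {
      const result = await fn();
      this.record(true);
      return result;
    } catch (err) {
      this.record(false);
      throw err;
    }
  }
}
```

One breaker instance per provider keeps a failing endpoint from tripping calls to a healthy one. The `open()` and `close()` methods are also the natural place to emit the state-change events that belong on the dashboard.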

§ 05 · The queue is your shock absorber

Everything above assumes a request in flight. The system-level question is what happens to work that arrives while the provider is down. The answer that survives production: a persistent queue. Requests land in the queue, workers drain it through the retry and breaker machinery, and a provider outage becomes a growing queue instead of lost work. When the provider recovers, the queue drains, and nothing that mattered disappeared during the gap. Every system we ship is built on this shape, most visibly the nine-agent logistics fleet, where a restart or an outage resumes from the last confirmed message.
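
The property that makes this work is that a job leaves the queue only after it has been handled successfully. The sketch below shows that ordering with an in-memory array standing in for a durable store; `InMemoryQueue`, `drain`, and the `peek`/`ack` interface are hypothetical names, and a real system would back them with a database or message broker.

```javascript
// Stand-in for a persistent queue: a real one would survive process restarts.
class InMemoryQueue {
  constructor(items = []) { this.items = [...items]; }
  peek() { return this.items[0]; }
  ack() { this.items.shift(); } // remove the head only once it is confirmed done
  get size() { return this.items.length; }
}

// Drains jobs in order. On the first failure it stops and leaves that job,
// and everything behind it, in the queue for the next drain.
async function drain(queue, handle) {
  let processed = 0;
  while (queue.size > 0) {
    const job = queue.peek();
    try {
      await handle(job);
    } catch (err) {
      return { processed, stoppedOn: job, error: err };
    }
    queue.ack();
    processed++;
  }
  return { processed, stoppedOn: null, error: null };
}
```

In this shape `handle` would be the retry-and-breaker wrapped model call, so a drain that stops early means the provider is still down and the work is simply waiting.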

§ 06 · The fallback model is uptime engineering

This month a frontier model vanished from the market for twenty days on a government decision. No retry policy fixes that. The last layer of this stack is a fallback model: a second provider or a smaller model, pre-tested against your workload, behind the same interface. Not because the fallback is as good, but because degraded service beats no service, and because the switch has to be a config change, not an engineering sprint that starts the morning of the outage.
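
"Behind the same interface" and "a config change" can be made concrete with a registry lookup. This is a sketch under assumptions: the registry, the `activeModel` config key, and the stub clients are all hypothetical, and real entries would wrap each provider's SDK behind one shared `call(request)` signature.

```javascript
// Looks up the client named in config. Throws early if the name is unknown,
// so a typo in config fails at startup rather than on the first request.
function resolveModel(config, registry) {
  const client = registry.get(config.activeModel);
  if (typeof client !== "function") {
    throw new Error(`unknown model "${config.activeModel}" in config`);
  }
  return client;
}

// Stub clients with identical signatures; real ones would call the providers.
const modelRegistry = new Map([
  ["primary", async (request) => `primary answered: ${request.prompt}`],
  ["fallback", async (request) => `fallback answered: ${request.prompt}`],
]);

// Switching providers is a change to this one value, not to calling code.
const activeClient = resolveModel({ activeModel: "fallback" }, modelRegistry);
```

The lookup is the easy part. The pre-testing is what makes the switch safe: the fallback entry should have been run against the real workload before the day it is needed.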

§ 07 · The checklist

[ ] 429 and 5xx retried with exponential backoff and jitter

[ ] Retry-After header honored over the local formula

[ ] Total retry time capped by a deadline, per request

[ ] Non-idempotent calls have an idempotency key before any retry

[ ] 4xx errors other than 429 (bad requests, auth failures) and refusals are never retried

[ ] Circuit breaker per provider, state visible on the dashboard

[ ] Persistent queue in front of the model, drains after outages

[ ] Fallback model tested against the real workload, switchable by config

§ 08 · Closing

None of this is exotic. Backoff, breakers, queues, and fallbacks predate language models by decades, and the teams that treat an LLM endpoint like any other unreliable dependency get boring, reliable systems. The teams that treat it as magic get to write postmortems. We know: we wrote ours, and building so the next one never happens is most of what production-grade means.

ORBIRESEARCH