Last Tuesday at 2:47 AM, my Trend Hunter agent failed. The log showed a clean 429 (Too Many Requests) from the Exa API, followed by what looked like a normal exponential backoff, followed by silence. No completion. No failure status. The workflow just stopped.
I assumed it was a transient rate limit and restarted. It failed again. Same pattern. Clean 429, backoff, silence.
It took me three hours to find the real problem. I'm writing this so it takes you three minutes.
What I thought happened
My first assumption was simple: Exa rate-limited me, my retry logic kicked in, and something in the retry timing was off. Maybe the backoff multiplier was too aggressive. Maybe I was hitting a daily limit, not a per-minute limit.
I checked the retry configuration: 3 attempts, exponential backoff starting at 2 seconds (2s, 4s, 8s), timeout at 30 seconds per attempt. That all looked correct. The logs confirmed all three retries executed.
But the third retry never completed. It started, sent the request, and then, nothing. No response logged. No timeout. No error. Just a gap in the log.
What actually happened
The real problem was not in the retry logic. It was in what happened during the retry.
While my agent was waiting for retry number two, the cron scheduler fired the next daily run. Now I had two instances of Trend Hunter running simultaneously, both hitting the same Exa API key. The second instance sent its requests, got a 429 (because instance one had already consumed the rate limit), started its own retry cycle, and the two instances began competing for the same rate limit budget.
By retry three, both instances were sending requests within milliseconds of each other. Exa responded to one and dropped the other. The dropped request never got a response , not a 429, not a timeout, just a TCP connection that hung until the process was killed by the operating system hours later.
The silence in my log wasn't a bug in my retry logic. It was a hanging HTTP connection caused by concurrent instances fighting for the same rate limit.
The trace that revealed it
Here's what I should have seen immediately but missed because my logging was too sparse:
The first thing I added was instance identification. Every run now logs a unique run_id at startup. When I re-ran with this logging, I immediately saw two different run_ids overlapping in the same time window.
The second thing I added was request-level logging with timestamps accurate to the millisecond. This showed me that retry three from instance one and the initial request from instance two were sent 12 milliseconds apart.
The third thing I added was connection state logging. This showed me that retry three's connection stayed in state "AWAITING_RESPONSE" for 47 minutes until the OS killed it.
The fix
Three changes, each preventing a different layer of this problem:
Fix 1: Concurrency lock. Before starting a run, the agent checks a "running" flag in the workflow_runs table. If another instance is already running, the new instance exits immediately with status "skipped_concurrent_run." This prevents two instances from ever running simultaneously.
Fix 2: Connection timeout. Every HTTP request now has a hard connection timeout of 10 seconds (separate from the response timeout of 30 seconds). If the TCP connection doesn't establish within 10 seconds, the request fails immediately instead of hanging indefinitely.
Fix 3: Rate limit awareness. Before sending a request, the agent checks how many requests it has sent in the current window. If it's within 80% of the known rate limit, it waits until the window resets before sending. This is proactive rate limiting, not reactive retry after hitting the wall.
The regression test
I added a test specifically for this scenario. The test simulates:
· Start run A
· While A is in retry-backoff, start run B
· Verify B exits immediately with "skipped_concurrent_run"
· Verify A completes normally
· Verify no hanging connections exist after both runs finish
This test now runs before every deployment. It catches the exact failure pattern that cost me three hours to debug manually.
What I learned
The retry pattern itself was fine. Exponential backoff with 3 attempts and reasonable timeouts is correct. The problem was everything around the retry: no concurrency protection, no connection timeout, no proactive rate awareness, and insufficient logging to see what was really happening.
A retry pattern is not a reliability strategy. It's one component of a reliability strategy that must also include: concurrency control, connection management, proactive rate limiting, and logging detailed enough to reconstruct what happened.
If your agent has retry logic but not these other four things, your retry logic is hiding problems, not solving them.
Checklist after this incident
· Does your agent prevent concurrent instances from running?
· Do your HTTP connections have a hard timeout separate from response timeout?
· Does your agent proactively check rate limits before sending requests?
· Does every run log a unique run_id?
· Can you reconstruct the exact sequence of events from your logs?
If any answer is no, you have the same vulnerability I had.
— ORBIRESEARCH