Exponential Backoff
A retry strategy where the delay between attempts increases exponentially, preventing overload on failing services.
Exponential backoff is a retry strategy where each successive retry waits longer than the previous one, typically doubling the delay. Instead of hammering a failing service every second, you wait 1s, then 2s, then 4s, then 8s, and so on.
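A minimal sketch of that doubling schedule in Go; the helper name and the one-second base are just illustrative choices:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (counting from 0),
// doubling a base delay each time: 1s, 2s, 4s, 8s, ...
func backoffDelay(base time.Duration, attempt int) time.Duration {
	return base * time.Duration(1<<attempt)
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Println(backoffDelay(time.Second, attempt)) // 1s, 2s, 4s, 8s, 16s
	}
}
```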
The goal is to give the failing service time to recover. If a database is overloaded or a third-party API is rate-limiting you, retrying immediately and repeatedly only makes the problem worse. Exponential backoff reduces the pressure, increasing the chance of recovery.
A common addition is jitter: adding a small random offset to each delay. Without jitter, all clients that failed at the same time will retry at the same time, causing a "thundering herd" that overloads the service all over again. Random jitter spreads retries out over time.
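A sketch of one simple jitter scheme, adding a random offset of up to half the computed delay; other variants randomize the entire delay. The helper name is illustrative:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// addJitter adds a random offset of up to half the computed delay, so clients
// that failed at the same moment don't all retry at exactly the same instant.
func addJitter(delay time.Duration) time.Duration {
	return delay + time.Duration(rand.Int63n(int64(delay)/2+1))
}

func main() {
	fmt.Println(addJitter(4 * time.Second)) // somewhere between 4s and 6s
}
```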
In practice, you also want a maximum delay cap. Without a cap, backoff times can grow to minutes or hours, which is rarely what you want. A typical setup might cap at 30 or 60 seconds.
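Capping is just a comparison against the computed delay; a sketch assuming a 30-second ceiling:

```go
package main

import (
	"fmt"
	"time"
)

// cappedBackoff doubles the base delay per attempt but never exceeds maxDelay.
// The delay <= 0 check guards against integer overflow at very high attempt counts.
func cappedBackoff(base, maxDelay time.Duration, attempt int) time.Duration {
	delay := base * time.Duration(1<<attempt)
	if delay <= 0 || delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Println(cappedBackoff(time.Second, 30*time.Second, attempt)) // 1s, 2s, 4s, 8s, 16s, then 30s
	}
}
```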
At some point, retrying stops making sense. If a message handler has failed 10 times with exponential backoff, the issue is likely permanent: a bug in the handler, invalid data, or a schema mismatch. This is where the dead letter queue comes in. After exhausting retries, move the message aside for inspection instead of retrying forever.
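A hedged sketch of that give-up path: a generic retry loop that, after a fixed number of attempts, hands the message to a dead letter queue instead of retrying forever. The `processWithRetry` helper and the channel standing in for the dead letter queue are placeholders, not part of any specific library:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// processWithRetry retries handle with capped, jittered exponential backoff.
// After maxAttempts failures it parks the message in deadLetter for inspection
// instead of retrying forever.
func processWithRetry(msg string, handle func(string) error,
	deadLetter chan<- string, base, maxDelay time.Duration, maxAttempts int) error {

	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = handle(msg); err == nil {
			return nil
		}
		delay := base * time.Duration(1<<attempt)
		if delay <= 0 || delay > maxDelay {
			delay = maxDelay // cap the backoff
		}
		time.Sleep(delay + time.Duration(rand.Int63n(int64(delay)/2+1))) // add jitter
	}
	deadLetter <- msg
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	dlq := make(chan string, 1)
	alwaysFails := func(string) error { return errors.New("schema mismatch") }

	// Tiny delays so the demo finishes quickly; production values would be closer
	// to a 1s base, a 30s cap, and ~10 attempts.
	_ = processWithRetry("order-123", alwaysFails, dlq, 10*time.Millisecond, 80*time.Millisecond, 4)
	fmt.Println("dead-lettered:", <-dlq)
}
```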
In message-driven systems, exponential backoff is often implemented as middleware, combined with a poison queue (another name for a dead letter queue) for messages that exhaust all retries.
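As an illustration of that wiring, here is a sketch using Watermill's router with its Retry and PoisonQueue middleware. The field names and signatures are recalled from the library's middleware package and may differ across versions, so treat this as a sketch rather than the definitive API:

```go
package main

import (
	"time"

	"github.com/ThreeDotsLabs/watermill"
	"github.com/ThreeDotsLabs/watermill/message"
	"github.com/ThreeDotsLabs/watermill/message/router/middleware"
	"github.com/ThreeDotsLabs/watermill/pubsub/gochannel"
)

func main() {
	logger := watermill.NewStdLogger(false, false)
	pubSub := gochannel.NewGoChannel(gochannel.Config{}, logger)

	router, err := message.NewRouter(message.RouterConfig{}, logger)
	if err != nil {
		panic(err)
	}

	// Messages that exhaust all retries are published to the poison topic
	// instead of blocking the handler forever.
	poison, err := middleware.PoisonQueue(pubSub, "poison_queue")
	if err != nil {
		panic(err)
	}
	router.AddMiddleware(poison)

	// Exponential backoff between handler retries: 1s, 2s, 4s, ... capped at 30s.
	router.AddMiddleware(middleware.Retry{
		MaxRetries:      10,
		InitialInterval: time.Second,
		Multiplier:      2,
		MaxInterval:     30 * time.Second,
		Logger:          logger,
	}.Middleware)

	// Handlers registered on this router now retry with backoff, and only
	// messages that still fail end up on the poison queue.
}
```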
The trade-off is latency. Every retry adds delay before the message is successfully processed. For most event-driven workloads, this is acceptable. Failed messages that eventually succeed are far better than dropped messages or cascading failures.
References
- Watermill 1.4 Released (Event-Driven Go Library) — Introduces a universal requeuer with delayed retries for failed messages. Combines the Poison middleware, DelayOnError middleware, and a PostgreSQL queue adapter to prevent a single broken message from blocking the entire topic. Failed messages reappear in the queue after a delay, giving the system time to recover.
- Durable Background Execution with Go and SQLite — Uses a retry middleware paired with a chaos middleware to prove atomicity of event handlers. The chaos middleware breaks handlers half the time, and the retry middleware ensures operations complete despite the failures.
- Watermill Retry Middleware