Retry
Automatically re-attempting a failed operation, often with exponential backoff, to handle transient errors.
A retry is when you automatically re-attempt a failed operation instead of giving up on the first error. Most failures in distributed systems are transient: a network hiccup, a brief database overload, a service restarting, etc. Retrying after a short delay often resolves the issue without any human intervention.
The simplest retry strategy is to try again immediately. This works, but it can make things worse. If the database is struggling under load, hammering it with instant retries only adds fuel to the fire.
A better approach is exponential backoff: increase the delay between each retry attempt. Wait 100ms, then 200ms, then 400ms, and so on. This gives the failing service time to recover. Adding a small random jitter on top prevents multiple consumers from retrying in sync and causing a thundering herd.
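Here is a minimal sketch of such a loop in plain Go; the function name, starting delay, and attempt limit are illustrative, not taken from any particular library:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff re-attempts op until it succeeds or maxAttempts is reached.
// The delay doubles after each failure and is randomized by up to ±50%,
// so many consumers recovering at once do not retry in lockstep.
func retryWithBackoff(op func() error, maxAttempts int) error {
	delay := 100 * time.Millisecond

	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}

		// Jitter: sleep somewhere in [delay/2, delay*1.5).
		time.Sleep(delay/2 + time.Duration(rand.Int63n(int64(delay))))
		delay *= 2 // 100ms, 200ms, 400ms, ...
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	err := retryWithBackoff(func() error {
		return errors.New("transient failure") // stand-in for the real call
	}, 5)
	fmt.Println(err)
}
```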
In message-driven systems, retries happen naturally. When a handler returns an error, the message is nacked and the message broker redelivers it. A retry middleware can add more control: the maximum number of attempts, the backoff strategy, and what happens when all retries are exhausted.
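In Watermill, for example, these knobs live on the `middleware.Retry` decorator. A minimal configuration sketch, with illustrative values:

```go
package main

import (
	"time"

	"github.com/ThreeDotsLabs/watermill"
	"github.com/ThreeDotsLabs/watermill/message"
	"github.com/ThreeDotsLabs/watermill/message/router/middleware"
)

func main() {
	logger := watermill.NewStdLogger(false, false)

	router, err := message.NewRouter(message.RouterConfig{}, logger)
	if err != nil {
		panic(err)
	}

	// Re-run failed handlers with exponential backoff before nacking for good.
	router.AddMiddleware(middleware.Retry{
		MaxRetries:          5,                      // retries after the first attempt
		InitialInterval:     100 * time.Millisecond, // first delay
		Multiplier:          2,                      // 100ms, 200ms, 400ms, ...
		MaxInterval:         10 * time.Second,       // cap on the delay
		RandomizationFactor: 0.5,                    // jitter
		Logger:              logger,
	}.Middleware)

	// Handlers added to this router are now wrapped with retries.
}
```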
This is where the dead letter queue comes in. After a message fails all retry attempts, you move it aside so it does not block other messages. The combination of retries with a dead letter queue is a common and effective pattern.
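Extending the router sketch above, Watermill ships this as the `PoisonQueue` middleware; the topic name is a placeholder and `publisher` stands for any `message.Publisher`:

```go
// Messages that still fail after every retry are published to a dead
// letter topic instead of blocking the rest of the subscription.
poisonQueue, err := middleware.PoisonQueue(publisher, "events_poison")
if err != nil {
	panic(err)
}

// Order matters: middleware added first runs outermost, so the poison
// queue only sees errors that survived all retry attempts.
router.AddMiddleware(poisonQueue)
router.AddMiddleware(middleware.Retry{
	MaxRetries:      3,
	InitialInterval: 100 * time.Millisecond,
	Logger:          logger,
}.Middleware)
```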
One important constraint: retries are only safe if your handlers are idempotent. Since the same message can be processed more than once, the handler must produce the same result regardless of how many times it runs. This is closely related to at-least-once delivery.
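A common route to idempotency is deduplicating on the message ID. A hypothetical sketch, with an in-memory store standing in for what would normally be a database table with a unique constraint:

```go
package main

import (
	"sync"

	"github.com/ThreeDotsLabs/watermill/message"
)

// processedStore remembers which message IDs have already been handled.
type processedStore struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func (s *processedStore) alreadyProcessed(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.seen[id]
	return ok
}

func (s *processedStore) markProcessed(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen[id] = struct{}{}
}

// handle is idempotent: processing the same message twice has the same
// effect as processing it once.
func handle(store *processedStore, msg *message.Message) error {
	if store.alreadyProcessed(msg.UUID) {
		return nil // redelivery of a message we already handled; just ack
	}

	// ... perform the actual side effects here ...

	// In a real system, the side effects and this mark would share one
	// transaction, so a crash between them cannot cause double processing.
	store.markProcessed(msg.UUID)
	return nil
}

func main() {
	store := &processedStore{seen: map[string]struct{}{}}
	msg := message.NewMessage("message-id-1", []byte("{}"))

	_ = handle(store, msg) // first delivery: side effects run
	_ = handle(store, msg) // retry: recognized as a duplicate, no-op
}
```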
Not every error is worth retrying. A malformed message will fail every time, no matter how many attempts you make. Distinguish between temporary errors (network timeouts, rate limits) and permanent errors (invalid data, missing fields). Retry the former, send the latter straight to the dead letter queue.
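One way to encode the distinction is a sentinel for permanent failures that the retry logic checks before re-attempting; the names here are illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// errPermanent marks failures that no number of retries can fix.
var errPermanent = errors.New("permanent failure")

func process(payload []byte) error {
	if len(payload) == 0 {
		// A malformed message will fail on every attempt.
		return fmt.Errorf("empty payload: %w", errPermanent)
	}
	// ... transient failures (timeouts, rate limits) return plain errors ...
	return nil
}

func main() {
	err := process(nil)
	switch {
	case err == nil:
		// success: ack the message
	case errors.Is(err, errPermanent):
		fmt.Println("skip retries, dead-letter immediately:", err)
	default:
		fmt.Println("transient, retry with backoff:", err)
	}
}
```

Watermill offers `middleware.PoisonQueueWithFilter` for this case: it takes a predicate that decides whether a given error should go straight to the dead letter topic.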
References
- Distributed Transactions in Go: Read Before You Try — Explains that retrying failed operations internally is sometimes better than showing an error to the user. The issue might be temporary, and a retry is often a good enough fix.
- Increasing Cohesion in Go with Generic Decorators — Lists retries as a common cross-cutting concern that can be handled with the decorator pattern, alongside logging, metrics, and tracing.
- Durable Background Execution with Go and SQLite — Shows how to pair a chaos middleware with the retry middleware to prove atomicity in event handlers. If a handler breaks halfway through, retries ensure the operations complete.
- Shipping an AI Agent that Lies to Production: Lessons Learned — Discusses how failures and retries happen behind the scenes in event-driven LLM pipelines, all hidden from the user behind a single response.
- Synchronous vs Asynchronous Architecture — Covers retrying failed jobs as part of async architecture. Message brokers handle redelivery automatically, but you need to consider how retries interact with message ordering and system load.
- Watermill Retry Middleware — Documentation for Watermill's built-in retry middleware and its configuration options.