Resiliency in Distributed Systems: Timeouts, Retries, and Idempotency
In distributed systems, the idea that "everything fails all the time" is a core principle. In his InfoQ presentation, Sam Newman breaks down three fundamental patterns for building resilient systems that can withstand transient failures: timeouts, retries, and idempotency. This post summarizes the key takeaways from his talk.
The Three Golden Rules of Distributed Computing
Before diving into the patterns, it's essential to remember the "Three Golden Rules" that govern distributed systems:
- Information transmission takes time.
- Components can be unreachable.
- Resources are finite.
These rules highlight why building distributed systems is inherently complex and why resilience cannot be an afterthought.
1. Timeouts
A timeout is a client-side mechanism that stops waiting for a response once a set period has elapsed. It's a crucial tool for preventing resource saturation and ensuring a good user experience.
- Why they're crucial: Without timeouts, a struggling downstream service can cause a cascading failure by tying up resources (like threads or connections) in the upstream service.
- Challenges:
  - Too fast: Timing out too quickly abandons requests that would have succeeded, wasting the work already done on them.
  - Too slow: Timing out too slowly ties up resources and degrades system stability.
- Strategies:
  - Use data: Analyze performance data (histograms, not averages) to set realistic timeout values based on normal operational behavior.
  - Make them configurable: Timeouts should be tunable without requiring a code redeployment, allowing for adjustments during load testing or production incidents.
The primary goal of a timeout is to protect the health of your system, even at the cost of a single failed request.
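The two strategies above can be illustrated with a minimal Python sketch. It is not taken from the talk: the function names, the `DOWNSTREAM_TIMEOUT_S` environment variable, the choice of the 99th percentile, and the 1.5x headroom factor are all assumptions made for illustration.

```python
import asyncio
import os
import statistics


def suggest_timeout(latencies_s, percentile=99, headroom=1.5):
    """Suggest a timeout from observed latencies: a high percentile plus headroom."""
    # With n=100, quantiles() returns 99 cut points; index p-1 is the p-th percentile.
    cut_points = statistics.quantiles(latencies_s, n=100)
    return cut_points[percentile - 1] * headroom


def load_timeout(default_s, env_var="DOWNSTREAM_TIMEOUT_S"):
    """Read the timeout from the environment so it can be tuned without a redeploy."""
    # The variable name is hypothetical; any external config source would do.
    return float(os.environ.get(env_var, default_s))


async def call_with_timeout(make_call, timeout_s):
    """Await make_call(), but cancel it and give up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Normalize to the built-in TimeoutError across Python versions.
        raise TimeoutError(f"downstream call exceeded {timeout_s:.3f}s") from None
```

A caller might compute a default with `suggest_timeout(recent_latencies)`, pass it through `load_timeout(...)` so operators can override it, and then wrap each downstream call in `call_with_timeout`. Using a percentile rather than the mean reflects the advice to look at the latency distribution, not the average.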
2. Retries
Many failures in distributed systems are transient (e.g., a temporary network glitch). Retrying a failed request is often a simple and effective way to recover from them.
- Why they're necessary: Transient failures are a fact of life. A simple retry can often resolve the issue without any other intervention.
- Challenges: Uncontrolled retries can turn a minor issue into a major outage. A struggling service can be overwhelmed by a "thundering herd" of retry attempts, leading to a vicious cycle of failures.
- Strategies:
  - Limit retries: Always cap the number of retry attempts.
  - Use delays: Introduce a delay between retries to give the downstream service time to recover.
  - Add jitter: Add a small amount of randomness to the delay to prevent clients from retrying in synchronized waves.
  - Exponential backoff: Increase the delay multiplicatively after each failed attempt (for example, doubling it), but be mindful of the total timeout budget.
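The four retry strategies can be combined in one small helper. The following is a sketch under stated assumptions, not an implementation from the talk: `TransientError`, `backoff_delay`, `call_with_retries`, and all default values (0.1 s base delay, 2 s cap, 25% jitter, 5 s budget) are hypothetical choices for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base_s=0.1, cap_s=2.0, jitter=0.25):
    """Delay before retry number `attempt` (0-based): exponential, capped, plus jitter."""
    delay = min(cap_s, base_s * (2 ** attempt))
    # Add up to `jitter` (a fraction of the delay) so clients do not retry in lockstep.
    return delay + random.uniform(0, jitter * delay)


def call_with_retries(operation, max_attempts=4, budget_s=5.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call operation(), retrying TransientError within an attempt cap and a time budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    deadline = clock() + budget_s
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            delay = backoff_delay(attempt)
            out_of_attempts = attempt == max_attempts - 1
            out_of_time = clock() + delay > deadline
            if out_of_attempts or out_of_time:
                raise  # give up and surface the last failure
            sleep(delay)
```

Only errors explicitly marked as transient are retried here; anything else propagates immediately. The `sleep` and `clock` parameters exist so the helper can be exercised in tests without real waiting.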
3. Idempotency
An operation is idempotent if performing it multiple times has the same effect as performing it once. This is the key to making retries safe.
- The Problem: When a client retries a request due to a timeout, it doesn't know if the original request was processed or not. If the operation is not idempotent (e.g., a payment transaction), a retry could lead to unintended consequences like a double charge.
- The Solution: Request IDs:
  - The client generates a unique ID for each request and reuses the same ID on every retry of that request.
  - The server stores this ID, together with the response, upon successful processing.
  - If the server receives a request with an ID it has already processed, it doesn't re-execute the operation. Instead, it returns the original, saved response.
Designing for idempotency with request IDs from the beginning is far easier than trying to retrofit it into an existing system.
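The request ID scheme can be sketched with a hypothetical in-memory server. `PaymentServer`, `charge`, and `new_request_id` are illustrative names, not part of the talk, and the sketch assumes a single process; a real service would need to persist the IDs and responses in durable storage and make the check-and-store step atomic there.

```python
import threading
import uuid


def new_request_id():
    """Client side: generate one ID per logical request and reuse it on retries."""
    return str(uuid.uuid4())


class PaymentServer:
    """Hypothetical in-memory server that deduplicates requests by ID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._responses = {}  # request_id -> saved response
        self.charged_total = 0

    def charge(self, request_id, amount):
        # The lock makes "check, execute, store" one atomic step within this process.
        with self._lock:
            if request_id in self._responses:
                # Already processed: return the saved response, do not charge again.
                return self._responses[request_id]
            self.charged_total += amount
            response = {"status": "charged", "amount": amount}
            self._responses[request_id] = response
            return response
```

A client that times out and retries `server.charge(request_id, 25)` with the same `request_id` gets the original response back, and the customer is charged once.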
Conclusion
Timeouts, retries, and idempotency are not magic bullets, but they are essential building blocks for creating resilient distributed systems. Libraries can help implement these patterns, but they require careful configuration and a deep understanding of the business implications of failure. By embracing the reality of transient failures and designing for them proactively, you can build more robust and reliable applications.
This blog post is a summary of the key points from Sam Newman's InfoQ presentation, "Timeouts, Retries and Idempotency In Distributed Systems".