Congestion control in busy Applications with Exponential back-off and Jitter
Handling inevitable failures like a boss π
Premise
So your favorite YouTuber has uploaded a video and you are amongst the other thousands waiting to watch the video. The server is under load, and a few request queries are getting dropped. The UI shows the video is buffering but in the background, the system is trying to create a TCP connection with the server.
Imagine in such situations where queries are getting dropped and throttled the developers have decided to go with brute force retry logic that looks something like this.
while(True)
{
retry_shamelessly()
}
# Yes I preferes parenthesis in the newline B)
Some men/women just want to watch the world burn.
Consequences
It would create a backlog for the request queue and would break the system.
Intuition
In order to create room for the request queue to breathe developers use an algorithm called Exponential-backoff with the additional support for jittering.
Exponential-backoff
Overview
As the user makes requests in progress and the request still fails itβs time to slow down the user request as the system is still under pressure.
Exponential backoff as the name suggests is delaying the requests in an exponential fashion which is capped by an upper-limit time presented by tmax
Algorithm
Notations:
capped_time = tmax
current_request_number = n
base_time = b
exponent = e
sleep_time_between_call = min(tmax,b*e^n)
So when you see:
Exponential backoff with jitter using a base time of 1 second and an exponent of 2, with the maximum wait time between calls is 30 seconds
sleep_time = min(30,2^n)
The system would exponentially increase the time from 1,2,4,8,16 seconds and would cap to 30 seconds till the threshold is reached
Thought
What if many users enter into the exponential backoff at the same start time? This will again create a race condition amongst these sets of users. The developers have two choices
Drop the failing connection permanently
Modify exponential backoff to add randomness between calls
This thought of adding randomness in the sleep time is called Jitter.
Jitter
In order to randomize the calls and meditate on layered race conditions, we add random sampling for sleep time for users. The algorithm is also very simple, basically choosing a time between 0 and exponenial_backoff time.
Notations:
capped_time = tmax
current_request_number = n
base_time = b
exponent = e
sleep_time_between_call = random_sample(0,min(tmax,b*e^n))
Conclusion
The diagrams mentioned below compare the Retry-Strategy for two different users retrying at the same time. The key observation is to compare how Exponential backoff with Jittering proposes two discrete times for the users to resolve plausible race conditions.
Tradeoffs?
When comparing the performance of the algorithms we get the following tradeoffs.
Bruteforce :
Execution time is fast
No reliability for successful execution
Exponential-backoff :
Execution time is really high
Reliable with a corner case for parallel race conditions amongst users
Exponential-backoff with Jitter
Has Moderate runtime
Reliable with minimal chances of race conditions amongst users.
Thus itβs always advisable to choose Exponential-backoff with Jittering while designing the retry mechanism for your micro-service.
Insightful. Thank you!
nicely put