Congestion control in busy Applications with Exponential back-off and Jitter

Handling inevitable failures like a boss 😎

Aug 12, 2022

Premise

So your favorite YouTuber has uploaded a video and you are amongst the other thousands waiting to watch the video. The server is under load, and a few request queries are getting dropped. The UI shows the video is buffering but in the background, the system is trying to create a TCP connection with the server.

Imagine in such situations where queries are getting dropped and throttled the developers have decided to go with brute force retry logic that looks something like this.

Overview of Retry logic for a client/server pair

while(True)
{ 
   retry_shamelessly() 
}
# Yes I preferes parenthesis in the newline B)

Text within this block will maintain its original spacing when published
Some men/women just want to watch the world burn.

Consequences

It would create a backlog for the request queue and would break the system.

Intuition

In order to create room for the request queue to breathe developers use an algorithm called Exponential-backoff with the additional support for jittering.

Exponential-backoff

Overview

As the user makes requests in progress and the request still fails it’s time to slow down the user request as the system is still under pressure.

Exponential backoff as the name suggests is delaying the requests in an exponential fashion which is capped by an upper-limit time presented by tmax

Algorithm

Notations:
capped_time = tmax
current_request_number = n
base_time = b
exponent = e

sleep_time_between_call = min(tmax,b*e^n)

So when you see:

Exponential backoff with jitter using a base time of 1 second and an exponent of 2, with the maximum wait time between calls is 30 seconds

sleep_time = min(30,2^n)

The system would exponentially increase the time from 1,2,4,8,16 seconds and would cap to 30 seconds till the threshold is reached

Thought

What if many users enter into the exponential backoff at the same start time? This will again create a race condition amongst these sets of users. The developers have two choices

Drop the failing connection permanently
Modify exponential backoff to add randomness between calls

This thought of adding randomness in the sleep time is called Jitter.

Jitter

In order to randomize the calls and meditate on layered race conditions, we add random sampling for sleep time for users. The algorithm is also very simple, basically choosing a time between 0 and exponenial_backoff time.

Notations:
capped_time = tmax
current_request_number = n
base_time = b
exponent = e

sleep_time_between_call = random_sample(0,min(tmax,b*e^n))

Conclusion

The diagrams mentioned below compare the Retry-Strategy for two different users retrying at the same time. The key observation is to compare how Exponential backoff with Jittering proposes two discrete times for the users to resolve plausible race conditions.

Tradeoffs?

When comparing the performance of the algorithms we get the following tradeoffs.

Bruteforce :
1. Execution time is fast
2. No reliability for successful execution
Exponential-backoff :
1. Execution time is really high
2. Reliable with a corner case for parallel race conditions amongst users
Exponential-backoff with Jitter
1. Has Moderate runtime
2. Reliable with minimal chances of race conditions amongst users.

Thus it’s always advisable to choose Exponential-backoff with Jittering while designing the retry mechanism for your micro-service.

Service Principle 🖥

Discussion about this post