
The Thundering Herd Problem
What is the thundering herd problem?
Imagine this: you have an e-commerce site with hundreds of thousands of users viewing your products. To decrease load, you cache product details so that your database isn't constantly hammered by queries asking for the same result. Everyone's happy.
Now, imagine that certain products are read much more than others (e.g. a certain shirt is 90% off). As a result, the cache entry for this product will be hit much more than others, probably by multiple requests at the same time. What happens once this cache entry expires? The requests will experience a cache miss and will then try to query the database. All of a sudden, your database is being hammered by a bunch of requests that are all asking for the same data!
This is the thundering herd problem. Formally, it's the situation where a large number of processes waiting on an event are all woken at once when that event fires, demanding more processing power than our system can handle.
How do we deal with the thundering herd problem?
What is lock-based prevention?
The main problem we need to address is the large number of requests hitting our database at once. Since they're all asking for the same thing, why don't we:
- Just allow one request through to query the database and populate the cache
- Have all the other requests wait for the cache to be repopulated
This is what lock-based prevention achieves. Each request tries to acquire a lock, of which only one can exist at a time. If a request acquires the lock, that request is allowed to query the database and populate the cache. If a request doesn't acquire the lock, that means another request already has it. As a result, the request that failed to acquire the lock will just wait and recheck the cache to see if it's been repopulated.
How do we implement lock-based prevention?
Here's what the code looks like in TypeScript:
```
 1  fastify.get<{ Params: IdParams }>(
 2    "/products/:id",
 3    async (
 4      request: FastifyRequest<{ Params: IdParams }>,
 5      reply: FastifyReply,
 6    ) => {
 7      const requestId = randomUUID();
 8      const { id } = request.params;
 9      const productId = parseInt(id, 10);
10
11      if (isNaN(productId)) {
12        return reply.code(400).send({ error: "Invalid product ID" });
13      }
14
15      const cacheId = `product-${productId}`;
16      const cachedProduct = await redis.get(cacheId);
17
18      // Cache hit
19      if (cachedProduct) {
20        return {
21          data: JSON.parse(cachedProduct),
22        };
23      }
24
25      // Attempt to acquire lock
26      const TTL = 300;
27      const lockId = `product-lock-${productId}`;
28      const productLock = await redis.set(lockId, requestId, "NX", "PX", TTL);
29
30      // Failed to acquire lock.
31      // Wait and recheck cache
32      if (!productLock) {
33        const MAX_ATTEMPTS = 5;
34        const BASE_INTERVAL = 1;
35        const JITTER_CEILING = 10;
36        for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt += 1) {
37          const cachedProduct = await redis.get(cacheId);
38          if (cachedProduct) {
39            cacheStats.hits = cacheStats.hits + 1;
40            return {
41              data: JSON.parse(cachedProduct),
42            };
43          }
44
45          const jitter = Math.random() * JITTER_CEILING;
46          const interval = (BASE_INTERVAL + 2 ** attempt + jitter) * 100;
47
48          await new Promise((resolve) => setTimeout(resolve, interval));
49        }
50      }
51
52      // Extend lock if lock is about to expire
53      // but request is still fetching from database
54      const lockExtension = setInterval(async () => {
55        const ownerId = await redis.get(lockId);
56
57        if (ownerId === requestId) {
58          await redis.pexpire(lockId, TTL);
59        }
60      }, TTL - 100);
61
62      // Execute database query here
63      const result = await executeQuery("SELECT * FROM ...");
64
65      // Clear lock and delete lock extension
66      await redis.del(lockId);
67      clearInterval(lockExtension);
68
69      if (result.rows.length === 0) {
70        return reply.code(404).send({ error: "Product not found" });
71      }
72
73      // Repopulate cache
74      await redis.set(cacheId, JSON.stringify(result.rows[0]), "EX", 10);
75
76      return {
77        data: result.rows[0],
78      };
79    },
80  );
```
Some implementation details to note:
Acquiring the lock (lines 25 - 28)
Here is where the request attempts to acquire the lock. Notice that it's just setting a value in Redis, but with a special NX flag, which only allows the set to succeed if the key doesn't exist yet. In this case, each request will attempt to set the key product-lock-${productId}, and this operation only succeeds if the key hasn't been set yet. If unsuccessful (i.e. the request hasn't acquired the lock), the operation will return null.
We use the PX flag to inform Redis that we're setting the TTL in milliseconds. If you wish to set the TTL in seconds instead, use the EX flag, but make sure not to mix these up! The difference between 300 milliseconds and 300 seconds is astronomical. The reason we set the TTL is that this makes the lock automatically expire, which is great for the case where the lock holder may crash before being able to release the lock.
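The NX-plus-TTL behavior described above can be sketched with an in-memory stand-in for the Redis call (the function name and store here are illustrative, not a real client API):

```typescript
// Minimal in-memory sketch of Redis's SET key value NX PX ttl semantics.
type Entry = { value: string; expiresAt: number };
const store = new Map<string, Entry>();

function setNxPx(key: string, value: string, ttlMs: number, now = Date.now()): "OK" | null {
  const existing = store.get(key);
  // An expired key is treated as absent, just like in Redis
  if (existing && existing.expiresAt > now) {
    return null; // NX: key already exists, so the set fails
  }
  store.set(key, { value, expiresAt: now + ttlMs });
  return "OK";
}

// First caller wins the lock; the second sees null and must wait
const a = setNxPx("product-lock-42", "req-a", 300); // "OK"
const b = setNxPx("product-lock-42", "req-b", 300); // null
```

Once the TTL elapses, a new caller can acquire the lock again, which is exactly the crash-safety property the TTL buys us.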
Adding jitter (lines 32 - 50)
If another request has already acquired the lock, we can have multiple concurrent requests that are told to wait before retrying the cache. We use exponential backoff to compute how long should pass between each retry. However, this leads us to another issue. Say that we have 5000 concurrent requests told to wait. Each of them is told to then retry in 5 seconds. What happens when 5 seconds pass?
The answer is that they'll all retry the cache again at the same time. We just created another stampede! If we want to prevent this, we can add jitter, which is a small and random delay introduced to each retry. This will de-synchronize the requests, preventing all of them from trying again at the exact same time.
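The interval computation from the retry loop above can be pulled out into a small function. The constants mirror the listing; the random source is injectable here only so the behavior is easy to check:

```typescript
// Exponential backoff with jitter, as used in the retry loop above.
const BASE_INTERVAL = 1;
const JITTER_CEILING = 10;

function retryIntervalMs(attempt: number, rand: () => number = Math.random): number {
  const jitter = rand() * JITTER_CEILING; // random extra delay in [0, 10)
  return (BASE_INTERVAL + 2 ** attempt + jitter) * 100; // milliseconds
}

// Without jitter, retries grow as 200, 300, 500, 900ms... for attempts 0-3.
// With jitter, two requests on the same attempt almost never wake up at
// the same instant, which de-synchronizes the herd.
```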
Implementing lock extension (lines 54 - 60)
Picture this: A request has successfully acquired a lock, setting a TTL of 300 milliseconds. It then goes off to query the database. However, the database query takes 500 milliseconds to complete. This means that the request will lose control of the lock before it has a chance to complete the query and repopulate the cache! As a result, another request will take the lock and will try to request the same data from the database. However, since the query takes longer than the TTL of the lock, we'll be stuck in a loop. How can we resolve this?
We could increase the TTL of the lock, but then what value do we increase it to? You could increase it to a TTL of 500 milliseconds, but what if another query takes 800 milliseconds? You'd end up with the same problem!
Instead, what we can do is repeatedly increase the lifetime of the lock until the query completes. This is what lines 54 - 60 accomplish. Specifically, a request will:
- Check if it owns the lock
- If it doesn't own the lock, nothing happens
- If it owns the lock, extend the lifetime of the lock via redis.pexpire
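The renewal logic boils down to a timer callback like the one below, sketched here with an in-memory stand-in for the two Redis calls (the variable names are illustrative):

```typescript
// Sketch of the lock-extension check: renew the TTL only while this
// request still owns the lock.
const TTL_MS = 300;
let lockOwner: string | null = "req-a"; // stand-in for redis.get(lockId)
let lockExpiresAt = Date.now() + TTL_MS;

function extendIfOwner(requestId: string, now = Date.now()): boolean {
  if (lockOwner !== requestId) return false; // someone else holds it now
  lockExpiresAt = now + TTL_MS; // equivalent of redis.pexpire(lockId, TTL)
  return true;
}
```

Checking ownership first matters: if the lock has expired and been taken by another request, a stale timer must not keep extending it.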
Cleanup (lines 66 - 67)
Once the query completes, we must clear the lock-extension interval and also delete the lock itself. Why can't we just leave the lock? If we did, a future request that encounters a cache miss would see the unreleased lock and assume another request is already refreshing the entry, which is no longer true! We must delete the lock to make it explicit that no request is currently querying the database for the resource the lock is guarding.
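One common hardening worth noting (the listing above deletes the lock unconditionally) is to check ownership before deleting, so a slow request whose lock already expired can't delete a lock that another request now holds. In real Redis this check-and-delete must be atomic, typically via a small Lua script; the sketch below uses an in-memory map just to show the rule:

```typescript
// Ownership-checked lock release: only the current owner may delete.
const locks = new Map<string, string>(); // lockId -> owning requestId

function releaseLock(lockId: string, requestId: string): boolean {
  // A request that no longer owns the lock becomes a no-op
  if (locks.get(lockId) !== requestId) return false;
  locks.delete(lockId);
  return true;
}

locks.set("product-lock-42", "req-a");
releaseLock("product-lock-42", "req-b"); // false: req-b never owned it
releaseLock("product-lock-42", "req-a"); // true: lock released
```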
What are the pros and cons of this approach?
The main benefit of this approach is that it guarantees that only a single request is querying the database at any given time, which is great if each query is expensive.
However, this approach has its problems:
- Requests that are put on hold still consume resources. If we have a large number of these requests being put on hold, this can strain our system.
- There are a lot of moving parts. You have to manage UUIDs, renewal timers, and cleanups. If even one of these parts is faulty, the whole approach can break.
- It's reactive. It only fires when the cache is already expired.
The last point is worth reiterating. Lock-based prevention is reactive. It's because of this property that we have to put a lot of requests on hold. Is there a way we can proactively repopulate the cache? Is it possible to refresh the entry before it expires? The next approach does this!
What is probabilistic early expiration?
This method takes a different, more proactive approach. Instead of only allowing a single request to query the database and populate the cache, each request has a probability of doing so. In other words:
- Each request computes a probabilistic value
- If that probabilistic value is beyond a certain threshold, it will fetch from the database and repopulate the cache
- If the value doesn't reach the threshold, it will just fetch from the cache
This value needs to have certain properties to be of practical use. Specifically, it must:
- Meet the threshold much faster if the database query takes longer
- Get closer to the threshold value as the current time gets closer to the expiration time of the cache entry
The actual formula is as follows:

t − β × δ × ln(rand()) ≥ expiry

where:
- t = current timestamp
- expiry = timestamp when the entry expires
- δ (delta) = duration of the database query. Longer queries should trigger earlier refreshes
- β (beta) = tuning parameter. A higher β leads to more aggressive refreshes.
- rand() = a uniform random draw from (0, 1], which adds randomness to spread refreshes. We use ln(rand()) to provide an exponential distribution, which leads to most requests not re-fetching. Without it, a lot more requests would end up re-fetching, leading to another stampede

If the inequality holds, the request refreshes the entry early. Note that −ln(rand()) is always non-negative, so the left side is the current time pushed forward by a random amount that grows with δ and β.
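The decision can be packaged as a small function. All inputs are in seconds, and the random source is injectable here purely so the behavior can be checked deterministically:

```typescript
// XFetch early-refresh decision, matching the formula above.
function shouldRefresh(
  nowSec: number,
  expirySec: number,
  deltaSec: number, // how long the database query took
  beta = 1,
  rand: () => number = Math.random,
): boolean {
  // -ln(rand) is exponentially distributed: most draws are small, so only
  // a few requests refresh well before expiry, and more do as it nears
  return nowSec - Math.log(rand()) * beta * deltaSec >= expirySec;
}
```

Far from expiry, even a fairly unlucky draw doesn't trigger a refresh; at or past expiry, every request does.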
The other name for this approach is "XFetch", which stands for "Exponential Fetch" due to it leveraging an exponential distribution to determine whether an early recomputation triggers. More about this approach can be found in the paper that introduced it.
How do we implement probabilistic early expiration?
Here's what the code looks like in TypeScript:
```
 1  fastify.get<{ Params: IdParams }>(
 2    "/products/:id",
 3    async (
 4      request: FastifyRequest<{ Params: IdParams }>,
 5      reply: FastifyReply,
 6    ) => {
 7      const requestId = randomUUID();
 8      const { id } = request.params;
 9      const productId = parseInt(id, 10);
10
11      if (isNaN(productId)) {
12        return reply.code(400).send({ error: "Invalid product ID" });
13      }
14
15
16      const cacheId = `product-${productId}`;
17      const cachedEntry = await redis.get(cacheId);
18
19      // Cache hit.
20      // Request still checks if it should refetch
21      if (cachedEntry) {
22        const { data, delta, expiry } = JSON.parse(cachedEntry);
23        const BETA = 1;
24        const shouldRefresh =
25          Date.now() / 1000 - Math.log(Math.random()) * BETA * delta >= expiry;
26
27        if (!shouldRefresh) {
28          return {
29            data
30          };
31        }
32      }
33
34      // Lock approach
35      // Used for when there is no cache entry yet
36      const lockId = `product-lock-${productId}`;
37      const LOCK_TTL = 2;
38      const lock = await redis.set(lockId, requestId, "NX", "EX", LOCK_TTL);
39
40      if (!lock) {
41        const MAX_ATTEMPTS = 5;
42        const BASE_INTERVAL = 10;
43        const JITTER_CEILING = 10;
44        for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt += 1) {
45          const cachedProduct = await redis.get(cacheId);
46          if (cachedProduct) {
47            const { data } = JSON.parse(cachedProduct);
48            cacheStats.hits = cacheStats.hits + 1;
49            return {
50              data
51            };
52          }
53          const jitter = Math.random() * JITTER_CEILING;
54          const interval = (BASE_INTERVAL + 2 ** attempt + jitter) * 100;
55
56          await new Promise((resolve) => setTimeout(resolve, interval));
57        }
58      }
59
60
61      // Execute database query here
62      const result = await executeQuery("SELECT * FROM ...");
63
64      if (result.rows.length === 0) {
65        return reply.code(404).send({ error: "Product not found" });
66      }
67
68      // Set the cache entry
69      const ENTRY_TTL = 10;
70      const entry = {
71        data: result.rows[0],
72        delta: result.executionTime / 1000,
73        expiry: Date.now() / 1000 + ENTRY_TTL,
74      };
75      await redis.set(cacheId, JSON.stringify(entry), "EX", ENTRY_TTL);
76
77      // Delete lock if entry didn't exist yet
78      await redis.del(lockId);
79
80      return {
81        data: result.rows[0]
82      };
83    },
84  );
```
Some implementation details to note:
Always evaluate the trigger for early refetch (lines 21 - 32)
Even on a cache hit, the request should still evaluate if it will refresh the cache. Remember, this is a proactive approach where we try to refresh the cache before it expires.
Augment approach with lock-based prevention for cold caches (lines 36 - 58)
This approach assumes that the cache entry already exists, but what if the cache hasn't been populated yet? If this happens under heavy load, then multiple requests will try to hit the database concurrently (another stampede!). To address this, we augment xfetch with the lock-based approach to prevent a stampede on a cold cache. Here's how it works:
- When cache is cold, employ lock-based approach to prevent stampede
- When cache is populated, xfetch takes over to proactively refresh the cache
The cache entry must contain more than our desired data (lines 69 - 75)
When working, xfetch prevents a cache entry from ever expiring. In order to do this, what extra information does it need? It needs:
- The expiration time for the entry - The closer we are to this expiration time, the higher the likelihood of a request triggering the refresh
- The time it took to query the database - The longer this takes, the earlier we want a request to be triggered so we can account for this latency.
What are the pros and cons of this approach?
The main benefits of this approach are:
- Simplicity - Unlike lock-based prevention, xfetch has few moving parts: just one formula determines whether a request will trigger a refresh
- Proactivity - Xfetch attempts to refresh the cache before expiry. Statistically, when we're under load, the cache entries should never go cold
- No waiting requests - Xfetch doesn't require any single request to wait for another request to populate the cache. As a result, these requests don't linger around and consume system resources
However, this approach's main disadvantage is that it's not deterministic. It's possible that multiple requests will query the database for the same data. This makes xfetch suboptimal for situations where the database queries are expensive.
What have we learned?
You made it! Congratulations for making it this far. In short, here's what we covered:
- The thundering herd problem is a situation where a large number of processes waiting on an event are all woken at once when that event fires, demanding more processing power than our system can handle. There are two approaches to preventing this - lock-based prevention and xfetch.
- Lock-based prevention is a reactive approach that allows only a single request to query the database directly while the rest wait and retry from cache. It's deterministic, but has a lot of moving parts.
- Xfetch is a proactive approach wherein each request has a probability of triggering a refresh, but this probability starts off low, when the cache was newly refreshed, and increases as the cache is about to expire. It's simpler to implement, but it results in the possibility of multiple requests hitting the database with the same query.
I'll be covering more system design concepts in future posts, so stay tuned! Until then, happy coding🚀