srvjha

System Design: Thundering Herd Problem in Distributed Systems

Why traffic spikes crash systems and how modern architectures prevent it

24/02/2026

5 min read

Tags: System Design · thundering-herd · ChaiCode · server · backend · Databases · Load Balancing

What is the Thundering Herd Problem?

Imagine the CBSE Class 12 board result website announces that results will go live at exactly 10 AM on a particular day. Before 10 AM, thousands of students are waiting and refreshing the page.

The moment the result goes live, all students send requests to the server at the same time. Instead of traffic increasing gradually, the system suddenly goes from almost no requests to thousands of requests in a split second. This sudden spike overwhelms the backend, causing crashes.

This situation, where a large number of clients simultaneously request a resource the moment it becomes available, is called the Thundering Herd Problem.

Technical Deep Dive

A thundering herd occurs when a large number of processes or threads are waiting for the same event; when that event happens, they all wake up and attempt to acquire the same resource at the same time. This creates sudden spikes in load, leading to bottlenecks, high latency, wasted work, and system crashes.

After the processes wake up, they all demand the resource, and a decision must be made as to which process can continue. Once the decision is made, the remaining processes are put back to sleep, only to wake up again later and repeat the contention.
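At the thread level, the wake-up pattern described above can be illustrated with Python's `threading.Condition`. This is a minimal sketch, not production code: `notify_all()` wakes every waiter at once, and each one then contends to re-acquire the same lock.

```python
import threading

cond = threading.Condition()
ready = False
woken = []

def waiter(i):
    with cond:
        # Every worker blocks here, waiting on the same event.
        while not ready:
            cond.wait()
    # On notify_all, every waiter wakes and contends for the same
    # lock/resource at once -- the herd.
    woken.append(i)

threads = [threading.Thread(target=waiter, args=(i,)) for i in range(20)]
for t in threads:
    t.start()

with cond:
    ready = True
    cond.notify_all()  # wakes ALL waiters simultaneously

for t in threads:
    t.join()
print(f"{len(woken)} waiters woke up at the same time")
```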

How a thundering herd occurs during cache expiry or a cache miss

When a large number of users request the same data, the backend first checks the cache.
If the data is present, it is returned immediately without touching the database. However, when the cache entry expires (TTL ends) or the cache is empty (cold start), all concurrent requests observe a cache miss.

Instead of a single request rebuilding the cache, thousands of requests simultaneously fall through to the database and attempt to fetch the same data.

This sudden spike overwhelms the database causing slowdowns, timeouts, or complete system failure.

This behavior is known as a cache stampede.
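The cache-miss fall-through above can be simulated in a few lines of Python. This is a deliberately naive sketch (the cache layout, key, and `fetch_from_db` helper are made up for illustration); it only demonstrates how every concurrent miss reaches the database.

```python
import threading
import time

cache = {}
db_calls = []  # list.append is atomic under CPython's GIL

def fetch_from_db(key):
    db_calls.append(1)      # each entry represents one expensive DB query
    time.sleep(0.05)        # simulate query latency
    return f"value-for-{key}"

def get(key):
    # Naive read-through cache: no protection against concurrent misses.
    if key in cache:
        return cache[key]
    value = fetch_from_db(key)  # every concurrent miss falls through here
    cache[key] = value
    return value

# 50 clients request the same key the instant the cache is cold/expired.
threads = [threading.Thread(target=get, args=("result",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"database was hit {len(db_calls)} times for one piece of data")
```

Only one query was actually needed, but nearly all 50 requests reached the database: that multiplication is the stampede.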

Impact on System

Unexpected CPU spikes
The system jumps from handling mostly cache reads to processing thousands of expensive database queries at once.

Increased latency
Requests queue up behind overloaded thread pools and database connections, causing response times to grow rapidly.

Resource waste
The same data is recomputed or fetched thousands of times even though only one result is needed, wasting CPU, memory, and network bandwidth.

Are a normal spike and a thundering herd the same problem?

After reading this far, you might wonder whether a thundering herd is just a normal spike problem, but that's not correct. Both result in a sudden surge of traffic, but they are fundamentally different in the pattern of the spike.

Normal Spike : Here the traffic increase builds up over some period of time, driven by events like breaking news or sales. It is usually handled by autoscaling and load balancers.

  • Analogy : A busy day at a coffee shop with a long line forming over 30 minutes

Thundering Herd : A specific scenario where a massive number of requests hit a system simultaneously, often because a cache has expired or a resource has become available, causing all waiting processes to rush for it at once. It requires specific mitigation strategies to prevent.

  • Analogy: The coffee shop door is locked until 9:00 AM, and 500 people try to push through the door at exactly 9:00:00 AM, breaking the door.

Some Well-Known Case Studies

In 2010, Facebook experienced a multi-hour outage triggered by a configuration change to a persistent storage value. The new configuration was interpreted as invalid by cache clients across the infrastructure. Each client attempted to “repair” the invalid configuration by querying backend database clusters, generating hundreds of thousands of queries per second and causing widespread database failures.

Failure Sequence

  1. Initial condition: Configuration value cached across all application servers

  2. Trigger: Configuration updated to value that failed client-side validation

  3. Amplification: Each server detected invalid value and attempted database repair query

  4. Cascading failure: Database clusters overwhelmed by synchronized query load

  5. Recovery impediment: Cache invalidation loops prevented normal service restoration

How to prevent the thundering herd problem?

To prevent this, there are a few strategies we can follow:

  • Request coalescing : This prevents the thundering herd by ensuring that when many identical requests arrive during a cache miss, only one request fetches the data while the others wait for the result.

    • Mutex : A mechanism that allows only one process or thread to access a shared resource at a time, while others wait. This prevents duplicate work, race conditions, and data corruption. It is also called a request coalescing lock.

    • Stale-while-revalidate : Serve the old (stale) cached data immediately while refreshing the data in the background. Once the refresh completes, the cache is updated for future requests.
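Here is a minimal Python sketch of request coalescing via a mutex (double-checked locking). The `fetch_from_db` helper is a hypothetical stand-in for the expensive database query.

```python
import threading
import time

cache = {}
fetch_lock = threading.Lock()
db_calls = []

def fetch_from_db(key):
    db_calls.append(1)      # one entry per actual DB query
    time.sleep(0.05)        # simulate query latency
    return f"value-for-{key}"

def get(key):
    if key in cache:            # fast path: cache hit, no lock needed
        return cache[key]
    with fetch_lock:            # coalescing lock: one fetcher at a time
        if key in cache:        # double-check: another thread may have
            return cache[key]   # filled the cache while we waited
        value = fetch_from_db(key)  # only the first thread reaches the DB
        cache[key] = value
        return value

threads = [threading.Thread(target=get, args=("result",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"database was hit {len(db_calls)} time(s)")  # 1: herd coalesced
```

The double-check inside the lock matters: threads that were queued on the lock re-test the cache before fetching, so the herd collapses into a single database query.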

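Stale-while-revalidate can be sketched like this; the entry layout, TTLs, and `fetch_fresh` helper are illustrative assumptions, not any specific library's API.

```python
import threading
import time

# Seed the cache with an already-expired ("stale") entry.
cache = {"result": {"value": "old-value", "expires_at": time.time() - 1}}
refresh_lock = threading.Lock()

def fetch_fresh(key):
    time.sleep(0.05)            # simulate an expensive recomputation
    return "fresh-value"

def get(key):
    entry = cache[key]
    if time.time() < entry["expires_at"]:
        return entry["value"]   # still fresh: normal cache hit
    # Stale: kick off ONE background refresh, but answer immediately.
    if refresh_lock.acquire(blocking=False):
        def refresh():
            try:
                cache[key] = {"value": fetch_fresh(key),
                              "expires_at": time.time() + 60}
            finally:
                refresh_lock.release()
        threading.Thread(target=refresh, daemon=True).start()
    return entry["value"]       # serve stale data with zero added latency

print(get("result"))            # served instantly from the stale entry
time.sleep(0.2)                 # give the background refresh time to finish
print(get("result"))            # now returns the refreshed value
```

Because readers never block on the refresh, the database sees at most one revalidation per expiry instead of one query per waiting client.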
Beyond these, there are plenty more strategies to prevent the thundering herd problem, such as adding random jitter to cache TTLs so that entries do not all expire at the same moment.

Summary

To summarize what we learnt so far: the thundering herd problem happens when many requests react to the same event at once and overwhelm a shared resource, often during cache expiry or service recovery. Techniques like request coalescing, mutex locks, and stale-while-revalidate help prevent duplicate work and keep systems stable under spikes.

Hope you learned something useful from this post :) feel free to share it with your friends!
