Dealing with correlated failures

While redundancy is a widely used strategy for achieving fault tolerance, it only helps when your nodes are unlikely to fail for the same reason at the same time.

Imagine a case where all of your nodes are deployed in the same data center. A data-center-wide power outage would take down every node at once and make your application unavailable, no matter how many replicas you run.

The solution is simple: reduce the risk of correlated failures by spreading your nodes across multiple data centers.
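To make the benefit concrete, here is a rough back-of-the-envelope sketch. It assumes that each node fails independently with some probability, that a data center outage takes down every node inside it, and that different data centers fail independently of each other. The function name and all the numbers are made up for illustration; they are not measurements.

```python
def outage_probability(node_failure: float, nodes_per_dc: int,
                       dc_failure: float, data_centers: int) -> float:
    """Probability that every node is down at the same time.

    Assumes independent node failures inside a data center and
    independent failures across data centers (a simplification).
    """
    # A data center is "dark" if it has an outage, or if it is up
    # but all of the nodes inside it happen to have failed.
    dc_dark = dc_failure + (1 - dc_failure) * node_failure ** nodes_per_dc
    # The application is unavailable only if every data center is dark.
    return dc_dark ** data_centers


# Illustrative numbers only: 1% node failure, 0.1% data center outage.
same_dc = outage_probability(0.01, nodes_per_dc=3, dc_failure=0.001, data_centers=1)
spread = outage_probability(0.01, nodes_per_dc=1, dc_failure=0.001, data_centers=3)

print(f"3 nodes in one data center:   {same_dc:.2e}")
print(f"3 nodes in three data centers: {spread:.2e}")
```

With three nodes in one data center, the shared outage term dominates the result, so adding more nodes in the same place barely helps. Spreading the same three nodes across data centers removes that shared term, under the independence assumption stated above.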

If you've ever wondered what availability zones (AZs) are, you now have an idea. They are physically separate data centers within a single region. AZs are far enough from each other to minimize the risk of correlated failures, but close enough to keep the latency between them low.

For stateless services, this is rather simple: you run instances in multiple AZs behind a shared load balancer. If an AZ goes down, the load balancer stops forwarding traffic to it, while the instances in the remaining AZs continue working as expected. Stateful services additionally require some kind of replication to keep the copies of their state in sync across AZs. Since the latency between AZs is low, you could use partially synchronous or fully synchronous replication protocols.
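The stateless case can be sketched in a few lines. This is a toy, in-memory model rather than how any real load balancer is implemented: the `Instance` and `LoadBalancer` classes, the round-robin policy, and the AZ names are all assumptions made for illustration, and a real system would detect a failed AZ through health checks instead of an explicit `mark_az_down` call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    name: str
    az: str  # availability zone the instance runs in


class LoadBalancer:
    """Round-robin balancer that skips instances in AZs marked as down."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.down_azs = set()
        self._next = 0  # index of the next instance to try

    def mark_az_down(self, az: str) -> None:
        self.down_azs.add(az)

    def mark_az_up(self, az: str) -> None:
        self.down_azs.discard(az)

    def pick(self) -> Instance:
        # Try each instance at most once, starting where we left off.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate.az not in self.down_azs:
                return candidate
        raise RuntimeError("no healthy instance in any AZ")


lb = LoadBalancer([
    Instance("web-1", "az-a"),
    Instance("web-2", "az-b"),
    Instance("web-3", "az-c"),
])

lb.mark_az_down("az-b")  # simulate losing one AZ
picked = [lb.pick().name for _ in range(4)]
print(picked)  # traffic only reaches az-a and az-c
```

Because the instances hold no state, nothing else has to happen when `az-b` disappears: requests simply land on the surviving AZs. A stateful service could not be handled this way without the replication described above.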

In an extreme case, a catastrophic event could take out every data center in a region. To guard against that, you can deploy to data centers in different regions, although in practice multiple regions are more frequently used for legal reasons.