Preconditions for redundancy

Redundancy is one of the most widely used strategies for achieving fault tolerance. It is used for both stateful and stateless services. It is also one of the main reasons why distributed systems can achieve better performance than single-node systems.

There are some preconditions on using redundancy:

The complexity added by introducing redundancy must not cost more availability than it adds
The system must reliably detect which of the redundant components are healthy and which are not
The system must be able to run in degraded mode
The system must be able to return to a fully redundant mode

While these preconditions seem obvious, they can be difficult to achieve, especially for stateful systems.
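To make the first precondition concrete, here is a rough back-of-the-envelope sketch. It assumes that nodes fail independently and that every request also depends on the failover machinery (the load balancer, health checks and so on), which has an availability of its own. Both are simplifying assumptions, and the function names and numbers are made up for illustration.

```python
def redundant_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes is up."""
    return 1 - (1 - node_availability) ** nodes


def system_availability(node_availability: float, nodes: int,
                        failover_availability: float) -> float:
    """Availability when every request also depends on the failover machinery.

    Assumption: the failover machinery is in series with the redundant pool,
    so the two availabilities multiply.
    """
    return failover_availability * redundant_availability(node_availability, nodes)


single_node = 0.99

# Two 99% nodes behind very reliable failover machinery: better than one node.
good_failover = system_availability(0.99, nodes=2, failover_availability=0.9999)

# The same two nodes behind flaky failover machinery: worse than one node.
bad_failover = system_availability(0.99, nodes=2, failover_availability=0.98)
```

With these illustrative numbers the first setup beats the single node and the second one loses to it, which is exactly the situation the first precondition warns about.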

Let's look at a stateless example. Stateful examples will be covered in future posts.

Let's assume we have a pool of redundant nodes behind a load balancer. A load balancer can mask many faults, such as hardware faults, memory issues and network issues, by routing requests to a healthy node.

The load balancer needs to be able to tell healthy nodes from unhealthy ones. It usually does that by using health checks. The longer it takes the load balancer to detect an unresponsive node, the longer that node keeps receiving client requests it cannot serve.
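Here is a minimal sketch of that idea, not how any particular load balancer implements it. It assumes a node is taken out of rotation after a number of consecutive failed health checks; `failure_threshold`, `record_check` and the other names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True            # what the balancer currently believes
    consecutive_failures: int = 0


@dataclass
class LoadBalancer:
    nodes: list[Node]
    failure_threshold: int = 3      # hypothetical: failed checks before eviction
    _next: int = field(default=0, repr=False)

    def record_check(self, node: Node, ok: bool) -> None:
        """Update the balancer's view of a node after one health check."""
        if ok:
            node.consecutive_failures = 0
            node.healthy = True
        else:
            node.consecutive_failures += 1
            if node.consecutive_failures >= self.failure_threshold:
                node.healthy = False

    def pick(self) -> Node:
        """Round-robin over the nodes currently believed to be healthy."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node.healthy:
                return node
        raise RuntimeError("no healthy nodes in the pool")


def worst_case_detection_seconds(interval_s: float, failure_threshold: int) -> float:
    """Rough upper bound on how long a dead node keeps receiving traffic.

    Ignores the timeout of each individual check.
    """
    return interval_s * failure_threshold
```

The trade-off sits in the two knobs: a short interval and a low threshold detect failures faster, but they also make it more likely that a healthy node is evicted because of a transient hiccup.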

When a load balancer removes an unhealthy node from the pool, the assumption is that the remaining nodes have enough capacity to serve the load. If that assumption holds, the system is able to run in degraded mode.
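The capacity assumption can be checked with simple arithmetic. The sketch below assumes identical nodes, each able to serve a fixed number of requests per second; the function names and the numbers are illustrative only.

```python
import math


def min_nodes_needed(peak_load_rps: float, node_capacity_rps: float) -> int:
    """Smallest number of identical nodes that can carry the peak load."""
    return math.ceil(peak_load_rps / node_capacity_rps)


def tolerable_failures(pool_size: int, peak_load_rps: float,
                       node_capacity_rps: float) -> int:
    """How many nodes can be lost while the rest still carry the peak load.

    Returns 0 both when there is no headroom and when the pool is already
    too small for the load.
    """
    return max(0, pool_size - min_nodes_needed(peak_load_rps, node_capacity_rps))
```

For example, a pool of 5 nodes that each handle 1000 requests per second can lose one node at a peak load of 3500 requests per second, because 4 nodes are still enough. A pool of 4 such nodes has no headroom at all, so it cannot run in degraded mode.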

Additionally, the system must be able to add new nodes to the pool. Otherwise, as nodes keep failing, at some point there will not be enough nodes left to serve the load.
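One way to picture the return to full redundancy is a small reconcile step that tops the pool back up to a desired size. This is only a sketch: `reconcile`, `desired` and the `provision` callback are hypothetical names, and a real system would also have to wait for a new node to pass its health checks before sending it traffic.

```python
from typing import Callable


def reconcile(healthy_nodes: list[str], desired: int,
              provision: Callable[[], str]) -> list[str]:
    """Return a pool topped up to `desired` nodes.

    `provision` is a caller-supplied function that starts a replacement
    node and returns its name.
    """
    pool = list(healthy_nodes)      # copy, so the caller's list is untouched
    while len(pool) < desired:
        pool.append(provision())
    return pool
```

Running a step like this periodically, or whenever a node is evicted, keeps the pool from slowly shrinking below the capacity needed for degraded mode.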

Here is a good resource on this topic.
