When evaluating how to increase availability and reduce downtime for your deployments, solutions can commonly be categorized as either a 'High Availability' solution or a 'Fault Tolerant' solution. In this blog I thought I would take a moment to discuss pros and cons of each.
High Availability Solutions
High availability solutions traditionally consist of a set of loosely coupled servers which have failover capabilities. Each system is independent and self-contained, yet the servers monitor each other's health, and in the event of a failure, applications are restarted on a different server in the cluster. Windows Server Failover Clustering is an example of an HA solution. HA solutions provide health monitoring and fault recovery to increase the availability of applications. A good way to think of it is that if a system crashes (like the power cord was pulled), the application very quickly restarts on another system. HA systems can recover on the order of seconds, and can achieve five nines of uptime (99.999%)... but they realistically can't deliver zero downtime. They are also flexible in that they enable recovery of any application running on any server in the cluster.
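To put the five nines figure in perspective, here is a minimal sketch that converts an availability percentage into a yearly downtime budget. The function name and the 365.25-day year are my own assumptions for illustration, not part of any product or SLA definition.

```python
# Assumption: an "average" year of 365.25 days, so leap years are smoothed out.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

Under that assumption, 99.999% works out to a little over five minutes a year, which is why a recovery time of seconds per failure can still fit inside the budget.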
Fault Tolerant Solutions
Fault tolerant solutions traditionally consist of a pair of tightly coupled systems which provide redundancy. Generally speaking this involves running a single copy of the operating system and the application within, running consistently on two physical servers. The two systems are in lock step, so when any instruction is executed on one system, it is also executed on the secondary system. A good way to think of it is that you have two separate machines that are mirrored. In the event that the main system has a hardware failure, the secondary system takes over and there is zero downtime.
HA vs. FT
So which solution is right for you? Well, the initial and obvious conclusion most people instantly come to is that 'no' downtime is better than 'some' downtime, so FT must be preferred over HA! Zero downtime is also the ultimate IT utopia which we all strive to achieve, which is goodness. FT is also pretty cool from a technology perspective, so that tends to get the geek in all of us excited and interested.
However, it is important to understand that they protect against different types of scenarios... and the key is to understand which of those scenarios matter most to you and your business requirements. It is true that FT solutions provide great resilience to hardware faults, such as if you walk up and yank the power cord out of the back of the server... the secondary mirror will take over with zero client downtime. However, remember that FT solutions are running a common operating system across those systems. In the event that there is a software fault (such as a hang or crash), both machines are affected and the entire solution goes down. There is no protection from software fault scenarios, and at the same time you are doubling your hardware and maintenance costs. At the end of the day, while an FT solution may promise zero downtime, in reality that promise applies only to a small set of failure conditions. With a loosely coupled HA solution such as Failover Clustering, in the event of a hang or blue screen from a buggy driver or leaky application, the application will fail over and recover on another system.
Another mitigating factor to remember is that most hardware components can be configured in a resilient fashion... such as dual NICs with NIC Teaming, or dual HBAs with multi-path software, or have inherent redundancy... such as redundant power supplies. So the fundamental question to ask yourself is how often are you walking through the datacenter and accidentally tripping over a power cord? Or how often do you have a non-redundant piece of hardware fail... such as a motherboard?
My perspective is that if you are having motherboards fail on a regular basis, it's probably time to find a new hardware vendor anyway. FT also comes at a cost... primarily the performance degradation associated with keeping two systems synchronized, which can be significant. It also increases the cost, in that you need to have redundant systems both actively consuming resources. That equates to a 100% extra resource requirement. Windows Server Failover Clustering allows for as little as 1 node of capacity reserved for failure/recovery, with up to 16 nodes in the cluster, which equates to reserving 6.25% of the cluster's capacity (about 6.7% extra on top of the 15 active nodes).
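The overhead comparison can be worked through in a few lines. This is a rough sketch with function names I made up for illustration; it treats every node as identical in capacity, which is an assumption. Note that there are two ways to express the overhead: the spare capacity as a share of the whole deployment, or as extra capacity on top of the nodes doing real work.

```python
def reserved_share(total_nodes: int, standby_nodes: int = 1) -> float:
    """Fraction of the total deployment held in reserve for failover."""
    if not 0 <= standby_nodes < total_nodes:
        raise ValueError("standby_nodes must be at least 0 and less than total_nodes")
    return standby_nodes / total_nodes


def extra_over_active(total_nodes: int, standby_nodes: int = 1) -> float:
    """Reserved capacity expressed as extra on top of the active nodes."""
    if not 0 <= standby_nodes < total_nodes:
        raise ValueError("standby_nodes must be at least 0 and less than total_nodes")
    return standby_nodes / (total_nodes - standby_nodes)


if __name__ == "__main__":
    # An FT pair is modeled here as 2 systems with 1 acting as the mirror.
    for label, nodes in (("FT pair", 2), ("16-node cluster", 16)):
        print(
            f"{label}: {reserved_share(nodes):.2%} of total capacity reserved, "
            f"{extra_over_active(nodes):.2%} extra over active capacity"
        )
```

Either way you measure it, the gap is large: the mirrored pair needs a full second system, while the 16-node cluster sets aside a single node.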
So when you break it down and really start to think about the failure scenarios, if you have a critical app where performance doesn't matter and you are only worried about massive non-redundant hardware failures... then FT is probably the better solution for you.
While a loosely coupled system such as Failover Clustering cannot deliver zero downtime for hardware failures, it does protect against a wider range of failures up and down the stack, including hardware and OS failures, and it even provides application health monitoring. HA solutions can also reduce downtime for other scenarios, such as patching. With Failover Clustering, you can move the application to another server when it comes time to patch the OS or application. Disclaimer: A virtualization HA platform does not provide the application health monitoring and application failover capabilities I've discussed here for an application running in a VM guest OS, but that's a topic for another blog. My core point is to be careful not to assume all "HA" branded solutions are equal when comparing virtualization HA to OS clustering.
| Protects against     | Windows Server Failover Clustering | Fault Tolerant Solution |
| Hardware Failure     | Yes                                | Yes                     |
| OS Level Failure     | Yes                                | No                      |
| Application Failure  | Yes                                | No                      |
At the end of the day, there is no right or wrong answer. It boils down to evaluating what your most common sources of downtime are, then deploying a solution which helps mitigate them. Going back and analyzing the root causes of your downtime over the last year is a good place to start; then you can come up with a strategy on what solution best mitigates them. Additionally, the business requirements vary for each deployment, so the service level agreement (SLA) you need to achieve, and the acceptable levels of downtime for the failure conditions you need to protect against, are ultimately up to you.
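If you keep even a simple incident log, the root cause analysis can start as a quick tally like the sketch below. The incident records here are entirely hypothetical and exist only to show the shape of the exercise; substitute your own data and categories.

```python
from collections import defaultdict

# Hypothetical incident log for illustration only: (root cause, minutes of downtime).
incidents = [
    ("hardware", 45),
    ("os", 120),
    ("application", 30),
    ("os", 15),
    ("application", 90),
]


def downtime_by_cause(records):
    """Sum downtime per root cause and return the causes sorted worst first."""
    totals = defaultdict(int)
    for cause, minutes in records:
        totals[cause] += minutes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for cause, minutes in downtime_by_cause(incidents):
        print(f"{cause}: {minutes} minutes")
```

Whichever category lands at the top of your own list is the one your availability solution most needs to cover.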
Thanks!
Elden Christensen
Senior Program Manager Lead
Clustering & High-Availability
Microsoft