What is a single point of failure?
My favorite example is the story of a boy who grows up in a remote desert location collecting water. He becomes the target of a totalitarian government that then kills his family. He leaves his home with an old family friend who tells him he has a special power. After being captured by the imperial government and escaping, he joins a rebel group fighting back. The story of course peaks with Luke using the Force to fire two proton torpedoes into the thermal exhaust port of the Death Star, exploiting its vulnerability and single point of failure, and destroying it.
A single point of failure is a component that brings an entire system down if it fails.
One of the most common single points of failure is people, which we talked about last time when discussing the bus factor.
Let’s start with an example of a single point of failure in technology.
Imagine a single traditional desktop computer or laptop with a connection to the internet that performs a task. This setup is full of single points of failure that would render it inoperable.
- Many individual components in the computer could fail.
- The power source might disappear.
- The internet provider could fail.
- The power cord could be damaged.
Any one of these would prevent the work from being performed. The most common components within a computer to fail are the hard drive and power supply. Let’s replace the computer with a more robust system representative of a typical server class system. Typically, these systems will carry multiple hard drives, dual network interfaces, and dual power supplies.
With multiple hard drives configured redundantly, losing one drive no longer means complete failure and the data loss that follows. We’ve also mitigated the risk of power supply failure by adding a second. Each power supply plugs into a separate circuit. If one circuit is overloaded and fails, the other keeps the system running. We can utilize the dual network interfaces to connect to two separate sources of connectivity.
In this scenario, we have mitigated most of the issues pointed out earlier and lowered our risk. However, there are still components of the server that could fail. If this task is critical to our business, how can we further lower our risk and protect the process from failure?
We’ll add another server configured the same way. These two servers could work in tandem and both process the work or they could be set up so that if one fails, the other picks up and does the work.
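The second arrangement, where one server picks up the work if the other fails, can be sketched in a few lines. This is only a minimal illustration: the function names `run_with_failover`, `primary`, and `standby` are made up for this example, and a real setup would rely on health checks and a load balancer or cluster manager rather than a simple loop.

```python
from typing import Callable, Sequence


def run_with_failover(task: str, servers: Sequence[Callable[[str], str]]) -> str:
    """Try each server in order and return the first successful result."""
    errors = []
    for server in servers:
        try:
            return server(task)
        except ConnectionError as exc:
            # This server is down; remember why and try the next one.
            errors.append(exc)
    raise RuntimeError(f"all {len(servers)} servers failed: {errors}")


# Hypothetical servers: the primary is offline, the standby is healthy.
def primary(task: str) -> str:
    raise ConnectionError("primary is offline")


def standby(task: str) -> str:
    return f"standby processed {task}"


result = run_with_failover("nightly-report", [primary, standby])
print(result)
```

The caller never needs to know which server did the work, only that one of them did.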
These systems still receive power from a single source. We’ll add a UPS (uninterruptible power supply), which is a smart battery backup, so if the power fails, the system will continue to run. Batteries are fairly short-lived. In most scenarios like this, they keep things running until a more capable power source takes over or the systems shut down properly. Since this task is so critical to the business, we’ll add a generator. With the right-sized generator and a solid supply of fuel, you can run indefinitely until power is restored.
There’s still a potential problem. What if “the internet is down”? Both servers reach the internet through the same provider. We’ll add another communications provider. Now the network is connected through multiple providers. If one fails, another can still handle the communications.
We’re not out of the woods yet! More than one data center has been taken offline by a backhoe operator that inadvertently exploited a single point of failure by cutting a conduit that carried cables providing power and connectivity.
Now the site itself is a single point of failure! What if a weather event like a flood, hurricane, or tornado shuts down or destroys the site? We’ll replicate this setup at a second site in another geographic location. Now if one site goes offline, we have another site where we can still process work and conduct business.
Over the course of this example, we’ve gone from a $1,000 computer connected to a commodity internet provider to a setup that potentially costs millions of dollars. This illustrates the trade-off between the cost and the availability of a system.
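To put rough numbers on that trade-off, here is a small sketch of the standard availability arithmetic for redundant components. It assumes failures are independent, which a shared power feed or a backhoe can easily violate, and the 99% figure is a made-up starting point rather than a measurement.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures.

    The group is down only when every copy is down at the same time.
    """
    return 1 - (1 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical component that is up 99% of the time.
for copies in (1, 2, 3):
    availability = parallel_availability(0.99, copies)
    hours = downtime_hours_per_year(availability)
    print(f"{copies} copies: {availability:.6f} available, {hours:.2f} hours down per year")
```

Each added copy buys less than the one before it, while the cost of each copy stays roughly the same.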
Providers like Amazon Web Services have made this process a lot easier by commoditizing access to high-level data centers. It is much simpler to run multiple server instances across multiple geographic locations supported by robust data centers. Even then, it is important to understand where your configuration still leaves single points of failure.
How is software a single point of failure? Software components are often designed to do only one thing. Can the failure of any one of those components take down your whole application?
If your application requires people to log in, the login is a single point of failure. In most cases, it is designed to be a single point of failure in that if someone can’t log in you don’t want them to use the application.
Another example could be the inability to process data. If your application loses the connection to its primary data store and is unable to save data effectively, it could break the application. A mitigation for this would be to institute a write queue. This technique allows your application to push data into a queue and the queuing system is then responsible for saving the data. If the database is unavailable, work will queue up until it is available again. A read-only database that has a last known good copy of the data could be utilized to make the application functional. In this instance, you’ve allowed your application to operate at a reduced capacity rather than being completely unavailable.
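Here is a minimal sketch of the write queue idea. `FlakyDatabase` and `WriteQueue` are hypothetical names invented for this illustration, and the queue lives in memory only to keep the example short. In practice you would want a durable queuing system, otherwise the queue becomes a single point of failure of its own.

```python
from collections import deque


class FlakyDatabase:
    """Hypothetical stand-in for the primary data store."""

    def __init__(self):
        self.available = True
        self.rows = []

    def insert(self, row):
        if not self.available:
            raise ConnectionError("database unavailable")
        self.rows.append(row)


class WriteQueue:
    """Accepts writes immediately and saves them once the database is reachable."""

    def __init__(self, database):
        self.database = database
        self.pending = deque()

    def save(self, row):
        # The application only enqueues; the queue is responsible for saving.
        self.pending.append(row)
        self.flush()

    def flush(self):
        """Write pending rows in order; stop at the first failure and keep the rest."""
        while self.pending:
            try:
                self.database.insert(self.pending[0])
            except ConnectionError:
                return False
            self.pending.popleft()
        return True


db = FlakyDatabase()
writes = WriteQueue(db)

db.available = False          # simulate an outage
writes.save({"id": 1})
writes.save({"id": 2})
queued_during_outage = len(writes.pending)

db.available = True           # the database comes back
writes.flush()
print(queued_during_outage, db.rows)
```

The application keeps accepting work during the outage, and the rows arrive in their original order once the database returns.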
Many times I’ve seen a scenario where a dependency will break, or a third-party provider will become unavailable, and it affects the entire application. Perform tests with these components to understand how the application degrades and minimize the effect. If an email system is unavailable, while the components that send email will be affected, the rest of the application should be able to continue to function with a notification to the user.
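As a rough sketch of that email example, the code below places an order even when the mail provider is down and tells the user what happened. The names `send_email` and `place_order` are assumptions made for this illustration, not part of any real library.

```python
def send_email(to, subject):
    """Hypothetical mail client; here it simulates a provider outage."""
    raise ConnectionError("email provider unavailable")


def place_order(order_id, customer_email, mailer=send_email):
    """Place the order even if the confirmation email cannot be sent."""
    result = {"order_id": order_id, "status": "placed", "notices": []}
    try:
        mailer(customer_email, f"Order {order_id} confirmed")
    except ConnectionError:
        # Contain the failure: the order stands, and the user is told about the email.
        result["notices"].append(
            "Your order was placed, but we could not send a confirmation email."
        )
    return result


outcome = place_order(42, "user@example.com")
print(outcome)
```

Passing the mailer in as a parameter also makes it easy to test how the application degrades, by swapping in a client that fails on purpose.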
I want to look at a concept related to a single point of failure. While I don’t consider a bottleneck to be a perfect example of a single point of failure, it’s worth mentioning. A process may be sufficiently redundant to lower the risk of it being completely unavailable. If that process is data intensive, however, it still has a high risk of becoming a bottleneck. When all work passes through that stage in sequence, the stage can hinder the operation of the entire system as it scales.
An often overlooked single point of failure is an external provider. If you depend on a service a third party provides, or have built your product on top of a third-party platform, be aware of the risks that decision imposes on you.
If you allow users to log in with a social login such as logging in with Google or Facebook, understand that if those providers are unavailable or someone compromises an account on those systems, it will affect the ability of your users to access your application.
The security of your application is an important consideration.
A data breach or unauthorized access could have a catastrophic impact on your organization. A single point of failure would be an open server that requires only a password for full access. Structure layers of security so that multiple points challenge a user. If the system restricts connectivity to a particular physical computer, that presents an initial challenge. You can build on that by restricting who can log in to a server to trusted team members with separate authentication. Further, you can require more specific access privileges before those team members can operate at an administrative security level. Applications layer their challenges in a similar way:
- Often, important account updates will require a user to re-enter their password.
- Resetting a password requires access to the user’s email.
- Two-factor authentication requires a password as well as access to the user’s phone.
These examples are intended to help you understand how to identify single points of failure and think through their mitigation or resolution.
Think about the system or process you are analyzing. Break it down into smaller components that make up the system. If the failure of one of those components would shut down the system or destroy it, then you have identified a single point of failure.
To address these risks, add a redundant option for that component. If adding a redundant component isn’t an option, contain the failure of the component so the effects don’t cascade through the system causing other failures or a system-wide failure.
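The analysis above can be sketched as a simple inventory check: list each role the system depends on, list what can fill that role, and flag any role with fewer than two options. The inventory below is hypothetical, and the check is deliberately naive. It won't notice two "redundant" options that share a hidden dependency, such as two providers running through the same conduit.

```python
from typing import Dict, List


def single_points_of_failure(system: Dict[str, List[str]]) -> List[str]:
    """Return the roles that have fewer than two options to fill them."""
    return sorted(role for role, options in system.items() if len(options) < 2)


# Hypothetical inventory: each role maps to the components that can fill it.
system = {
    "server": ["server-a", "server-b"],
    "power": ["utility", "generator"],
    "internet": ["provider-1"],
    "site": ["primary-site"],
}

risks = single_points_of_failure(system)
print(risks)
```

Each role the check reports is a candidate for either a redundant option or a plan to contain its failure.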