Fault Tolerance on the Cheap: How to Build Systems that Probably Won't Fall Over, Part II

Written by: Brian Troutwine

Welcome to the second part of "Fault Tolerance on the Cheap." In the previous article, we discussed two approaches organizations take toward building fault-tolerant systems:

  1. Perfection -- projects with extensive specification and long development cycles reduce faults up front while taking measures to mitigate understood failure modes

  2. Hope for the Best -- projects that go out into the world on short development cycles and occasionally require heroic effort to keep online

Implicit in these two approaches is a degree of crisis when things go wrong. A dropped request on a social media website is not catastrophic to the same degree that the loss of a flight computer holding re-entry attitudes is. However, if you pile up enough dropped requests over a long enough time period, this begins to add up to real money and that becomes a crisis.

There is a class of software in which faults are important but in some limited fashion tolerable. It's this class of software we'll discuss in this article.

Embracing Failure

A system that is built embracing faults is something of a hybrid of the two approaches previously discussed.

There is moderate up-front design, but the system is placed in service long before a rigidly specified system would begin development. This means the problem domain and system goals are only partly understood up front; the rest is discovered during operation. That is, faults will occur in production, and the system must be designed in a modular manner, with sub-systems linked along interfaces assumed to be failure-prone.

(Current literature on microservice architecture in networked environments stresses this point well, as has the literature on real-time embedded systems, which has isolated sub-systems across buses for some time.)

A system so designed will sometimes have faults which must be worked live. To do this well, the system must be designed for introspection, the exact form of which varies but quite often will be some combination of logs, real-time telemetry, and small, easily comprehended sub-systems.
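As a small illustration of designing for introspection, a sub-system might emit structured log events that can be grepped during a live incident or shipped to a telemetry pipeline later. This is a minimal sketch using only Python's standard library; the logger name, event names, and fields are invented for illustration.

    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("billing.worker")

    def emit(event, **fields):
        # One JSON object per line: easy to grep while working a fault live,
        # easy to ship to a telemetry system afterward.
        log.info(json.dumps({"event": event, "ts": time.time(), **fields}))

    emit("request.handled", duration_ms=12, status="ok")
    emit("db.reconnect", attempt=3, backoff_s=1.5)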

There are four conceptual stages to a fault-embracing system, growing from smallest to largest.

1: Component

In her 1993 paper "Analyzing Software Requirements Errors in Safety Critical Embedded Systems," Dr. Robyn R. Lutz examined the errors uncovered in the Voyager and Galileo spacecraft during testing. She wrote that:

Functional faults (operating, conditional, or behavioral discrepancies with the functional requirements) are the most common kind of software error.

She noted that there were "few internal faults (e.g., coding errors integral to a software module)" and that it appears coding errors are "being detected and corrected before system testing begins."

What Dr. Lutz refers to as a 'module,' I call a 'component' in this article. The 'component' level is the most atomic of the system. These are single-responsibility, do-one-thing-and-do-it-well kind of things. In an OOP system, classes would easily fill this role; small libraries with limited APIs do as well.

In the Structured Programming era, the name of the game was 'routines', functions operating over abstract datatypes. The exact nature varies by application, but what matters is the interface. If the internal workings are not exposed and interaction is limited according to some contract -- possibly well-defined, possibly not -- you have a 'component'. These are combined to form the larger system.
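As a hypothetical example of a component in this sense, consider a small token-bucket rate limiter: its entire contract is a single allow() call returning a boolean, and the bookkeeping inside never leaks out.

    import time

    class RateLimiter:
        """Contract: allow() returns True at most `rate` times per second.
        Everything else -- tokens, clocks -- is internal detail."""

        def __init__(self, rate: float):
            self._rate = rate
            self._tokens = rate
            self._last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            elapsed = now - self._last
            self._last = now
            # Refill proportionally to elapsed time, capped at the bucket size.
            self._tokens = min(self._rate, self._tokens + elapsed * self._rate)
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False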

Progress toward removing or mitigating faults at the component level has an outsized impact on the good operation of higher levels. Following Dr. Lutz' research, the location of concern in a robust system should be along the interfaces between components.

Immutable data structures, isolation of side-effects into well-understood places, compile-time guarantees, and the application of formal program analysis (or its cousin twice removed, type-guided randomized testing) are all hugely impactful here. The intention of each of these techniques is to reduce the cognitive surface of a component by reducing unintended interactions or purposefully limiting possible behavior.
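To make that concrete, here is a minimal sketch: an immutable value type plus a randomized test of its contract. The Hypothesis library stands in for 'type-guided randomized testing'; the Account type and its invariant are invented for illustration.

    from dataclasses import dataclass, replace
    from hypothesis import given, strategies as st

    @dataclass(frozen=True)   # immutable: callers cannot mutate state in place
    class Account:
        balance_cents: int

        def deposit(self, amount_cents: int) -> "Account":
            if amount_cents < 0:
                raise ValueError("deposits must be non-negative")
            return replace(self, balance_cents=self.balance_cents + amount_cents)

    @given(st.integers(min_value=0), st.integers(min_value=0))
    def test_deposit_never_loses_money(start, amount):
        # Randomized inputs probe the component's contract, not its internals.
        assert Account(start).deposit(amount).balance_cents == start + amount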

Looked at in a certain way, these techniques taken together are Functional Programming, but that isn't strictly true. The SPARK programming language is very amenable to formal methods and has compile-time guarantees commensurate with its role as a safety-critical programming language, while in no sense being Functional.

Understandable and simple, maybe even aggressively so, is the name of the game at this lowest level of the system. The next level up is the 'machine'.

2: Machine

Considering the composition of components, call this system level 'machine'. This may or may not map directly to a real, physical computer, but it is a helpful abstraction.

The 'machine' level is where components' interactions are exercised and faults in their interactions are to be found. There are likely to be two classes of fault that occur at this level: systematic misunderstandings of requirements, per Lutz, and transient faults.

Transient faults are interesting in that, for some input, they exercise some code path which doesn't behave as expected. These are 'bugs' that might have been caught in testing had there been more time and a more rigorous process.

Joe Armstrong's 2003 thesis "Making Reliable Distributed Systems in the Presence of Software Errors" covers this class of fault. He argued that the "philosophy for handling errors can be expressed in a number of slogans:

  • Let some other process do the error recovery.

  • If you can’t do what you want to do, die.

  • Let it crash.

  • Do not program defensively."

That is, in a reliable system, faults are not meant to be corrected automatically but contained. The correction is pushed off onto some future work, but the system itself remains online and able to service traffic.

How you contain faults depends on the composition of the system. Armstrong's thesis is explicitly about the Erlang programming language, which internalizes message passing interfaces arranged around supervised, restartable components. George Candea and Armando Fox's paper, "Crash-Only Software," works in this same vein, though not tied to any particular programming language.

Both papers stress the need to detect failed sub-systems and restart them at will. Addressing components only by name, abstracting the unique instances of a component behind something invariable, admits this crash/restart strategy along "externally enforced boundaries," as Candea puts it.
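A minimal sketch of that strategy, assuming nothing beyond the Python standard library: the supervisor knows the worker only by name and entry point, and when the worker dies it is restarted rather than repaired in place. The worker and its fault are invented for illustration.

    import multiprocessing as mp
    import time

    def worker(name):
        # A deliberately flaky component: no defensive recovery inside.
        print(f"{name}: doing work")
        time.sleep(1)
        raise RuntimeError(f"{name}: transient fault")

    def supervise(name, target, max_restarts=5):
        # "Let it crash": the supervisor contains the fault by restarting
        # the named component, escalating if the restart budget runs out.
        for attempt in range(1, max_restarts + 1):
            proc = mp.Process(target=target, args=(name,), name=name)
            proc.start()
            proc.join()
            if proc.exitcode == 0:
                return
            print(f"supervisor: {name} died (exit {proc.exitcode}), restart {attempt}")
        print(f"supervisor: {name} exceeded restart budget, escalating")

    if __name__ == "__main__":
        supervise("cache-warmer", worker)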

Components must be designed to handle the temporary loss of a cooperative component, adding some complication to the system while increasing robustness in the face of transient faults. Readers operating in a microservice architecture will recognize this broad approach.
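What that complication can look like, as a hypothetical sketch: a caller assumes its collaborator may be briefly gone, retries with backoff, and degrades to a stale answer rather than failing its own callers.

    import time

    def call_with_fallback(fetch, stale_value, attempts=3, backoff_s=0.2):
        """Try a cooperating component a few times; if it stays down,
        serve a stale-but-usable answer instead of propagating the fault."""
        for i in range(attempts):
            try:
                return fetch()
            except ConnectionError:
                time.sleep(backoff_s * (2 ** i))   # exponential backoff
        return stale_value                         # degrade, don't die

    # Usage: `fetch_profile` stands in for a call to a restartable peer.
    # profile = call_with_fallback(fetch_profile, stale_value=last_known_profile)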

3: Cluster

Once you have a machine that's capable of isolating faults in itself, you're still exposed to catastrophic failure of the machine as a whole.

At this point, we've wandered into more familiar territory. Common wisdom here advocates redundant machines arranged such that there are no single points of failure. What is less commonly understood is the exposure of systems to low-probability events, or chain accidents.

Charles Perrow called these "normal accidents" in his "Normal Accidents: Living with High-Risk Technologies." These are compounding faults, each of which, taken in isolation, would not be an issue.

Consider a common web application host blinking on and off, connecting both to a system-shared database and a system-local consensus service for discovery. In isolation, you'll see database TCP connections dropping and reconnecting, user requests failing on occasion, and adjustments to the record in the consensus service. But, perhaps, blinking hosts exercise some fault in the consensus service which, over time, causes the service to fail to achieve consensus.

Application hosts, assuming a perfect source of truth from the service, slowly diverge from one another in their understanding of the world, fall out of sync, exercise some bad code path, and begin blinking on and off. As the blinking increases, the database load from connecting and disconnecting clients rises until some tipping point is reached and the database becomes unavailable, denying the remaining healthy hosts the ability to service traffic.

In Perrow's analysis, we guard against normal accidents by identifying systems prone to them and refusing to build those systems if the potential accidents are severe enough. This is good advice. If that severity threshold is not reached, two practices will stem, though not totally resolve, normal accidents:

  • extensive system telemetry, and monitoring of it

  • comprehensive mean-time-between-failures (MTBF) analysis

In our hypothetical, the accident might have been recognized as it built had the web application hosts reported their database reconnect attempts over some smallish time period and, secondarily, had the hosts reported the last-updated time for their local consensus services. There are many products and companies on the market to help with this particular challenge.
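A sketch of that first signal, assuming the hosts already have some way to ship or scrape metrics: count reconnect attempts and flush the counter on a small fixed interval, so a building accident shows up as a rising per-window number rather than scattered log lines. The names and window size here are invented.

    import threading
    import time

    class WindowedCounter:
        """Counts events and reports the total once per fixed window."""

        def __init__(self, name, window_s=10, report=print):
            self._name, self._window_s, self._report = name, window_s, report
            self._count = 0
            self._lock = threading.Lock()
            threading.Thread(target=self._flush_loop, daemon=True).start()

        def increment(self):
            with self._lock:
                self._count += 1

        def _flush_loop(self):
            while True:
                time.sleep(self._window_s)
                with self._lock:
                    count, self._count = self._count, 0
                self._report(f"{self._name}: {count} in last {self._window_s}s")

    db_reconnects = WindowedCounter("db.reconnect.attempts", window_s=10)
    # Call db_reconnects.increment() from the reconnect path; a sudden rise in
    # the reported number is the early warning our hypothetical system lacked.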

More difficult is the creation of component MTBF estimates. These are driven, in part, by telemetry, lived experience, and tolerance testing through artificial load generation. Each of these is time-consuming and not exactly cheap, leading us to our last conceptual stage.
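The arithmetic behind a rough MTBF estimate is simple even if gathering the inputs is not; the fleet size, window, and failure count below are invented for illustration.

    # Rough MTBF from observed history: total operating hours divided by
    # the number of failures recorded over the observation window.
    fleet_size = 40        # hosts running the component (assumed)
    observed_days = 90     # observation window (assumed)
    failures_seen = 6      # failures recorded by telemetry (assumed)

    operating_hours = fleet_size * observed_days * 24
    mtbf_hours = operating_hours / failures_seen
    print(f"MTBF estimate: {mtbf_hours:,.0f} hours")   # 14,400 hours for these numbers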

4: Organization

No technical system exists in a vacuum. Each is influenced by the human organization that created and sustains it. A finely built machine without a supporting organization is a disaster waiting to happen.

The Chernobyl disaster, the loss of the space shuttle Challenger, the 1980 Damascus Incident -- each of these is centered on the failure of the human organization surrounding the system to sustain maintenance efforts or to keep from exceeding the technical system's safety tolerances in pursuit of political goals. Boiling water reactors are inherently unsafe but, treated as dangerous, can be maintained over long periods. Chernobyl's safety guidelines were gradually relaxed to serve the needs of ambitious plant managers. In the Damascus Incident, missile safety standards were gradually relaxed through misplaced confidence bred of familiarity with the missile and a lack of strict oversight of its regular maintenance.

To sustain a system against faults, its surrounding organization must be capable of supporting work to correct the conditions that allowed mistakes, as well as correcting the mistakes themselves. Blast doors get propped open out of laziness, but also under pressure for faster refueling. The organization must also be willing and able to support experts in the system, building tools to make their work more effective and creating processes around the system that accept their feedback. A strict separation of concerns -- billing is responsible only for sustaining billing, not also, say, push notifications -- helps guide the political system away from dancing on the safety-tolerance line.

Ultimately, no matter how hard engineers work, a system cannot be constructed to operate reliably without buy-in from the organization that commissioned it. This work cannot be done in isolation, through force of will, or by technical capability alone.
