Too Big Not to Fail

Published: April 30, 2011

Safety systems can sometimes create the problems they’re meant to prevent.

The recent string of serious nuclear accidents in Japan that followed the earthquake and tsunami on March 11 has again raised important questions about the reliability of high-risk technologies that have been woven into the fabric of our everyday lives. If humans are so technologically adept, why do these disasters still occur? Just how safe are we? Complex technological systems like nuclear power plants, aircraft, and offshore oil rigs are useful when they operate as intended, but when they fail (and history shows that they do so persistently), they have the unique potential to amplify danger and threaten the lives of hundreds, and sometimes thousands, of people.

As a general rule, governments and corporations responsible for building and operating these systems downplay the possibility of failure, and overstate their ability to manage disasters if they do occur. Citing extensive training, high-level engineering, and government regulations, authorities routinely describe these systems as “very safe” – which is a sneaky way of avoiding saying that they are actually still “a little unsafe.”

One way that authorities rationalize living with high-risk technologies is by fostering faith in automatic safety systems and redundant backup systems that are in place to prevent catastrophes. In the unlikely event of a failure in a primary system, these backups and fail-safe components are supposed to take over and save the day. Or so the story goes.

However, as the unfolding situation in Japan has once again shown, our trust in safety and backup systems is naïve, at best. These recent nuclear accidents – the most publicized being the explosions and radiation leaks at the Fukushima Daiichi nuclear power plant – show that systems don’t always operate as intended, and that safety systems sometimes have the ironic effect of enabling or worsening dangerous situations. Since people are regularly coerced into betting their lives on the assumption that technological systems will function as designed, it is important to investigate instances when these safeguards have failed.

The history of technological failure shows that safety systems in high-risk technologies have contributed to accidents a number of times, and in a number of ways: through their design, by adding complexity, and through simple component failure.

When the earthquake and accompanying tsunami struck the Fukushima Daiichi nuclear power plant, two important safety features failed. First, the initial magnitude-9.0 earthquake triggered automatic safety systems that shut down the chain reactions in each of the plant’s three operational units, but the violent shaking also disconnected the plant from the country’s main electrical power grid. This caused on-site backup generators to take over powering the crucial cooling of the extremely hot fuel rods.

Then, less than an hour later, a 14-metre-high tsunami wave swept over the seawall, which had been built as a safety feature, but was only designed to protect the plant from waves measuring six metres or less. The wave flooded the plant, disabling the backup generators that were powering the cooling process and control systems. Like a boiling kettle taken off a burner and then plugged up, the reactor’s core needed to undergo a lengthy cooling process to avoid a pressure buildup and explosion. With the emergency cooling system defeated by the protective seawall’s limited design, the water intended to cool the core instead heated up and boiled off into radioactive steam, which then needed to be vented. Explosions later occurred at units one and three – the exact causes of which are still uncertain.

This example shows that it is impossible to design safety systems that take all possible failure scenarios into account. The world is far more inventive than we are. This is especially true when multiple failures – in this case, the failures of the primary electrical system and the seawall – interact in a complex system. The seawall safety feature was not designed to counter a disaster of this magnitude, and the backup power systems were not designed with the complex effects of this second failure in mind.

Another way that safety systems contribute to dangerous accidents is by making highly interactive systems even more complex, creating more chances for a failure to occur, and more ways for it to spread. Complex systems like airplanes and nuclear power plants are non-linear. They are set up so that thousands of components interact quickly, in various sequences, with many other components, and in many combinations. Failures in these types of systems – even trivial ones – can propagate throughout the systems, triggering problems in seemingly unrelated areas, and making the systems act in incomprehensible and unanticipated ways. Sometimes, these trivial failures begin in systems added for safety or redundancy, and lead to much larger problems elsewhere. In fact, this is what initiated the Three Mile Island nuclear accident in 1979.

The accident at Three Mile Island started with a trivial failure in the secondary cooling system. Moisture from the cooling system leaked into a pneumatic control system, which accidentally (and unnecessarily) triggered an automatic safety device that shut down pumps supplying water to cool the reactor. This triggered another safety device – an emergency-backup pump system – to take over. Unfortunately, two valves that were part of this system had been left in the “closed” position, and an obscured indicator light on the plant’s massive control board prevented operators from knowing this.

The operators could see that the backup pumps were working, but they didn’t know that the pumps were pushing water into closed pipes. With no water circulating to carry heat away from the core, the reactor automatically shut down. However, the core still needed to be cooled, and, with the cooling system offline, rising pressure in the reactor caused another automatic safety device, called a pilot-operated relief valve, to open and release it.

However, this safety device failed to reseat (to close once the pressure had dropped), and cooling water continued to drain out of the reactor. In the control room, an indicator light, which was supposed to warn operators if the relief valve failed to close, failed to activate, misleading the operators into thinking everything was all right.

All of these failures occurred within a matter of seconds, and created a situation that the designers never anticipated, and that was incomprehensible to the operators in the moment. This illustrates how trivial problems can quickly spread in complex systems, and how features added for safety can, once a failure begins to spread, actually amplify the danger.

Besides problems with design and complexity, safety systems are also subject to mundane reliability issues. Take the example of the faulty indicator light mentioned above. A warning light failed, keeping operators in the dark about the stuck valve. Instruments like these are essential to monitoring and operating large technological systems, but they can also be deceptive. As seen here, they can withhold information or, conversely, they can erroneously display a warning when, in fact, everything is normal. Conflicting data from different instruments is also a problem: If one instrument registers a problem while another indicates everything is OK, how can an operator know which one is correct? In the midst of quickly unfolding accidents, instrument failures only add to the confusion and lead to further mistakes.

Obviously, safety devices and redundant backup systems are needed. They have saved countless lives. But that’s not a good reason to trust them unconditionally. Too often, they are invoked uncritically in official rhetoric in order to downplay the risks that silently remain. The mere existence of safety systems is no guarantee that they will be adequate for all possible scenarios, or that they will function as intended once a baffling system accident begins to unfold.

Photo courtesy of Reuters.