How Risk Assessment Fails

This entry is part 2 of 8 in the series Risk Assessment

[Image: Fukushima Dai Ichi Power Plant after the explosions]

The events unfolding at Japan’s Fukushima Dai Ichi nuclear power plant are a case study in the ways that the risk assessment process can fail or be abused. In an article published on Bloomberg.com, Jason Clenfield itemizes decades of fraud and failures in engineering and administration that have led to the catastrophic failure of four of six reactors at the 40-year-old Fukushima plant. Clenfield’s article, ‘Disaster Caps Faked Reports’, goes on to cover similar failures in the Japanese nuclear sector.

Most people believe that the more serious the public danger, the more carefully the risks are considered in the design and execution of projects like the Fukushima plant. Clenfield’s article points to failures by a number of major international businesses involved in the design and manufacture of components for these reactors that may have contributed to the catastrophe playing out in Japan. In some cases, the correct actions could have bankrupted the companies involved, so rather than risk financial failure, the companies covered up the problems and rewarded the workers involved for their efforts. As you will see, sometimes the degree of care that we have a right to expect is not the level of care that is used.

How does this relate to the fail­ure and abuse of the risk assess­ment pro­cess? Read on!

Risk Assessment Failures

[Image: Earthquake and tsunami damage, Fukushima Dai Ichi Power Plant]

The Fukushima Dai Ichi nuclear plant was constructed in the late 1960s and early 1970s, with Reactor #1 going on-line in 1971. The reactors at this facility use ‘active cooling’, requiring electrically powered cooling pumps to run continuously to keep the core temperatures in the normal operating range. As you will have seen in recent news reports, the plant is located on the shore, drawing water directly from the Pacific Ocean.

Learn more about Boiling Water Reactors used at Fukushima.

Read IEEE Spectrum’s “24-​Hours at Fukushima”, a blow-​by-​blow account of the first 24 hours of the disaster.

Japan is located along one of the most active fault lines in the world, with plate subduction rates exceeding 90 mm/year. Earthquakes are so commonplace in this area that the Japanese people consider Japan to be the ‘land of earthquakes’, starting earthquake safety training in kindergarten.

Japan is the country that created the word ‘tsunami’ because the effects of sub-sea earthquakes often include large waves that swamp the shoreline. These waves affect all countries bordering the world’s oceans, but are especially prevalent where strong earthquakes are frequent.

In this environment it would be reasonable to expect that earthquake and tsunami effects would merit the highest level of consideration when assessing the risks related to these hazards. Since risk is a function of the severity of the consequence and the probability of its occurrence, the risk assessed from earthquake and tsunami should have been critical. Loss of cooling can result in the catastrophic overheating of the reactor core, potentially leading to a core meltdown.
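The relationship between severity, probability and risk can be sketched in a few lines of Python. This is only an illustration: the four-step scales, the multiplication rule and the thresholds below are assumptions made for the example, not values taken from any standard or from the Fukushima assessments.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical four-step severity scale (illustrative only)."""
    MINOR = 1
    SERIOUS = 2
    CRITICAL = 3
    CATASTROPHIC = 4


class Probability(IntEnum):
    """Hypothetical four-step probability scale (illustrative only)."""
    REMOTE = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4


def risk_score(severity: Severity, probability: Probability) -> int:
    """Combine the two parameters; a simple product is assumed here."""
    return int(severity) * int(probability)


def risk_level(score: int) -> str:
    """Map a score to a label using assumed thresholds."""
    if score >= 12:
        return "critical"
    if score >= 6:
        return "high"
    if score >= 3:
        return "medium"
    return "low"


# A core meltdown is catastrophic; a large tsunami on this coast is at
# least possible. Under the assumed scales the result is "critical".
meltdown = risk_level(risk_score(Severity.CATASTROPHIC, Probability.POSSIBLE))
```

Note that with a scheme like this, a catastrophic severity combined with a ‘remote’ probability scores only 4, or ‘medium’, which is exactly how a high-severity hazard can end up being discounted.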

The Fukushima Dai Ichi plant was designed to with­stand 5.7 m tsunami waves, even though a 6.4 m wave had hit the shore close by 10 years before the plant went on-​line. The wave gen­er­ated by the recent earth­quake was 7 m. Although the plant was not washed away by the tsunami, the wave cre­ated anoth­er problem.

Now consider that the reactors require constant forced cooling using electrically powered pumps. The backup generators, installed to ensure that the cooling pumps remain operational even if mains power to the plant is lost, are located in a basement subject to flooding. When the tsunami hit the seawall and spilled over the top, the floodwaters poured into the backup generator room, knocking out the diesel backup generators. The cooling system stopped. With no power to run the pumps, the reactor cores began to overheat. Although the reactors survived the earthquakes and the tsunami, without power to run the pumps the plant was in trouble.

Learn more about the accident.

Clearly there was a failure of reason when assessing the risks related to the loss of cooling capability in these reactors. With systems that are as mission critical as these are, multiple levels of redundancy beyond a single backup system are often the minimum required.

In another plant in Japan, a section of piping carrying superheated steam from the reactor to the turbines ruptured, injuring a number of workers. The pipe was installed when the plant was new and had never been inspected since installation because it was left off the safety inspection checklist. This is an example of a failure that resulted from blindly following a checklist without looking at the larger picture. There can be no doubt that someone at the plant noticed that other pipe sections were inspected regularly, but that this particular section was skipped, yet no changes in the process resulted.

Here again, the risk was not recog­nized even though it was clearly under­stood with respect to oth­er sec­tions of pipe in the same plant.

In anoth­er situ­ation at a nuc­le­ar plant in Japan, drains inside the con­tain­ment area of a react­or were not plugged at the end of the install­a­tion pro­cess. As a res­ult, a small spill of radio­act­ive water was released into the sea instead of being prop­erly con­tained and cleaned up. The risk was well under­stood, but the con­trol pro­ced­ure for this risk was not implemented.

Finally, the Kashiwazaki Kariwa plant was con­struc­ted along a major fault line. The design­ers used fig­ures for the max­im­um seis­mic accel­er­a­tion that were three times lower than the accel­er­a­tions that could be cre­ated by the fault. Regulators per­mit­ted the plant to be built even though the rel­at­ive weak­ness of the design was known.

Failure Modes

I believe that there are a num­ber of reas­ons why these kinds of fail­ures occur.

People have a difficult time appreciating the meaning of probability. Probability is a key factor in determining the degree of risk from any hazard, yet when figures like ‘1 in 1000’ or ‘1 × 10⁻⁵ occurrences per year’ are discussed, it’s hard for people to truly grasp what these numbers mean. Likewise, when more subjective scales are used it can be difficult to really understand what ‘likely’ or ‘rarely’ actually mean.

Consequently, even where the severity may be very high, the risk related to a particular hazard may be neglected: the probability seems low, so the risk is deemed to be low.

When probability is discussed in terms of time, a figure like ‘1 × 10⁻⁵ occurrences per year’ can make the chance of an occurrence seem distant, and therefore less of a concern.
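One way to make an annual figure less abstract is to convert it into the chance of at least one occurrence over the service life of the installation. The sketch below assumes that each year is independent and that the annual probability stays constant, which is a simplification; the function name is made up for this example.

```python
def probability_over_period(annual_probability: float, years: int) -> float:
    """Chance of at least one occurrence in `years`.

    Assumes independent years with a constant annual probability
    (a simplifying assumption, not a claim about any real hazard).
    """
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    if years < 0:
        raise ValueError("years must not be negative")
    # P(at least one) = 1 - P(none in any year)
    return 1.0 - (1.0 - annual_probability) ** years


# Figures from the text, applied to a 40-year plant life
rare = probability_over_period(1e-5, 40)      # about 0.0004, i.e. 0.04 %
less_rare = probability_over_period(1e-3, 40)  # about 0.039, i.e. 3.9 %
```

Seen this way, a ‘1 in 1000 per year’ event has close to a 4 % chance of happening at least once during the 40 years that a plant like Fukushima has been in service.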

Most risk assessment approaches deal with hazards singly. This is done to simplify the assessment process, but it means that the effects of multiple failures or cascading failures can be missed. In a multiple failure condition, several protective measures fail simultaneously from a single cause (sometimes called Common Cause Failure). In this case, back-up measures may fail from the same cause, resulting in no protection from the hazard.
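The effect of a common cause on a redundant pair can be illustrated with a simplified form of the beta-factor model used in reliability engineering, in which a fraction β of each channel’s failures is assumed to come from a cause shared by both channels. The numbers below are illustrative assumptions, not data from any plant.

```python
def p_both_fail_independent(p: float) -> float:
    """Two redundant channels that fail independently, each with probability p."""
    return p * p


def p_both_fail_beta(p: float, beta: float) -> float:
    """Simplified beta-factor estimate for a redundant pair.

    beta is the assumed fraction of each channel's failures that stem
    from a shared cause (e.g. both generators in the same basement).
    """
    if not 0.0 <= p <= 1.0 or not 0.0 <= beta <= 1.0:
        raise ValueError("p and beta must be between 0 and 1")
    independent_part = ((1.0 - beta) * p) ** 2  # both fail for separate reasons
    common_part = beta * p                      # one shared cause takes out both
    return independent_part + common_part


# Illustrative values: each channel fails on demand 1 time in 100,
# and 10 % of those failures are assumed to share a cause.
naive = p_both_fail_independent(0.01)
with_common_cause = p_both_fail_beta(0.01, 0.10)
```

With these assumed values the pair is roughly ten times more likely to fail than the independent calculation suggests, and nearly all of that comes from the shared cause. Adding more channels helps very little unless the common cause itself is removed.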

In a cas­cad­ing fail­ure, an ini­tial fail­ure is fol­lowed by a series of fail­ures res­ult­ing in the par­tial or com­plete loss of the pro­tect­ive meas­ures, res­ult­ing in par­tial or com­plete expos­ure to the haz­ard. Reasonably fore­see­able com­bin­a­tions of fail­ure modes in mis­sion crit­ic­al sys­tems must be con­sidered and the prob­ab­il­ity of each estimated.

Combinations of hazards can result in synergy between the hazards, producing a higher level of severity than is present from any one of the hazards taken singly. Reasonably foreseeable combinations of hazards and their potential synergies must be identified and the risk estimated.

Oversimplification of the haz­ard iden­ti­fic­a­tion and ana­lys­is pro­cesses can res­ult in over­look­ing haz­ards or under­es­tim­at­ing the risk.

Thinking about the Fukushima Dai Ichi plant again, the com­bin­a­tion of the effects of the earth­quake on the plant, with the added impact of the tsunami wave, res­ul­ted in the loss of primary power to the plant fol­lowed by the loss of backup power from the backup gen­er­at­ors, and the sub­sequent par­tial melt­downs and explo­sions at the plant. This com­bin­a­tion of earth­quake and tsunami was well known, not some ‘unima­gin­able’ or ‘unfore­see­able’ situ­ation. When con­duct­ing risk assess­ments, all reas­on­ably fore­see­able com­bin­a­tions of haz­ards must be considered.

Abuse and neglect

The risk assessment process is subject to abuse and neglect. Risk assessment has been used by some as a means to justify exposing workers and the public to risks that should not have been permitted. Skewing the results of the risk assessment, either by underestimating the risk initially, or by overestimating the effectiveness and reliability of control measures, can lead to this situation. Decisions relating to the ‘tolerability’ or the ‘acceptability’ of risks when the severity of the potential consequences is high should be approached with great caution. In my opinion, unless you are personally willing to take the risk you are proposing to accept, it cannot be considered either tolerable or acceptable, regardless of the legal limits that may exist.

In the case of the Japanese nuc­le­ar plants, the oper­at­ors have pub­licly admit­ted to falsi­fy­ing inspec­tion and repair records, some of which have res­ul­ted in acci­dents and fatalities.

In 1990, the US Nuclear Regulatory Commission wrote a report on the Fukushima Dai Ichi plant that predicted the exact scenario that resulted in the current crisis. These findings were shared with the Japanese authorities and the operators, but no one in a position of authority took the findings seriously enough to do anything. Relatively simple and low-cost protective measures, like increasing the height of the protective sea wall along the coastline and moving the backup generators to high ground, could have prevented a national catastrophe and the complete loss of the plant.

A Useful Tool

Despite these human fail­ings, I believe that risk assess­ment is an import­ant tool. Increasingly soph­ist­ic­ated tech­no­logy has rendered ‘com­mon sense’ use­less in many cases, because people do not have the expert­ise to have any com­mon sense about the haz­ards related to these technologies.

Where haz­ards are well under­stood, they should be con­trolled with the simplest, most dir­ect and effect­ive meas­ures avail­able. In many cases this can be done by the people who first identi­fy the hazard.

Where haz­ards are not well under­stood, bring­ing in experts with the know­ledge to assess the risk and imple­ment appro­pri­ate pro­tect­ive meas­ures is the right approach.

The common aspect in all of this is the identification of hazards and the application of some sort of control measures. Risk assessment should not be neglected simply because it is sometimes difficult, because it can be done poorly, or because the results are sometimes ignored. We need to improve what we do with the results of these efforts, rather than neglect to do them at all.

In the meantime, the Japanese, and the world, have some cleanup to do.

CSA Z1002 Public Review – Only 15 days left!

Only 15 days remain to get your thoughts sub­mit­ted on the draft of CSA Z1002. Do it now!

Today is Friday 4-​Mar-​2011, mark­ing 45 days into the pub­lic review peri­od for CSA Z1002 — Occupational Health and Safety Hazard Identification and Elimination and Risk Assessment and Control.

If you down­loaded the draft from the CSA web site, remem­ber that the PDF will lock on 17-​Mar, and you will no longer be able to do any­thing with it. If you haven’t looked at it yet, NOW IS THE TIME! Comments must also be sub­mit­ted by the 17th, so please sub­mit them as soon as pos­sible. No sub­mis­sions will be accep­ted after the 17th of March!

If you don’t have the draft already, get it here. Comments can be sub­mit­ted in the same place as you down­load the draft. DO NOT SUBMIT COMMENTS TO THIS BLOG.

If you need more inform­a­tion on the draft or on sub­mis­sion of com­ments, please con­tact the CSA Project Manager, Ms. Elizabeth Rankin, elizabeth.rankin’at’, +1 (416) 747‑2011.

Reader Question: Multiple E-​Stops and Resets

[Image: Control Panel with Emergency Stop Button]

I had an interesting question come in from a reader today that is relevant to many situations:

“When you have multiple E-Stop buttons I have often gotten into an argument that says you can have a reset beside each one. I was taught that you were required to have a single point of reset. Who is correct?”

— Michael Barb, Sr. Electrical Engineer

The Short Answer

There is noth­ing in the EU, US or Canadian reg­u­la­tions that would for­bid hav­ing mul­tiple reset but­tons. However, you must under­stand the over­lap­ping require­ments for emer­gency stop and pre­ven­tion of unex­pec­ted start-up.

The Long Answer

First I need to define two dif­fer­ent types of reset for clarity:

  1. Emergency Stop Device Reset: Each e-stop device (button, pull cord, foot switch, etc.) is required to latch in the activated state and must be individually reset. Resetting the e-stop device is NOT permitted to re-start the machinery, only to permit restarting (NFPA 79, CSA Z432, ISO 14118).
  2. Machine Restart: Restarting the machine is a separate, deliberate action from resetting the emergency stop device(s).

ANSI B11-​2008 provides some dir­ect guid­ance on this topic:

7.2.2 Zones

A machine or an assembly of machines may be divided into several control zones (e.g., for emergency stopping, stopping as a result of safeguarding devices, start-up, isolation or energy dissipation). The machine and controls in different zones shall be defined and identified. Controls for machines in zones can be local for each machine, across several machines in a zone, or globally for machines across zones. The control requirements shall be based on the operational requirements and on the risk assessment. The interfaces between zones, including synchronization and independent operation, shall be designed such that no function in one zone creates a hazard(s) / hazardous situation in another zone.

CSA Z432-​04 has sim­il­ar wording:

When zones can be determ­ined, their delim­it­a­tions shall be evid­ent (includ­ing the effect of the asso­ci­ated emer­gency stop device). This shall also apply to the effect of isol­a­tion and energy dissipation.

Let’s take a case with a single e-​stop but­ton first. The same require­ments apply for all e-​stop devices. The require­ments include:

  1. Button must be in ‘easy-​reach’ of the nor­mal oper­at­or pos­i­tion. I con­sider ‘easy-​reach’ to be the range I can touch while sit­ting or stand­ing at the nor­mal oper­at­or pos­i­tion. This pos­i­tion is not neces­sar­ily in front of the con­trol pan­el. This is the pos­i­tion where the oper­at­or is expec­ted to be while car­ry­ing out the tasks expec­ted of them when the machine is oper­at­ing. This is the require­ment that drives hav­ing mul­tiple but­tons in most cases.
  2. E-​stop devices can­not be loc­ated so that the oper­at­or must reach over or past a haz­ard to activ­ate them.
  3. The but­ton must latch in the oper­ated position.
  4. The button must be robust enough to handle the mechanical and electrical stresses that will be placed on it when used, i.e., rugged buttons are required.
  5. When the e-stop device is reset (i.e., returned to the ‘RUN’ position) the machine is NOT permitted to restart. It is only PERMITTED to restart. It must be restarted through another deliberate action, like pressing a ‘Power On’ button.
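The latch, reset and restart behaviour described in the list above can be modelled as a small state machine. The sketch below is only a way of reasoning about the logic; the class and method names are invented for this example, and a real emergency stop function is implemented in hardwired or safety-rated control components, not in ordinary application code like this.

```python
class EStopCircuit:
    """Toy model of latch / reset / restart logic (not safety-rated code)."""

    def __init__(self, device_names):
        # Each e-stop device latches individually when actuated.
        self._latched = {name: False for name in device_names}
        self.enabled = False   # safety circuit reset ('Power On') state
        self.running = False   # machine motion state

    def press(self, name):
        """Actuating any device latches it, drops power and stops the machine."""
        if name not in self._latched:
            raise KeyError(f"unknown e-stop device: {name}")
        self._latched[name] = True
        self.enabled = False
        self.running = False

    def reset_device(self, name):
        """Resetting a device only unlatches it; it never restarts anything."""
        if name not in self._latched:
            raise KeyError(f"unknown e-stop device: {name}")
        self._latched[name] = False

    def all_devices_reset(self):
        return not any(self._latched.values())

    def power_on(self):
        """Deliberate reset of the safety circuit; refused while any device is latched."""
        if self.all_devices_reset():
            self.enabled = True
        return self.enabled

    def start(self):
        """Separate deliberate start command; only works once the circuit is enabled."""
        if self.enabled:
            self.running = True
        return self.running
```

A short walk through the sequence: after `press("cell")` the machine stops; `reset_device("cell")` leaves it stopped and disabled; only `power_on()` followed by `start()` puts it back in motion. Whether `power_on()` is offered at one location or several is the question the rest of this answer deals with.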

So what do you do with the ‘POWER ON’ or safety circuit reset button? The first question to ask is: ‘What happens when I reset this circuit, applying power to the control circuits?’

Case A: If it is impossible to see the entire machine from the loc­a­tion of the reset but­ton, then I would recom­mend a single reset but­ton loc­ated at the HMI or main con­sole. The oper­at­or must check to make sure the machine is clear before re-​applying power. Where the machine is too big to be com­pletely vis­ible from the main oper­at­or con­sole, then I would also recommend:

  • warn­ing horn, 
  • warn­ing lights, and 
  • a start-​up delay that is long enough to allow a per­son to get clear of the machine before it starts moving.

Case B: If the machine is simply ‘enabled’ at this point, but no motion occurs, then mul­tiple ‘reset’ or ‘power on’ but­tons may be accept­able, depend­ing on the out­come of the risk assess­ment and start/​stop ana­lys­is. Having said that, the oper­at­or will likely have to return to a main con­sole to reset the machine and restart oper­a­tion, and chances are there is only one HMI screen on the machine, so there may not be any advant­age to hav­ing mul­tiple reset buttons.

I would recom­mend doing two things to get a good handle on this: Conduct a detailed risk assess­ment and include all nor­mal oper­a­tions and all main­ten­ance oper­a­tions. Then con­duct a start/​stop ana­lys­is to look at all of the start­ing and stop­ping con­di­tions that you can reas­on­ably fore­see. Combine the res­ults of these two ana­lyses to find the start­ing and stop­ping con­di­tions with the highest risk, and then determ­ine if hav­ing mul­tiple reset but­tons will con­trib­ute to the risk or not. You may also want to look at the con­trol reli­ab­il­ity require­ments for the emer­gency stop sys­tem based on the out­come of the risk assess­ment and the start/​stop analysis.

In a case where there are multiple emergency stop devices, locations are important. There must be one at each normal workstation, within ‘easy reach’, to meet the regulatory requirements in most jurisdictions. You may also want some inside the machine if it is possible to gain full body access inside the machinery, e.g., inside a robot work cell. Make sure that the buttons or other devices are located so that a person exposed to the hazard(s) inside the machine is not required to reach over or past the hazard to get to the button.

Michael, I hope that settles the argument!