Emergency Stop Failures

This entry is part 13 of 13 in the series Emer­gency Stop

I am always look­ing for inter­est­ing exam­ples of machin­ery safe­ty prob­lems to share on MS101. Recent­ly I was scrolling Reddit/r/OSHA and found these three real-world exam­ples.

Broken Emergency Stop Buttons

The first and most obvious kind of failure results from wear-out of, or damage to, emergency stop devices like e-stop buttons or pull cords. Here’s a great example:

Won’t be stop­ping this ele­va­tor any­time soon. from OSHA

The oper­a­tor device in this pic­ture has two prob­lems:

1) the but­ton oper­a­tor has failed and

2) the e-stop is incor­rect­ly marked.

The cor­rect mark­ing would be a yel­low back­ground in place of the red/silver leg­end plate, like the exam­ple below. The yel­low back­ground could have the words “emer­gency stop” on it, but this is not nec­es­sary as the colour com­bi­na­tion is enough.

Yellow circular legend plate with the words "emergency stop" in black letters. Fits A-B 800T pushbutton operators.
Allen-Bradley 800T Emer­gency Stop leg­end plate

There is an ISO/IEC sym­bol for an emer­gency stop that could also be used [1].

Emergency stop symbol. A circle containing an equilateral triangle pointing downward, containing an exclamation mark.
Emer­gency Stop Sym­bol IEC 60417–5638 [1]
I won­der how the con­tact block(s) inside the enclo­sure are doing? Con­tact blocks have been known to fall off the back of emer­gency stop oper­a­tor but­tons, leav­ing you with a but­ton that does noth­ing when pressed. Con­tact blocks secured with screws are most vul­ner­a­ble to this kind of fail­ure. Los­ing a con­tact block like this hap­pens most often in high-vibra­tion con­di­tions. I have run across this in real life while doing inspec­tions on client sites.

There are contact blocks made to detect this kind of failure, like Allen-Bradley’s self-monitoring contact block, 800TC-XD4S, or the similar Siemens product, 3SB34. Most control component manufacturers are likely to offer similar components.

Here’s anoth­er exam­ple from a machine inspec­tion I did a while ago. Note the wire “keep­er” that pre­vents the but­ton from get­ting lost!


Instal­la­tion Fail­ures

Here is an exam­ple of poor plan­ning when installing new bar­ri­er guards. The emer­gency stop but­ton is now out of reach. The orig­i­nal poster does not indi­cate a rea­son why the emer­gency stop for the machine he was oper­at­ing was mount­ed on a dif­fer­ent machine.

sure hope i nev­er need to hit that emer­gency stop but­ton. its for the machine on my side of the new fence. from OSHA

No Emergency Stop at all

Finally, and possibly the worst example of all, here is an improvised emergency stop using a set of wire cutters. No further comment required.

Emer­gency stop but­ton. from OSHA

If you have any exam­ples you would like to share, feel free to add them in com­ments below. Ref­er­ences to par­tic­u­lar employ­ers or man­u­fac­tur­ers will be delet­ed before posts are approved.

References

[1]     “IEC 60417–5638, Emer­gency Stop”, Iso.org, 2017. [Online]. Avail­able: https://www.iso.org/obp/ui/#iec:grs:60417:5638. [Accessed: 27- Jun- 2017].

ISO 13849–1 Analysis — Part 6: CCF — Common Cause Failures

This entry is part 6 of 9 in the series How to do a 13849–1 analy­sis

What is a Common Cause Failure?

There are two sim­i­lar-sound­ing terms that peo­ple often get con­fused: Com­mon Cause Fail­ure (CCF) and Com­mon Mode Fail­ure. While these two types of fail­ures sound sim­i­lar, they are dif­fer­ent. A Com­mon Cause Fail­ure is a fail­ure in a sys­tem where two or more por­tions of the sys­tem fail at the same time from a sin­gle com­mon cause. An exam­ple could be a light­ning strike that caus­es a con­tac­tor to weld and simul­ta­ne­ous­ly takes out the safe­ty relay proces­sor that con­trols the con­tac­tor. Com­mon cause fail­ures are there­fore two dif­fer­ent man­ners of fail­ure in two dif­fer­ent com­po­nents, but with a sin­gle cause.

Com­mon Mode Fail­ure is where two com­po­nents or por­tions of a sys­tem fail in the same way, at the same time. For exam­ple, two inter­pos­ing relays both fail with weld­ed con­tacts at the same time. The fail­ures could be caused by the same cause or from dif­fer­ent caus­es, but the way the com­po­nents fail is the same.

Com­mon-cause fail­ure includes com­mon mode fail­ure, since a com­mon cause can result in a com­mon man­ner of fail­ure in iden­ti­cal devices used in a sys­tem.

Here are the for­mal def­i­n­i­tions of these terms:

3.1.6 com­mon cause fail­ure CCF

fail­ures of dif­fer­ent items, result­ing from a sin­gle event, where these fail­ures are not con­se­quences of each oth­er

Note 1 to entry: Common cause failures should not be confused with common mode failures (see ISO 12100:2010, 3.36). [SOURCE: IEC 60050-191-am1:1999, 04–23.] [1]

 

3.36 com­mon mode fail­ures

fail­ures of items char­ac­ter­ized by the same fault mode

NOTE Common mode failures should not be confused with common cause failures, as the common mode failures can result from different causes. [IEV 191–04-24] [3]

The “com­mon mode” fail­ure def­i­n­i­tion uses the phrase “fault mode”, so let’s look at that as well:

fail­ure mode
DEPRECATED: fault mode
man­ner in which fail­ure occurs

Note 1 to entry: A fail­ure mode may be defined by the func­tion lost or oth­er state tran­si­tion that occurred. [IEV 192–03-17] [17]

As you can see, “fault mode” is no longer used, in favour of the more com­mon “fail­ure mode”, so it is pos­si­ble to re-write the com­mon-mode fail­ure def­i­n­i­tion to read, “fail­ures of items char­ac­terised by the same man­ner of fail­ure.”

Random, Systematic and Common Cause Failures

Why do we need to care about this? There are three manners in which failures occur: random failures, systematic failures, and common cause failures. When developing safety-related controls, we need to consider all three and mitigate them as much as possible.

Ran­dom fail­ures do not fol­low any pat­tern, occur­ring ran­dom­ly over time, and are often brought on by over-stress­ing the com­po­nent, or from man­u­fac­tur­ing flaws. Ran­dom fail­ures can increase due to envi­ron­men­tal or process-relat­ed stress­es, like cor­ro­sion, EMI, nor­mal wear-and-tear, or oth­er over-stress­ing of the com­po­nent or sub­sys­tem. Ran­dom fail­ures are often mit­i­gat­ed through selec­tion of high-reli­a­bil­i­ty com­po­nents [18].

Systematic failures include common-cause failures, and occur because of a human error that was not caught by procedural means. These failures are due to design, specification, operating, maintenance, and installation errors. When we look at systematic errors, we are looking for things like training of the system designers, or quality assurance procedures used to validate the way the system operates. Systematic failures are non-random and complex, making them difficult to analyse statistically. Systematic errors are a significant source of common-cause failures because they can affect redundant devices, and because they are often deterministic, occurring whenever a given set of circumstances exists.

Sys­tem­at­ic fail­ures include many types of errors, such as:

  • Man­u­fac­tur­ing defects, e.g., soft­ware and hard­ware errors built into the device by the man­u­fac­tur­er.
  • Spec­i­fi­ca­tion mis­takes, e.g. incor­rect design basis and inac­cu­rate soft­ware spec­i­fi­ca­tion.
  • Imple­men­ta­tion errors, e.g., improp­er instal­la­tion, incor­rect pro­gram­ming, inter­face prob­lems, and not fol­low­ing the safe­ty man­u­al for the devices used to realise the safe­ty func­tion.
  • Oper­a­tion and main­te­nance, e.g., poor inspec­tion, incom­plete test­ing and improp­er bypass­ing [18].

Diverse redun­dan­cy is com­mon­ly used to mit­i­gate sys­tem­at­ic fail­ures, since dif­fer­ences in com­po­nent or sub­sys­tem design tend to cre­ate non-over­lap­ping sys­tem­at­ic fail­ures, reduc­ing the like­li­hood of a com­mon error cre­at­ing a com­mon-mode fail­ure. Errors in spec­i­fi­ca­tion, imple­men­ta­tion, oper­a­tion and main­te­nance are not affect­ed by diver­si­ty.

Fig 1 below shows the results of a small study done by the UK’s Health and Safety Executive in 1994 [19] that supports the idea that systematic failures are a significant contributor to safety system failures. The study included only 34 systems (n=34), so the results cannot be considered conclusive. However, there were some startling results. As you can see, errors in the specification of the safety functions (Safety Requirement Specification) resulted in about 44% of the system failures in the study. Based on this small sample, systematic failures appear to be a significant source of failures.

Pie chart illustrating the proportion of failures in each phase of the life cycle of a machine, based on data taken from HSE Report HSG238.
Fig­ure 1 — HSG 238 Pri­ma­ry Caus­es of Fail­ure by Life Cycle Stage

Handling CCF in ISO 13849–1

Now that we under­stand WHAT Com­mon-Cause Fail­ure is, and WHY it’s impor­tant, we can talk about HOW it is han­dled in ISO 13849–1. Since ISO 13849–1 is intend­ed to be a sim­pli­fied func­tion­al safe­ty stan­dard, CCF analy­sis is lim­it­ed to a check­list in Annex F, Table F.1. Note that Annex F is infor­ma­tive, mean­ing that it is guid­ance mate­r­i­al to help you apply the stan­dard. Since this is the case, you could use any oth­er means suit­able for assess­ing CCF mit­i­ga­tion, like those in IEC 61508, or in oth­er stan­dards.

Table F.1 is set up with a series of mit­i­ga­tion mea­sures which are grouped togeth­er in relat­ed cat­e­gories. Each group is pro­vid­ed with a score that can be claimed if you have imple­ment­ed the mit­i­ga­tions in that group. ALL OF THE MEASURES in each group must be ful­filled in order to claim the points for that cat­e­go­ry. Here’s an exam­ple:

A portion of ISO 13849-1 Table F.1.
ISO 13849–1:2015, Table F.1 Excerpt

In order to claim the 20 points avail­able for the use of sep­a­ra­tion or seg­re­ga­tion in the sys­tem design, there must be a sep­a­ra­tion between the sig­nal paths. Sev­er­al exam­ples of this are giv­en for clar­i­ty.

Table F.1 lists six groups of mitigation measures. In order to claim adequate CCF mitigation, a minimum score of 65 points must be achieved. The CCF requirement applies only to Category 2, 3 and 4 architectures, but for those categories you cannot claim the PL without meeting it, regardless of whether the design meets the other criteria.
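The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the all-or-nothing group scoring and the 65-point threshold. The point values and the fulfilled/not-fulfilled flags below are placeholders I made up for the example, not a transcription of Table F.1, and the names `MeasureGroup`, `ccf_score` and `ccf_adequate` are hypothetical. Take the real measures and scores from the standard.

```python
from dataclasses import dataclass

CCF_MIN_SCORE = 65  # minimum total score quoted in the article


@dataclass
class MeasureGroup:
    name: str
    points: int                    # placeholder value, NOT taken from Table F.1
    measures_fulfilled: list[bool]  # one flag per measure in the group

    def score(self) -> int:
        # All-or-nothing: points count only if every measure is fulfilled.
        if self.measures_fulfilled and all(self.measures_fulfilled):
            return self.points
        return 0


def ccf_score(groups: list[MeasureGroup]) -> int:
    """Sum the points of every fully implemented group."""
    return sum(group.score() for group in groups)


def ccf_adequate(groups: list[MeasureGroup]) -> bool:
    """True when the total reaches the minimum score."""
    return ccf_score(groups) >= CCF_MIN_SCORE


# Hypothetical assessment: two groups are only partly implemented.
example_groups = [
    MeasureGroup("separation/segregation", 20, [True]),
    MeasureGroup("diversity", 20, [False]),
    MeasureGroup("design/application/experience", 20, [True, True]),
    MeasureGroup("assessment/analysis", 5, [True]),
    MeasureGroup("competence/training", 5, [True]),
    MeasureGroup("environmental", 30, [True, False]),
]
```

With these made-up inputs, the partly implemented groups contribute nothing, so the total falls short of 65 and the CCF requirement would not be met.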

One final note on CCF: If you are trying to review an existing control system, say in an existing machine, or in a machine designed by a third party where you have no way to determine the experience and training of the designers or the capability of the company’s change management process, then you cannot adequately assess CCF [8]. This fact is recognised in CSA Z432-16 [20], chapter 8, which allows the reviewer to simply verify that the architectural requirements, exclusive of any probabilistic requirements, have been met. This is particularly useful for engineers reviewing machinery under Ontario’s Pre-Start Health and Safety requirements [21], who are frequently working with less-than-complete design documentation.

In case you missed the first part of the series, you can read it here. In the next arti­cle in this series, I’m going to review the process flow for sys­tem analy­sis as cur­rent­ly out­lined in ISO 13849–1. Watch for it!

Book List

Here are some books that I think you may find help­ful on this jour­ney:

[0]     B. Main, Risk Assess­ment: Basics and Bench­marks, 1st ed. Ann Arbor, MI USA: DSE, 2004.

[0.1]  D. Smith and K. Simp­son, Safe­ty crit­i­cal sys­tems hand­book. Ams­ter­dam: Else­vier/But­ter­worth-Heine­mann, 2011.

[0.2]  Elec­tro­mag­net­ic Com­pat­i­bil­i­ty for Func­tion­al Safe­ty, 1st ed. Steve­nage, UK: The Insti­tu­tion of Engi­neer­ing and Tech­nol­o­gy, 2008.

[0.3]  Overview of techniques and measures related to EMC for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2013.

References

Note: This ref­er­ence list starts in Part 1 of the series, so “miss­ing” ref­er­ences may show in oth­er parts of the series. The com­plete ref­er­ence list is includ­ed in the last post of the series.

[1]     Safe­ty of machin­ery — Safe­ty-relat­ed parts of con­trol sys­tems — Part 1: Gen­er­al prin­ci­ples for design. 3rd Edi­tion. ISO Stan­dard 13849–1. 2015.

[2]     Safe­ty of machin­ery — Safe­ty-relat­ed parts of con­trol sys­tems — Part 2: Val­i­da­tion. 2nd Edi­tion. ISO Stan­dard 13849–2. 2012.

[3]      Safe­ty of machin­ery — Gen­er­al prin­ci­ples for design — Risk assess­ment and risk reduc­tion. ISO Stan­dard 12100. 2010.

[8]     S. Joce­lyn, J. Bau­doin, Y. Chin­ni­ah, and P. Char­p­en­tier, “Fea­si­bil­i­ty study and uncer­tain­ties in the val­i­da­tion of an exist­ing safe­ty-relat­ed con­trol cir­cuit with the ISO 13849–1:2006 design stan­dard,” Reliab. Eng. Syst. Saf., vol. 121, pp. 104–112, Jan. 2014.

[17]      “fail­ure mode”, 192–03-17, Inter­na­tion­al Elec­trotech­ni­cal Vocab­u­lary. IEC Inter­na­tion­al Elec­trotech­ni­cal Com­mis­sion, Gene­va, 2015.

[18]      M. Gen­tile and A. E. Sum­mers, “Com­mon Cause Fail­ure: How Do You Man­age Them?,” Process Saf. Prog., vol. 25, no. 4, pp. 331–338, 2006.

[19]     Out of Control—Why con­trol sys­tems go wrong and how to pre­vent fail­ure, 2nd ed. Rich­mond, Sur­rey, UK: HSE Health and Safe­ty Exec­u­tive, 2003.

[20]     Safe­guard­ing of Machin­ery. 3rd Edi­tion. CSA Stan­dard Z432. 2016.

[21]     O. Reg. 851, INDUSTRIAL ESTABLISHMENTS. Ontario, Cana­da, 1990.

Testing Emergency Stop Systems

This entry is part 11 of 13 in the series Emer­gency Stop

Emergency Stop on machine console

I’ve had a number of questions from readers regarding testing of emergency stop systems, particularly about the frequency of testing. I addressed the types of tests that might be needed in another article covering Checking Emergency Stop Systems. This article will focus on the frequency of testing rather than the types of tests.

The Problem

Emergency stop systems are considered to be “complementary protective measures” in key machinery safety standards like ISO 12100 [1], and CSA Z432 [2]; this makes emergency stop systems the backup to the primary safeguards. Complementary protective measures are intended to permit “avoiding or limiting the harm” that may result from an emergent situation. By definition, this is a situation that has not been foreseen by the machine builder, or is the result of another failure. This could be a failure of another safeguarding system, or a failure in the machine that is not controlled by other means, e.g., a workpiece shatters due to a material flaw, and the broken pieces damage the machine, creating new, uncontrolled, failure conditions in the machine.

Emergency stop systems are manually triggered, and usually infrequently used. The lack of use means that functional testing of the system doesn’t happen in the normal course of operation of the machinery. Some types of faults may occur and remain undetected until the system is actually used, e.g., contact blocks falling off the back of the operator device. Failure at that point may be catastrophic, since by implication the primary safeguards have already failed, and thus the failure of the backup eliminates the possibility of avoiding or limiting harm.

To under­stand the test­ing require­ments, it’s impor­tant to under­stand the risk and reli­a­bil­i­ty require­ments that dri­ve the design of emer­gency stop sys­tems, and then get into the test fre­quen­cy ques­tion.

Requirements

In the past, there were no explicit requirements for emergency stop system reliability. Details like the colour of the operator device, or the way the stop function worked were defined in ISO 13850 [3], NFPA 79 [4], and IEC 60204–1 [5]. In the soon-to-be published 3rd edition of ISO 13850, a new provision requiring emergency stop systems to meet at least PLc will be added [6], but until publication, it is up to the designer to determine the safety integrity level, either PL or SIL, required.

To determine the requirements for any safety function, the key is to start at the risk assessment. The risk assessment process requires that the designer understand the stage in the life cycle of the machine, the task(s) that will be done, and the specific hazards that a worker may be exposed to while conducting the task. This can become quite complex when considering maintenance and service tasks, and also applies to foreseeable failure modes of the machinery or the process.

The scoring or ranking of risk can be accomplished using any suitable risk scoring tool that meets the minimum requirements in [1]. There are some good examples given in ISO/TR 14121–2 [7] if you are looking for some guidance. There are many good engineering textbooks available as well. Have a look at our Book List for some suggestions if you want a deeper dive.

Reliability

Once the ini­tial unmit­i­gat­ed risk is under­stood, risk con­trol mea­sures can be spec­i­fied. Wher­ev­er the con­trol sys­tem is used as part of the risk con­trol mea­sure, a safe­ty func­tion must be spec­i­fied. Spec­i­fi­ca­tion of the safe­ty func­tion includes the Per­for­mance Lev­el (PL), archi­tec­tur­al cat­e­go­ry (B, 1–4), Mean Time to Dan­ger­ous Fail­ure (MTTFd), and Diag­nos­tic Cov­er­age (DC) [6], or Safe­ty Integri­ty Lev­el (SIL), and Hard­ware Fault Tol­er­ance (HFT), as described in IEC 62061 [8], as a min­i­mum. If you are unfa­mil­iar with these terms, see the def­i­n­i­tions at the end of the arti­cle.

Referring to Figure 1, the “Risk Graph” [6, Annex A], we can reasonably state that for most machinery, a failure mode or emergent condition is likely to create conditions where the severity of injury would require more than basic first aid, so selecting “S2” is the first step. In these situations, and particularly where the failure modes are not well understood, the highest level of severity of injury, S2, is selected because we don’t have enough information to expect that the injuries would only be minor. As soon as we make this selection, it is no longer possible to select any combination of Frequency or Probability parameters that will result in anything lower than PLc.

It’s impor­tant to under­stand that Fig­ure 1 is not a risk assess­ment tool, but rather a deci­sion tree used to select an appro­pri­ate PL based on the rel­e­vant risk para­me­ters. Those para­me­ters are:

Table 1 — Risk Para­me­ters
Severity of Injury | Frequency and/or Exposure to Hazard | Possibility of Avoiding Hazard or Limiting Harm
S1: slight (normally reversible injury) | F1: seldom-to-less-often and/or exposure time is short | P1: possible under specific conditions
S2: serious (normally irreversible injury or death) | F2: frequent-to-continuous and/or exposure time is long | P2: scarcely possible
Decision tree used to determine PL based on risk parameters.
Fig­ure 1 — “Risk Graph” for deter­min­ing PL
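
Because the risk graph is a decision tree over three two-valued parameters, it can be written as a small lookup table. The mapping below is my own transcription of the risk graph in [6, Annex A] and should be checked against the standard before use. The names `RISK_GRAPH` and `required_pl` are hypothetical.

```python
# Required performance level for each (severity, frequency, possibility)
# combination. Transcribed by hand from the risk graph; verify against
# the standard before relying on it.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}


def required_pl(severity: str, frequency: str, possibility: str) -> str:
    """Return the required performance level letter, 'a' through 'e'."""
    key = (severity.upper(), frequency.upper(), possibility.upper())
    if key not in RISK_GRAPH:
        raise ValueError(f"unknown risk parameters: {key}")
    return RISK_GRAPH[key]


# Every S2 branch, to illustrate the point made in the text.
s2_results = sorted(
    required_pl("S2", f, p) for f in ("F1", "F2") for p in ("P1", "P2")
)
```

In this table every S2 branch lands on c, d or e, which matches the statement above that selecting S2 rules out anything lower than PLc.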

PLc can be accom­plished using any of three archi­tec­tures: Cat­e­go­ry 1, 2, or 3. If you are unsure about what these archi­tec­tures rep­re­sent, have a look at my series cov­er­ing this top­ic.

Cat­e­go­ry 1 is sin­gle chan­nel, and does not include any diag­nos­tics. A sin­gle fault can cause the loss of the safe­ty func­tion (i.e., the machine still runs even though the e-stop but­ton is pressed). Using Cat­e­go­ry 1, the reli­a­bil­i­ty of the design is based on the use of high­ly reli­able com­po­nents and well-tried safe­ty prin­ci­ples. This approach can fail to dan­ger.

Cat­e­go­ry 2 adds some diag­nos­tic capa­bil­i­ty to the basic sin­gle chan­nel con­fig­u­ra­tion and does not require the use of “well-tried” com­po­nents. This approach can also fail to dan­ger.

Cat­e­go­ry 3 archi­tec­ture adds a redun­dant chan­nel, and includes diag­nos­tic cov­er­age. Cat­e­go­ry 3 is not sub­ject to fail­ure due to sin­gle faults and is called “sin­gle-fault tol­er­ant”. This approach is less like­ly to fail to dan­ger, but still can in the pres­ence of mul­ti­ple, unde­tect­ed, faults.

A key concept in reliability is the “fault”. This can be any kind of defect in hardware or software that results in unwanted behaviour or a failure. Faults are further broken down into dangerous and safe faults, meaning those that result in a dangerous outcome, and those that do not. Finally, each of these classes is broken down into detectable and undetectable faults. I’m not going to get into the mathematical treatment of these classes, but my point is this: there are undetectable dangerous faults. These are faults that cannot be detected by built-in diagnostics. As designers, we try to design the control system so that the undetectable dangerous faults are extremely rare; ideally, the probability should be much less than once in the lifetime of the machine.

What is the life­time of the machine? The stan­dards writ­ers have set­tled on a default life­time of 20 years, thus the answer is that unde­tectable dan­ger­ous fail­ures should hap­pen much less than once in twen­ty years of 24/7/365 oper­a­tion. So why does this mat­ter? Each archi­tec­tur­al cat­e­go­ry has dif­fer­ent require­ments for test­ing. The test rates are dri­ven by the “Demand Rate”. The Demand Rate is defined in [6]. “SRP/CS” stands for “Safe­ty Relat­ed Part of the Con­trol Sys­tem” in the def­i­n­i­tion:

3.1.30
demand rate (rd) — fre­quen­cy of demands for a safe­ty-relat­ed action of the SRP/CS

Each time the emer­gency stop but­ton is pressed, a “demand” is put on the sys­tem. Look­ing at the “Sim­pli­fied Pro­ce­dure for esti­mat­ing PL”, [6, 4.5.4], we find that the stan­dard makes the fol­low­ing assump­tions:

  • mis­sion time, 20 years (see Clause 10);
  • con­stant fail­ure rates with­in the mis­sion time;
  • for cat­e­go­ry 2, demand rate <= 1/100 test rate;
  • for category 2, MTTFD,TE larger than half of MTTFD,L.

NOTE When blocks of each chan­nel can­not be sep­a­rat­ed, the fol­low­ing can be applied: MTTFD of the sum­ma­rized test chan­nel (TE, OTE) larg­er than half MTTFD of the sum­ma­rized func­tion­al chan­nel (I, L, O).

So what does all that mean? The 20-year mis­sion time is the assumed life­time of the machin­ery. This num­ber under­pins the rest of the cal­cu­la­tions in the stan­dard and is based on the idea that few mod­ern con­trol sys­tems last longer than 20 years with­out being replaced or rebuilt. The con­stant fail­ure rate points at the idea that sys­tems used in the field will have com­po­nents and con­trols that are not sub­ject to infant mor­tal­i­ty, nor are they old enough to start to fail due to age, but rather that the sys­tem is oper­at­ing in the flat por­tion of the stan­dard­ized fail­ure rate “bath­tub curve”, [9]. See Fig­ure 2. Com­po­nents that are sub­ject to infant mor­tal­i­ty failed at the fac­to­ry and were removed from the sup­ply chain. Those fail­ing from “wear-out” are expect­ed to reach that point after 20 years. If this is not the case, then the main­te­nance instruc­tions for the sys­tem should include pre­ven­ta­tive main­te­nance tasks that require replac­ing crit­i­cal com­po­nents before they reach the pre­dict­ed MTTFd.

Diagram of a standardized bathtub-shaped failure rate curve.
Fig­ure 2 — Weibull Bath­tub Curve [9]
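
To get a feel for what the constant failure rate assumption implies over a 20-year mission time, here is a minimal sketch using the textbook exponential model for a single item with no diagnostics. This is not the PL calculation from [6]. It only shows how a given MTTFd translates into a probability of at least one dangerous failure during the mission time. The function name `prob_dangerous_failure` and the constants are my own.

```python
import math

HOURS_PER_YEAR = 8760        # 24 h x 365 d, continuous operation
MISSION_TIME_YEARS = 20      # default mission time assumed in the standard

# Mission time expressed in operating hours.
mission_time_hours = MISSION_TIME_YEARS * HOURS_PER_YEAR


def prob_dangerous_failure(mttfd_years: float,
                           mission_years: float = MISSION_TIME_YEARS) -> float:
    """Probability of at least one dangerous failure within the mission
    time, assuming a constant failure rate of 1/MTTFd (exponential model)."""
    if mttfd_years <= 0:
        raise ValueError("MTTFd must be positive")
    return 1.0 - math.exp(-mission_years / mttfd_years)


p_30 = prob_dangerous_failure(30)    # lower bound of the HIGH range
p_100 = prob_dangerous_failure(100)  # upper bound of the HIGH range
```

Under this simple model an MTTFd of 100 years still gives roughly an 18% chance of a dangerous failure in 20 years, and 30 years gives roughly 49%, which is why undetected faults in infrequently used functions deserve attention.
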
For sys­tems using Cat­e­go­ry 2 archi­tec­ture, the auto­mat­ic diag­nos­tic test rate must be at least 100x the demand rate. Keep in mind that this test rate is nor­mal­ly accom­plished auto­mat­i­cal­ly in the design of the con­trols, and is only relat­ed to the detectable safe or dan­ger­ous faults. Unde­tectable faults must have a prob­a­bil­i­ty of less than once in 20 years, and should be detect­ed by the “proof test”. More on that a bit lat­er.

Finally, the MTTFD of the test (diagnostic) channel must be greater than half that of the functional channel.
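
The two Category 2 assumptions quoted from [6, 4.5.4] lend themselves to a quick check. The sketch below assumes both rates are expressed in the same unit (per hour here) and both MTTFD values are in years. The function name and the example figures are hypothetical.

```python
def category2_assumptions_met(demand_rate_per_hour: float,
                              test_rate_per_hour: float,
                              mttfd_te_years: float,
                              mttfd_l_years: float) -> bool:
    """Check the two Category 2 assumptions of the simplified procedure."""
    # Demand rate must be no more than 1/100 of the test rate.
    test_rate_ok = demand_rate_per_hour <= test_rate_per_hour / 100
    # MTTFD of the test equipment must be larger than half the MTTFD
    # of the logic (functional) channel.
    test_channel_ok = mttfd_te_years > mttfd_l_years / 2
    return test_rate_ok and test_channel_ok


# Hypothetical example: the e-stop is pressed about once a week
# (1 demand per 168 hours) and the automatic test runs once per hour.
weekly_demand = 1 / 168
example_ok = category2_assumptions_met(weekly_demand, 1.0, 60, 100)
```

With one demand per week, the automatic test would need to run at least about once every 1.7 hours to keep the 100:1 ratio, so an hourly test satisfies the first assumption.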

Cat­e­go­ry 1 has no diag­nos­tics, so there is no guid­ance in [6] to help us out with these sys­tems. Cat­e­go­ry 3 is sin­gle fault tol­er­ant, so as long as we don’t have mul­ti­ple unde­tect­ed faults we can count on the sys­tem to func­tion and to alert us when a sin­gle fault occurs; remem­ber that the auto­mat­ic tests may not be able to detect every fault. This is where the “proof test” comes in. What is a proof test? To find a def­i­n­i­tion for the proof test, we have to look at IEC 61508–4 [10]:

3.8.5
proof test
peri­od­ic test per­formed to detect fail­ures in a safe­ty-relat­ed sys­tem so that, if nec­es­sary, the sys­tem can be restored to an “as new” con­di­tion or as close as prac­ti­cal to this con­di­tion

NOTE — The effec­tive­ness of the proof test will be depen­dent upon how close to the “as new” con­di­tion the sys­tem is restored. For the proof test to be ful­ly effec­tive, it will be nec­es­sary to detect 100% of all dan­ger­ous fail­ures. Although in prac­tice 100% is not eas­i­ly achieved for oth­er than low-com­plex­i­ty E/E/PE safe­ty-relat­ed sys­tems, this should be the tar­get. As a min­i­mum, all the safe­ty func­tions which are exe­cut­ed are checked accord­ing to the E/E/PES safe­ty require­ments spec­i­fi­ca­tion. If sep­a­rate chan­nels are used, these tests are done for each chan­nel sep­a­rate­ly.

The 20-year life cycle assump­tion used in the stan­dards also applies to proof test­ing. Machine con­trols are assumed to get at least one proof test in their life­time. The proof test should be designed to detect faults that the auto­mat­ic diag­nos­tics can­not detect. Proof tests are also con­duct­ed after major rebuilds and repairs to ensure that the sys­tem oper­ates cor­rect­ly.

If you know the architecture of the emergency stop control system, you can determine the test rate based on the demand rate. It would be considerably easier if the standards just gave us some minimum test rates for the various architectures. One standard, ISO 14119 [11] on interlocks, does just that. Admittedly, this standard does not include emergency stop functions within its scope, as its focus is on interlocks, but since interlocking systems are more critical than the complementary protective measures that back them up, it would be reasonable to apply these same rules. Looking at the clause on Assessment of Faults, [11, 8.2], we find this guidance:

For appli­ca­tions using inter­lock­ing devices with auto­mat­ic mon­i­tor­ing to achieve the nec­es­sary diag­nos­tic cov­er­age for the required safe­ty per­for­mance, a func­tion­al test (see IEC 60204–1:2005, 9.4.2.4) can be car­ried out every time the device changes its state, e.g. at every access. If, in such a case, there is only infre­quent access, the inter­lock­ing device shall be used with addi­tion­al mea­sures, because between con­sec­u­tive func­tion­al tests the prob­a­bil­i­ty of occur­rence of an unde­tect­ed fault is increased.

When a man­u­al func­tion­al test is nec­es­sary to detect a pos­si­ble accu­mu­la­tion of faults, it shall be made with­in the fol­low­ing test inter­vals:

  • at least every month for PLe with Cat­e­go­ry 3 or Cat­e­go­ry 4 (accord­ing to ISO 13849–1) or SIL 3 with HFT (hard­ware fault tol­er­ance) = 1 (accord­ing to IEC 62061);
  • at least every 12 months for PLd with Cat­e­go­ry 3 (accord­ing to ISO 13849–1) or SIL 2 with HFT (hard­ware fault tol­er­ance) = 1 (accord­ing to IEC 62061).

NOTE It is rec­om­mend­ed that the con­trol sys­tem of a machine demands these tests at the required inter­vals e.g. by visu­al dis­play unit or sig­nal lamp. The con­trol sys­tem should mon­i­tor the tests and stop the machine if the test is omit­ted or fails.

In the pre­ced­ing, HFT=1 is equiv­a­lent to say­ing that the sys­tem is sin­gle-fault tol­er­ant.
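
The manual test intervals quoted above can be captured in a small lookup. This sketch covers only the two PL and Category combinations named in the quoted clause and returns `None` for everything else, since the quote gives no interval for other cases. The function name is hypothetical.

```python
from typing import Optional


def max_manual_test_interval_months(pl: str, category: int) -> Optional[int]:
    """Longest interval between manual functional tests, in months,
    for the combinations listed in the quoted clause."""
    level = pl.lower()
    if level == "e" and category in (3, 4):
        return 1   # at least every month
    if level == "d" and category == 3:
        return 12  # at least every 12 months
    return None    # no interval given in the quoted text


interval_ple_cat4 = max_manual_test_interval_months("e", 4)
interval_pld_cat3 = max_manual_test_interval_months("d", 3)
interval_plc_cat1 = max_manual_test_interval_months("c", 1)
```

Returning `None` rather than a guessed default keeps the gap visible: a combination not covered by the quoted clause needs its own justification.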

This leaves us then with rec­om­mend­ed test fre­quen­cies for Cat­e­go­ry 2 and 3 archi­tec­tures in PLc, PLd, and PLe, or for SIL 2 and 3 with HFT=1. We still don’t have a test fre­quen­cy for PLc, Cat­e­go­ry 1 sys­tems. There is no explic­it guid­ance for these sys­tems in the stan­dards. How can we deter­mine a test rate for these sys­tems?

My approach would be to start by examining the MTTFd values for all of the subsystems and components. For a Category 1 system, the standard [6] requires a HIGH MTTFd value, meaning 30 years <= MTTFd <= 100 years [6, Table 5]. If this is the case, then the once-in-20-years proof test is theoretically enough. If the system is constructed, for example, as shown in the Basic Stop/Start Circuit figure below, then each of the four components in the channel would need an MTTFd of at least 120 years for the channel as a whole to reach 30 years. See [6, Annex C] for this calculation.

Basic Stop/Start Circuit
Fig­ure 2 — Basic Stop/Start Cir­cuit

PB1 — Emer­gency Stop But­ton

PB2 — Pow­er “ON” But­ton

MCR — Mas­ter Con­trol Relay

MOV — Surge Sup­pres­sor on MCR Coil

M1 — Machine prime mover (motor)

Note that the fus­es are not includ­ed, since they can only fail to safe­ty, and assum­ing that they were spec­i­fied cor­rect­ly in the orig­i­nal design, are not sub­ject to the same cycli­cal aging effects as the oth­er com­po­nents.

M1 is not includ­ed since it is the con­trolled por­tion of the machine and is not part of the con­trol sys­tem.
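
The 120-year figure comes from summing the failure rates of the components in a single channel: the reciprocal of the channel MTTFd is the sum of the reciprocals of the component MTTFd values. Here is a minimal sketch of that arithmetic for the four components counted above, assuming for illustration that each one has an MTTFd of exactly 120 years. The function name `series_mttfd` is hypothetical, and `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction


def series_mttfd(component_mttfd_years: dict) -> Fraction:
    """Channel MTTFd for components in series: 1/MTTFd = sum(1/MTTFd_i)."""
    if not component_mttfd_years:
        raise ValueError("at least one component is required")
    total_rate = sum(
        Fraction(1) / Fraction(value)
        for value in component_mttfd_years.values()
    )
    return Fraction(1) / total_rate


# Illustrative values only: every component assumed at 120 years.
channel = {"PB1": 120, "PB2": 120, "MCR": 120, "MOV": 120}
channel_mttfd = series_mttfd(channel)

# HIGH range quoted in the article: 30 years <= MTTFd <= 100 years.
is_high = 30 <= channel_mttfd <= 100
```

Four components at 120 years each give a channel value of exactly 30 years, the bottom of the HIGH range. If any one component is weaker, the channel drops below 30 years.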

If a review of the com­po­nents of the sys­tem shows that any sin­gle com­po­nent falls below the tar­get MTTFD, then I would con­sid­er replac­ing the sys­tem with a high­er cat­e­go­ry design. Since most of these com­po­nents will be unlike­ly to have MTTFD val­ues on the spec sheet, you will like­ly have to con­vert from total life val­ues (B10). This is out­side the scope of this arti­cle, but you can find guid­ance in [6, Annex C]. More fre­quent test­ing, i.e., more than once in 20 years, is always accept­able.

Where man­u­al test­ing is required as part of the design for any cat­e­go­ry of sys­tem, and par­tic­u­lar­ly in Cat­e­go­ry 1 or 2 sys­tems, the con­trol sys­tem should alert the user to the require­ment and not per­mit the machine to oper­ate until the test is com­plet­ed. This will help to ensure that the req­ui­site tests are prop­er­ly com­plet­ed.

Need more infor­ma­tion? Leave a com­ment below, or send me an email with the details of your appli­ca­tion!

Definitions

3.1.9 [8]
func­tion­al safe­ty
part of the over­all safe­ty relat­ing to the EUC and the EUC con­trol sys­tem which depends on the cor­rect func­tion­ing of the E/E/PE safe­ty-relat­ed sys­tems, oth­er tech­nol­o­gy safe­ty-relat­ed sys­tems and exter­nal risk reduc­tion facil­i­ties
3.2.6 [8]
electrical/electronic/programmable elec­tron­ic (E/E/PE)
based on elec­tri­cal (E) and/or elec­tron­ic (E) and/or pro­gram­ma­ble elec­tron­ic (PE) tech­nol­o­gy

NOTE — The term is intend­ed to cov­er any and all devices or sys­tems oper­at­ing on elec­tri­cal prin­ci­ples.

EXAMPLE Electrical/electronic/programmable elec­tron­ic devices include

  • electro­mechan­i­cal devices (elec­tri­cal);
  • sol­id-state non-pro­gram­ma­ble elec­tron­ic devices (elec­tron­ic);
  • elec­tron­ic devices based on com­put­er tech­nol­o­gy (pro­gram­ma­ble elec­tron­ic); see 3.2.5
3.5.1 [8]
safe­ty func­tion
func­tion to be imple­ment­ed by an E/E/PE safe­ty-relat­ed sys­tem, oth­er tech­nol­o­gy safe­ty-relat­ed sys­tem or exter­nal risk reduc­tion facil­i­ties, which is intend­ed to achieve or main­tain a safe state for the EUC, in respect of a spe­cif­ic haz­ardous event (see 3.4.1)
3.5.2 [10]
safe­ty integri­ty
prob­a­bil­i­ty of a safe­ty-relat­ed sys­tem sat­is­fac­to­ri­ly per­form­ing the required safe­ty func­tions under all the stat­ed con­di­tions with­in a stat­ed peri­od of time
NOTE 1 — The high­er the lev­el of safe­ty integri­ty of the safe­ty-relat­ed sys­tems, the low­er the prob­a­bil­i­ty that the safe­ty-relat­ed sys­tems will fail to car­ry out the required safe­ty func­tions.
NOTE 2 — There are four lev­els of safe­ty integri­ty for sys­tems (see 3.5.6).
3.5.6 [10]
safe­ty integri­ty lev­el (SIL)
dis­crete lev­el (one out of a pos­si­ble four) for spec­i­fy­ing the safe­ty integri­ty require­ments of the safe­ty func­tions to be allo­cat­ed to the E/E/PE safe­ty-relat­ed sys­tems, where safe­ty integri­ty lev­el 4 has the high­est lev­el of safe­ty integri­ty and safe­ty integri­ty lev­el 1 has the low­est
NOTE — The tar­get fail­ure mea­sures (see 3.5.13) for the four safe­ty integri­ty lev­els are spec­i­fied in tables 2 and 3 of IEC 61508–1.
3.6.3 [10]
fault tol­er­ance
abil­i­ty of a func­tion­al unit to con­tin­ue to per­form a required func­tion in the pres­ence of faults or errors
NOTE — The def­i­n­i­tion in IEV 191–15-05 refers only to sub-item faults. See the note for the term fault in 3.6.1.
[ISO/IEC 2382–14-04–061]
3.1.1 [6]
safety–related part of a con­trol sys­tem (SRP/CS)
part of a con­trol sys­tem that responds to safe­ty-relat­ed input sig­nals and gen­er­ates safe­ty-relat­ed out­put sig­nals
NOTE 1 The com­bined safe­ty-relat­ed parts of a con­trol sys­tem start at the point where the safe­ty-relat­ed input sig­nals are ini­ti­at­ed (includ­ing, for exam­ple, the actu­at­ing cam and the roller of the posi­tion switch) and end at the out­put of the pow­er con­trol ele­ments (includ­ing, for exam­ple, the main con­tacts of a con­tac­tor).
NOTE 2 If mon­i­tor­ing sys­tems are used for diag­nos­tics, they are also con­sid­ered as SRP/CS.
3.1.2 [6]
cat­e­go­ry
clas­si­fi­ca­tion of the safe­ty-relat­ed parts of a con­trol sys­tem in respect of their resis­tance to faults and their sub­se­quent behav­iour in the fault con­di­tion, and which is achieved by the struc­tur­al arrange­ment of the parts, fault detec­tion and/or by their reli­a­bil­i­ty
3.1.3 [6]
fault
state of an item char­ac­ter­ized by the inabil­i­ty to per­form a required func­tion, exclud­ing the inabil­i­ty dur­ing pre­ven­tive main­te­nance or oth­er planned actions, or due to lack of exter­nal resources

NOTE 1 A fault is often the result of a fail­ure of the item itself, but may exist with­out pri­or fail­ure.
[IEC 60050–191:1990, 05–01]

NOTE 2 In this part of ISO 13849, “fault” means ran­dom fault.

3.1.4 [6]
fail­ure
ter­mi­na­tion of the abil­i­ty of an item to per­form a required func­tion

NOTE 1 After a fail­ure, the item has a fault.

NOTE 2 “Fail­ure” is an event, as dis­tin­guished from “fault”, which is a state.

NOTE 3 The con­cept as defined does not apply to items con­sist­ing of soft­ware only.
[IEC 60050–191:1990, 04–01]

NOTE 4 Fail­ures which only affect the avail­abil­i­ty of the process under con­trol are out­side of the scope of this part of ISO 13849.

3.1.5 [6]
dan­ger­ous fail­ure
fail­ure which has the poten­tial to put the SRP/CS in a haz­ardous or fail-to-func­tion state

NOTE 1 Whether or not the poten­tial is real­ized can depend on the chan­nel archi­tec­ture of the sys­tem; in redun­dant sys­tems, a dan­ger­ous hard­ware fail­ure is less like­ly to lead to the over­all dan­ger­ous or fail-to-func­tion state.

NOTE 2 Adapt­ed from IEC 61508–4:1998, def­i­n­i­tion 3.6.7.

3.1.20 [6]
safe­ty func­tion
func­tion of the machine whose fail­ure can result in an imme­di­ate increase of the risk(s)
[ISO 12100–1:2003, 3.28]
3.1.21 [6]
mon­i­tor­ing
safe­ty func­tion which ensures that a pro­tec­tive mea­sure is ini­ti­at­ed if the abil­i­ty of a com­po­nent or an ele­ment to per­form its func­tion is dimin­ished or if the process con­di­tions are changed in such a way that a decrease of the amount of risk reduc­tion is gen­er­at­ed
3.1.22 [6]
pro­gram­ma­ble elec­tron­ic sys­tem (PES)
sys­tem for con­trol, pro­tec­tion or mon­i­tor­ing depen­dent for its oper­a­tion on one or more pro­gram­ma­ble elec­tron­ic devices, includ­ing all ele­ments of the sys­tem such as pow­er sup­plies, sen­sors and oth­er input devices, con­tac­tors and oth­er out­put devices

NOTE Adapt­ed from IEC 61508–4:1998, def­i­n­i­tion 3.3.2.

3.1.23 [6]
per­for­mance lev­el (PL)
dis­crete lev­el used to spec­i­fy the abil­i­ty of safe­ty-relat­ed parts of con­trol sys­tems to per­form a safe­ty func­tion under fore­see­able con­di­tions

NOTE See 4.5.1.

3.1.25 [6]
mean time to dan­ger­ous fail­ure (MTTFd)
expec­ta­tion of the mean time to dan­ger­ous fail­ure

NOTE Adapt­ed from IEC 62061:2005, def­i­n­i­tion 3.2.34.

3.1.26 [6]
diag­nos­tic cov­er­age (DC)
mea­sure of the effec­tive­ness of diag­nos­tics, which may be deter­mined as the ratio between the fail­ure rate of detect­ed dan­ger­ous fail­ures and the fail­ure rate of total dan­ger­ous fail­ures

NOTE 1 Diag­nos­tic cov­er­age can exist for the whole or parts of a safe­ty-relat­ed sys­tem. For exam­ple, diag­nos­tic cov­er­age could exist for sen­sors and/or log­ic sys­tem and/or final ele­ments.

NOTE 2 Adapt­ed from IEC 61508–4:1998, def­i­n­i­tion 3.8.6.
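
As an illustration of the ratio in this definition, here is a short Python sketch. It divides the detected dangerous failure rate by the total dangerous failure rate, then maps the result onto the DC ranges (none, low, medium, high) as I recall them from [6]. Check the thresholds against your edition of the standard. The function names and the example failure rates are assumptions made for this example.

```python
def diagnostic_coverage(lambda_dd, lambda_d):
    """DC = detected dangerous failure rate / total dangerous failure rate.

    Both rates must use the same units (for example, failures per hour).
    """
    if lambda_d <= 0:
        raise ValueError("total dangerous failure rate must be positive")
    if not 0 <= lambda_dd <= lambda_d:
        raise ValueError("detected rate must lie between 0 and the total rate")
    return lambda_dd / lambda_d


def dc_range(dc):
    """Name of the DC range; thresholds are assumed from [6]."""
    if dc < 0.60:
        return "none"
    if dc < 0.90:
        return "low"
    if dc < 0.99:
        return "medium"
    return "high"


if __name__ == "__main__":
    # Made-up rates: 95 of every 100 dangerous failures are detected.
    dc = diagnostic_coverage(95.0, 100.0)
    print(f"DC = {dc:.0%} ({dc_range(dc)})")
```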

3.1.33 [6]
safe­ty integri­ty lev­el (SIL)
dis­crete lev­el (one out of a pos­si­ble four) for spec­i­fy­ing the safe­ty integri­ty require­ments of the safe­ty func­tions to be allo­cat­ed to the E/E/PE safe­ty-relat­ed sys­tems, where safe­ty integri­ty lev­el 4 has the high­est lev­el of safe­ty integri­ty and safe­ty integri­ty lev­el 1 has the low­est
[IEC 61508–4:1998, 3.5.6]

Acknowledgements

Thanks to my col­leagues Derek Jones and Jonathan John­son, both from Rock­well Automa­tion, and mem­bers of ISO TC199. Their sug­ges­tion to ref­er­ence ISO 14119 clause 8.2 was the seed for this arti­cle.

I’d also like to acknowledge Ronald Sykes, Howard Touski, Mirela Moga, Michael Roland, and Grant Rider for asking the questions that led to this article.

References

[1] Safety of machinery – General principles for design – Risk assessment and risk reduction. ISO 12100. International Organization for Standardization (ISO). Geneva. 2010.

[2] Safeguarding of Machinery. CSA Z432. Canadian Standards Association. Toronto. 2004.

[3] Safety of machinery – Emergency stop – Principles for design. ISO 13850. International Organization for Standardization (ISO). Geneva. 2006.

[4] Electrical Standard for Industrial Machinery. NFPA 79. National Fire Protection Association (NFPA). Batterymarch Park. 2015.

[5] Safety of machinery – Electrical equipment of machines – Part 1: General requirements. IEC 60204-1. International Electrotechnical Commission (IEC). Geneva. 2009.

[6] Safety of machinery – Safety-related parts of control systems – Part 1: General principles for design. ISO 13849-1. International Organization for Standardization (ISO). Geneva. 2006.

[7] Safety of machinery – Risk assessment – Part 2: Practical guidance and examples of methods. ISO/TR 14121-2. International Organization for Standardization (ISO). Geneva. 2012.

[8] Safety of machinery – Functional safety of safety-related electrical, electronic and programmable electronic control systems. IEC 62061. International Electrotechnical Commission (IEC). Geneva. 2005.

[9] D. J. Wilkins (2002, November). “The Bathtub Curve and Product Failure Behavior. Part One – The Bathtub Curve, Infant Mortality and Burn-in”. Reliability HotWire [Online]. Available: http://www.weibull.com/hotwire/issue21/hottopics21.htm. [Accessed: 26-Apr-2015].

[10] Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations. IEC 61508-4. International Electrotechnical Commission (IEC). Geneva. 1998.

[11] Safety of machinery – Interlocking devices associated with guards – Principles for design and selection. ISO 14119. International Organization for Standardization (ISO). Geneva. 2013.

Sources for Standards

Canada

Canadian Standards Association sells CSA, ISO and IEC standards to the Canadian market.

USA

ANSI offers stan­dards from most US Stan­dards Devel­op­ment Orga­ni­za­tions. They also sell ISO and IEC stan­dards into the US mar­ket.


International

Inter­na­tion­al Orga­ni­za­tion for Stan­dard­iza­tion (ISO).

Inter­na­tion­al Elec­trotech­ni­cal Com­mis­sion (IEC).

Europe

Each EU member state has its own standards body. For reasons unknown to me, each standards body can set its own pricing for the documents it sells. All offer English language copies, in addition to copies in the official language(s) of the member state. My best advice is to shop around a bit. Prices can vary by as much as 10:1.

British Standards Institution (BSI) $$$

Dan­ish Stan­dards (DS) $

Eston­ian Stan­dards (EVS) $

Ger­man stan­dards (DIN) — Beuth Ver­lag GmbH