Emergency Stop Failures

This entry is part 13 of 13 in the series Emergency Stop

I am always look­ing for inter­est­ing examples of machinery safety prob­lems to share on MS101. Recently I was scrolling Reddit/​r/​OSHA and found these three real-​world examples.

Broken Emergency Stop Buttons

The first and most obvi­ous kinds of fail­ures are those res­ult­ing from either wear out or dam­age to emer­gency stop devices like e-​stop but­tons or pull cords. Here’s a great example:

Won’t be stop­ping this elev­at­or any­time soon. from OSHA

The oper­at­or device in this pic­ture has two prob­lems:

1) the but­ton oper­at­or has failed and

2) the e-​stop is incor­rectly marked.

The cor­rect mark­ing would be a yel­low back­ground in place of the red/​silver legend plate, like the example below. The yel­low back­ground could have the words “emer­gency stop” on it, but this is not neces­sary as the col­our com­bin­a­tion is enough.

Yellow circular legend plate with the words "emergency stop" in black letters. Fits A-B 800T pushbutton operators.
Allen-​Bradley 800T Emergency Stop legend plate

There is an ISO/​IEC sym­bol for an emer­gency stop that could also be used [1].

Emergency stop symbol. A circle containing an equilateral triangle pointing downward, containing an exclamation mark.
Emergency Stop Symbol IEC 60417-5638 [1]

I won­der how the con­tact block(s) inside the enclos­ure are doing? Contact blocks have been known to fall off the back of emer­gency stop oper­at­or but­tons, leav­ing you with a but­ton that does noth­ing when pressed. Contact blocks secured with screws are most vul­ner­able to this kind of fail­ure. Losing a con­tact block like this hap­pens most often in high-​vibration con­di­tions. I have run across this in real life while doing inspec­tions on cli­ent sites.

There are contact blocks made to detect this kind of failure, like Allen-Bradley's self-monitoring contact block, 800TC-XD4S, or the similar Siemens product, 3SB34. Most control component manufacturers are likely to offer similar components.

Here’s anoth­er example from a machine inspec­tion I did a while ago. Note the wire “keep­er” that pre­vents the but­ton from get­ting lost!


Installation Failures

Here is an example of poor plan­ning when installing new bar­ri­er guards. The emer­gency stop but­ton is now out of reach. The ori­gin­al poster does not indic­ate a reas­on why the emer­gency stop for the machine he was oper­at­ing was moun­ted on a dif­fer­ent machine.

sure hope i nev­er need to hit that emer­gency stop but­ton. its for the machine on my side of the new fence. from OSHA

No Emergency Stop at all

Finally, and possibly the worst example of all, here is an improvised emergency stop using a set of wire cutters. No further comment required.

Emergency stop but­ton. from OSHA

If you have any examples you would like to share, feel free to add them in com­ments below. References to par­tic­u­lar employ­ers or man­u­fac­tur­ers will be deleted before posts are approved.

References

[1]     “IEC 60417-5638, Emergency Stop”, Iso.org, 2017. [Online]. Available: https://www.iso.org/obp/ui/#iec:grs:60417:5638. [Accessed: 27-Jun-2017].

ISO 13849 – 1 Analysis — Part 6: CCF — Common Cause Failures

This entry is part 6 of 9 in the series How to do a 13849 – 1 ana­lys­is

What is a Common Cause Failure?

There are two similar-​sounding terms that people often get con­fused: Common Cause Failure (CCF) and Common Mode Failure. While these two types of fail­ures sound sim­il­ar, they are dif­fer­ent. A Common Cause Failure is a fail­ure in a sys­tem where two or more por­tions of the sys­tem fail at the same time from a single com­mon cause. An example could be a light­ning strike that causes a con­tact­or to weld and sim­ul­tan­eously takes out the safety relay pro­cessor that con­trols the con­tact­or. Common cause fail­ures are there­fore two dif­fer­ent man­ners of fail­ure in two dif­fer­ent com­pon­ents, but with a single cause.

Common Mode Failure is where two com­pon­ents or por­tions of a sys­tem fail in the same way, at the same time. For example, two inter­pos­ing relays both fail with wel­ded con­tacts at the same time. The fail­ures could be caused by the same cause or from dif­fer­ent causes, but the way the com­pon­ents fail is the same.

Common-​cause fail­ure includes com­mon mode fail­ure, since a com­mon cause can res­ult in a com­mon man­ner of fail­ure in identic­al devices used in a sys­tem.

Here are the form­al defin­i­tions of these terms:

3.1.6 com­mon cause fail­ure CCF

fail­ures of dif­fer­ent items, res­ult­ing from a single event, where these fail­ures are not con­sequences of each oth­er

Note 1 to entry: Common cause failures should not be confused with common mode failures (see ISO 12100:2010, 3.36). [SOURCE: IEC 60050-191-am1:1999, 04-23.] [1]

3.36 com­mon mode fail­ures

fail­ures of items char­ac­ter­ized by the same fault mode

NOTE Common mode failures should not be confused with common cause failures, as the common mode failures can result from different causes. [IEV 191-04-24] [3]

The “com­mon mode” fail­ure defin­i­tion uses the phrase “fault mode”, so let’s look at that as well:

fail­ure mode
DEPRECATED: fault mode
man­ner in which fail­ure occurs

Note 1 to entry: A failure mode may be defined by the function lost or other state transition that occurred. [IEV 192-03-17] [17]

As you can see, “fault mode” is no longer used, in favour of the more com­mon “fail­ure mode”, so it is pos­sible to re-​write the common-​mode fail­ure defin­i­tion to read, “fail­ures of items char­ac­ter­ised by the same man­ner of fail­ure.”

Random, Systematic and Common Cause Failures

Why do we need to care about this? There are three man­ners in which fail­ures occur: ran­dom fail­ures, sys­tem­at­ic fail­ures, and com­mon cause fail­ures. When devel­op­ing safety related con­trols, we need to con­sider all three and mit­ig­ate them as much as pos­sible.

Random fail­ures do not fol­low any pat­tern, occur­ring ran­domly over time, and are often brought on by over-​stressing the com­pon­ent, or from man­u­fac­tur­ing flaws. Random fail­ures can increase due to envir­on­ment­al or process-​related stresses, like cor­ro­sion, EMI, nor­mal wear-​and-​tear, or oth­er over-​stressing of the com­pon­ent or sub­sys­tem. Random fail­ures are often mit­ig­ated through selec­tion of high-​reliability com­pon­ents [18].

Systematic fail­ures include common-​cause fail­ures, and occur because some human beha­viour occurred that was not caught by pro­ced­ur­al means. These fail­ures are due to design, spe­cific­a­tion, oper­at­ing, main­ten­ance, and install­a­tion errors. When we look at sys­tem­at­ic errors, we are look­ing for things like train­ing of the sys­tem design­ers, or qual­ity assur­ance pro­ced­ures used to val­id­ate the way the sys­tem oper­ates. Systematic fail­ures are non-​random and com­plex, mak­ing them dif­fi­cult to ana­lyse stat­ist­ic­ally. Systematic errors are a sig­ni­fic­ant source of common-​cause fail­ures because they can affect redund­ant devices, and because they are often determ­in­ist­ic, occur­ring whenev­er a set of cir­cum­stances exist.

Systematic fail­ures include many types of errors, such as:

  • Manufacturing defects, e.g., soft­ware and hard­ware errors built into the device by the man­u­fac­turer.
  • Specification mis­takes, e.g. incor­rect design basis and inac­cur­ate soft­ware spe­cific­a­tion.
  • Implementation errors, e.g., improp­er install­a­tion, incor­rect pro­gram­ming, inter­face prob­lems, and not fol­low­ing the safety manu­al for the devices used to real­ise the safety func­tion.
  • Operation and main­ten­ance, e.g., poor inspec­tion, incom­plete test­ing and improp­er bypassing [18].

Diverse redund­ancy is com­monly used to mit­ig­ate sys­tem­at­ic fail­ures, since dif­fer­ences in com­pon­ent or sub­sys­tem design tend to cre­ate non-​overlapping sys­tem­at­ic fail­ures, redu­cing the like­li­hood of a com­mon error cre­at­ing a common-​mode fail­ure. Errors in spe­cific­a­tion, imple­ment­a­tion, oper­a­tion and main­ten­ance are not affected by diversity.

Fig 1 below shows the results of a small study done by the UK's Health and Safety Executive in 1994 [19] that supports the idea that systematic failures are a significant contributor to safety system failures. The study included only 34 systems (n=34), so the results cannot be considered conclusive. However, there were some startling results. As you can see, errors in the specification of the safety functions (Safety Requirement Specification) resulted in about 44% of the system failures in the study. Based on this small sample, systematic failures appear to be a significant source of failures.

Pie chart illustrating the proportion of failures in each phase of the life cycle of a machine, based on data taken from HSE Report HSG238.
Figure 1 – HSG 238 Primary Causes of Failure by Life Cycle Stage

Handling CCF in ISO 13849 – 1

Now that we under­stand WHAT Common-​Cause Failure is, and WHY it’s import­ant, we can talk about HOW it is handled in ISO 13849 – 1. Since ISO 13849 – 1 is inten­ded to be a sim­pli­fied func­tion­al safety stand­ard, CCF ana­lys­is is lim­ited to a check­list in Annex F, Table F.1. Note that Annex F is inform­at­ive, mean­ing that it is guid­ance mater­i­al to help you apply the stand­ard. Since this is the case, you could use any oth­er means suit­able for assess­ing CCF mit­ig­a­tion, like those in IEC 61508, or in oth­er stand­ards.

Table F.1 is set up with a series of mit­ig­a­tion meas­ures which are grouped togeth­er in related cat­egor­ies. Each group is provided with a score that can be claimed if you have imple­men­ted the mit­ig­a­tions in that group. ALL OF THE MEASURES in each group must be ful­filled in order to claim the points for that cat­egory. Here’s an example:

A portion of ISO 13849-1 Table F.1.
ISO 13849 – 1:2015, Table F.1 Excerpt

In order to claim the 20 points avail­able for the use of sep­ar­a­tion or segreg­a­tion in the sys­tem design, there must be a sep­ar­a­tion between the sig­nal paths. Several examples of this are giv­en for clar­ity.

Table F.1 lists six groups of mitigation measures. In order to claim adequate CCF mitigation, a minimum score of 65 points must be achieved. The CCF requirement applies only to Category 2, 3 and 4 architectures, but for those architectures you cannot claim the PL without meeting it, regardless of whether the design meets the other criteria.
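The scoring rule described above (all-or-nothing points per group, 65 points minimum) can be sketched in a few lines of Python. This is only an illustration: the group names and point values below are placeholders I made up for the example, and the real measures and scores must be taken from Table F.1 of the standard.

```python
from dataclasses import dataclass

CCF_MINIMUM_SCORE = 65  # minimum total stated in the text above


@dataclass
class MeasureGroup:
    name: str                       # hypothetical label, not the Table F.1 wording
    points: int                     # placeholder value; use the Table F.1 score
    measures_fulfilled: list[bool]  # one flag per measure in the group

    def claimed_points(self) -> int:
        # Points are claimed only if EVERY measure in the group is fulfilled.
        if self.measures_fulfilled and all(self.measures_fulfilled):
            return self.points
        return 0


def ccf_score(groups: list[MeasureGroup]) -> int:
    """Sum the points of the groups that are completely fulfilled."""
    return sum(g.claimed_points() for g in groups)


def ccf_adequate(groups: list[MeasureGroup]) -> bool:
    """True when the total reaches the 65-point minimum."""
    return ccf_score(groups) >= CCF_MINIMUM_SCORE


groups = [
    MeasureGroup("separation/segregation", 20, [True]),
    MeasureGroup("diversity", 20, [True, False]),  # one measure missing: 0 points
    MeasureGroup("environmental", 30, [True, True]),
    MeasureGroup("design/application", 20, [True]),
]
print(ccf_score(groups), ccf_adequate(groups))
```

Note how the partially implemented group contributes nothing to the total, which is the point of the "all of the measures" rule.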

One final note on CCF: If you are try­ing to review an exist­ing con­trol sys­tem, say in an exist­ing machine, or in a machine designed by a third party where you have no way to determ­ine the exper­i­ence and train­ing of the design­ers or the cap­ab­il­ity of the company’s change man­age­ment pro­cess, then you can­not adequately assess CCF [8]. This fact is recog­nised in CSA Z432-​16 [20], chapter 8. [20] allows the review­er to simply veri­fy that the archi­tec­tur­al require­ments, exclus­ive of any prob­ab­il­ist­ic require­ments, have been met. This is par­tic­u­larly use­ful for engin­eers review­ing machinery under Ontario’s Pre-​Start Health and Safety require­ments [21], who are fre­quently work­ing with less-​than-​complete design doc­u­ment­a­tion.

In case you missed the first part of the series, you can read it here. In the next art­icle in this series, I’m going to review the pro­cess flow for sys­tem ana­lys­is as cur­rently out­lined in ISO 13849 – 1. Watch for it!

Book List

Here are some books that I think you may find help­ful on this jour­ney:

[0]     B. Main, Risk Assessment: Basics and Benchmarks, 1st ed. Ann Arbor, MI USA: DSE, 2004.

[0.1]  D. Smith and K. Simpson, Safety crit­ic­al sys­tems hand­book. Amsterdam: Elsevier/​Butterworth-​Heinemann, 2011.

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

[0.3]  Overview of techniques and measures related to EMC for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2013.

References

Note: This ref­er­ence list starts in Part 1 of the series, so “miss­ing” ref­er­ences may show in oth­er parts of the series. The com­plete ref­er­ence list is included in the last post of the series.

[1]     Safety of machinery — Safety-​related parts of con­trol sys­tems — Part 1: General prin­ciples for design. 3rd Edition. ISO Standard 13849 – 1. 2015.

[2]     Safety of machinery – Safety-​related parts of con­trol sys­tems – Part 2: Validation. 2nd Edition. ISO Standard 13849 – 2. 2012.

[3]      Safety of machinery – General prin­ciples for design – Risk assess­ment and risk reduc­tion. ISO Standard 12100. 2010.

[8]     S. Jocelyn, J. Baudoin, Y. Chinniah, and P. Charpentier, “Feasibility study and uncer­tain­ties in the val­id­a­tion of an exist­ing safety-​related con­trol cir­cuit with the ISO 13849 – 1:2006 design stand­ard,” Reliab. Eng. Syst. Saf., vol. 121, pp. 104 – 112, Jan. 2014.

[17]      “fail­ure mode”, 192 – 03-​17, International Electrotechnical Vocabulary. IEC International Electrotechnical Commission, Geneva, 2015.

[18]      M. Gentile and A. E. Summers, “Common Cause Failure: How Do You Manage Them?,” Process Saf. Prog., vol. 25, no. 4, pp. 331 – 338, 2006.

[19]     Out of Control — Why con­trol sys­tems go wrong and how to pre­vent fail­ure, 2nd ed. Richmond, Surrey, UK: HSE Health and Safety Executive, 2003.

[20]     Safeguarding of Machinery. 3rd Edition. CSA Standard Z432. 2016.

[21]     O. Reg. 851, INDUSTRIAL ESTABLISHMENTS. Ontario, Canada, 1990.

Testing Emergency Stop Systems

This entry is part 11 of 13 in the series Emergency Stop

Emergency Stop on machine console

I've had a number of questions from readers regarding testing of emergency stop systems, and particularly regarding the frequency of testing. I addressed the types of tests that might be needed in another article covering Checking Emergency Stop Systems. This article will focus on the frequency of testing rather than the types of tests.

The Problem

Emergency stop systems are considered to be “complementary protective measures” in key machinery safety standards like ISO 12100 [1], and CSA Z432 [2]; this makes emergency stop systems the backup to the primary safeguards. Complementary protective measures are intended to permit “avoiding or limiting the harm” that may result from an emergent situation. By definition, this is a situation that has not been foreseen by the machine builder, or is the result of another failure. This could be a failure of another safeguarding system, or a failure in the machine that is not controlled by other means, e.g., a workpiece shatters due to a material flaw, and the broken pieces damage the machine, creating new, uncontrolled failure conditions in the machine.

Emergency stop systems are manually triggered, and usually infrequently used. The lack of use means that functional testing of the system doesn't happen in the normal course of operation of the machinery. Some types of faults may occur and remain undetected until the system is actually used, e.g., contact blocks falling off the back of the operator device. Failure at that point may be catastrophic, since by implication the primary safeguards have already failed, and thus the failure of the backup eliminates the possibility of avoiding or limiting harm.

To under­stand the test­ing require­ments, it’s import­ant to under­stand the risk and reli­ab­il­ity require­ments that drive the design of emer­gency stop sys­tems, and then get into the test fre­quency ques­tion.

Requirements

In the past, there were no expli­cit require­ments for emer­gency stop sys­tem reli­ab­il­ity. Details like the col­our of the oper­at­or device, or the way the stop func­tion worked were defined in ISO 13850 [3], NFPA 79 [4], and IEC 60204 – 1 [5]. In the soon-​to-​be pub­lished 3rd edi­tion of ISO 13850, a new pro­vi­sion requir­ing emer­gency stop sys­tems to meet at least PLc will be added [6], but until pub­lic­a­tion, it is up to the design­er to determ­ine the safety integ­rity level, either PL or SIL, required. To determ­ine the require­ments for any safety func­tion, the key is to start at the risk assess­ment. The risk assess­ment pro­cess requires that the design­er under­stand the stage in the life cycle of the machine, the task(s) that will be done, and the spe­cif­ic haz­ards that a work­er may be exposed to while con­duct­ing the task. This can become quite com­plex when con­sid­er­ing main­ten­ance and ser­vice tasks, and also applies to fore­see­able fail­ure modes of the machinery or the pro­cess. The scor­ing or rank­ing of risk can be accom­plished using any suit­able risk scor­ing tool that meets the min­im­um require­ments in [1]. There are some good examples giv­en in ISO/​TR 14121 – 2 [7] if you are look­ing for some guid­ance. There are many good engin­eer­ing text­books avail­able as well. Have a look at our Book List for some sug­ges­tions if you want a deep­er dive.

Reliability

Once the ini­tial unmit­ig­ated risk is under­stood, risk con­trol meas­ures can be spe­cified. Wherever the con­trol sys­tem is used as part of the risk con­trol meas­ure, a safety func­tion must be spe­cified. Specification of the safety func­tion includes the Performance Level (PL), archi­tec­tur­al cat­egory (B, 1 – 4), Mean Time to Dangerous Failure (MTTFd), and Diagnostic Coverage (DC) [6], or Safety Integrity Level (SIL), and Hardware Fault Tolerance (HFT), as described in IEC 62061 [8], as a min­im­um. If you are unfa­mil­i­ar with these terms, see the defin­i­tions at the end of the art­icle.

Referring to Figure 1, the “Risk Graph” [6, Annex A], we can reasonably state that for most machinery, a failure mode or emergent condition is likely to create conditions where the severity of injury is likely to require more than basic first aid, so selecting “S2” is the first step. In these situations, and particularly where the failure modes are not well understood, the highest level of severity of injury, S2, is selected because we don't have enough information to expect that the injuries would only be minor. As soon as we make this selection, it is no longer possible to select any combination of Frequency or Probability parameters that will result in anything lower than PLc.

It’s import­ant to under­stand that Figure 1 is not a risk assess­ment tool, but rather a decision tree used to select an appro­pri­ate PL based on the rel­ev­ant risk para­met­ers. Those para­met­ers are:

Table 1 – Risk Parameters

Severity of injury | Frequency and/or exposure to hazard | Possibility of avoiding hazard or limiting harm
S1 – slight (normally reversible injury) | F1 – seldom-to-less-often and/or exposure time is short | P1 – possible under specific conditions
S2 – serious (normally irreversible injury or death) | F2 – frequent-to-continuous and/or exposure time is long | P2 – scarcely possible

Decision tree used to determine PL based on risk parameters.
Figure 1 – “Risk Graph” for determining PL
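
For readers who prefer code to decision trees, the risk graph can be written as a simple lookup table. The sketch below reflects my reading of the graph in [6, Annex A]; treat it as an illustration and check the outcomes against the standard before relying on it. The function and variable names are my own.

```python
# Outcomes of the risk graph, keyed by (severity, frequency, possibility of avoidance).
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}


def required_pl(severity: str, frequency: str, avoidance: str) -> str:
    """Return the required Performance Level for one path through the graph."""
    return RISK_GRAPH[(severity, frequency, avoidance)]


# Lowest PL reachable once S2 has been selected (letters sort a < b < c < d < e).
lowest_pl_for_s2 = min(pl for (s, _, _), pl in RISK_GRAPH.items() if s == "S2")

print(required_pl("S2", "F1", "P1"))
print(lowest_pl_for_s2)
```

The second value printed confirms the point made above: once S2 is chosen, no combination of F and P parameters yields anything lower than PLc.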

PLc can be accom­plished using any of three archi­tec­tures: Category 1, 2, or 3. If you are unsure about what these archi­tec­tures rep­res­ent, have a look at my series cov­er­ing this top­ic.

Category 1 is single chan­nel, and does not include any dia­gnostics. A single fault can cause the loss of the safety func­tion (i.e., the machine still runs even though the e-​stop but­ton is pressed). Using Category 1, the reli­ab­il­ity of the design is based on the use of highly reli­able com­pon­ents and well-​tried safety prin­ciples. This approach can fail to danger.

Category 2 adds some dia­gnost­ic cap­ab­il­ity to the basic single chan­nel con­fig­ur­a­tion and does not require the use of “well-​tried” com­pon­ents. This approach can also fail to danger.

Category 3 archi­tec­ture adds a redund­ant chan­nel, and includes dia­gnost­ic cov­er­age. Category 3 is not sub­ject to fail­ure due to single faults and is called “single-​fault tol­er­ant”. This approach is less likely to fail to danger, but still can in the pres­ence of mul­tiple, undetec­ted, faults.

A key concept in reliability is the “fault”. This can be any kind of defect in hardware or software that results in unwanted behaviour or a failure. Faults are further broken down into dangerous and safe faults, meaning those that result in a dangerous outcome, and those that do not. Finally, each of these classes is broken down into detectable and undetectable faults. I'm not going to get into the mathematical treatment of these classes, but my point is this: there are undetectable dangerous faults. These are faults that cannot be detected by built-in diagnostics. As designers, we try to design the control system so that the undetectable dangerous faults are extremely rare; ideally, such a fault should occur much less than once in the lifetime of the machine.

What is the life­time of the machine? The stand­ards writers have settled on a default life­time of 20 years, thus the answer is that undetect­able dan­ger­ous fail­ures should hap­pen much less than once in twenty years of 24/​7/​365 oper­a­tion. So why does this mat­ter? Each archi­tec­tur­al cat­egory has dif­fer­ent require­ments for test­ing. The test rates are driv­en by the “Demand Rate”. The Demand Rate is defined in [6]. “SRP/​CS” stands for “Safety Related Part of the Control System” in the defin­i­tion:

3.1.30
demand rate (rd) – fre­quency of demands for a safety-​related action of the SRP/​CS

Each time the emer­gency stop but­ton is pressed, a “demand” is put on the sys­tem. Looking at the “Simplified Procedure for estim­at­ing PL”, [6, 4.5.4], we find that the stand­ard makes the fol­low­ing assump­tions:

  • mis­sion time, 20 years (see Clause 10);
  • con­stant fail­ure rates with­in the mis­sion time;
  • for cat­egory 2, demand rate <= 1/​100 test rate;
  • for category 2, MTTFD,TE larger than half of MTTFD,L.

NOTE When blocks of each chan­nel can­not be sep­ar­ated, the fol­low­ing can be applied: MTTFD of the sum­mar­ized test chan­nel (TE, OTE) lar­ger than half MTTFD of the sum­mar­ized func­tion­al chan­nel (I, L, O).

So what does all that mean? The 20-​year mis­sion time is the assumed life­time of the machinery. This num­ber under­pins the rest of the cal­cu­la­tions in the stand­ard and is based on the idea that few mod­ern con­trol sys­tems last longer than 20 years without being replaced or rebuilt. The con­stant fail­ure rate points at the idea that sys­tems used in the field will have com­pon­ents and con­trols that are not sub­ject to infant mor­tal­ity, nor are they old enough to start to fail due to age, but rather that the sys­tem is oper­at­ing in the flat por­tion of the stand­ard­ized fail­ure rate “bathtub curve”, [9]. See Figure 2. Components that are sub­ject to infant mor­tal­ity failed at the fact­ory and were removed from the sup­ply chain. Those fail­ing from “wear-​out” are expec­ted to reach that point after 20 years. If this is not the case, then the main­ten­ance instruc­tions for the sys­tem should include pre­vent­at­ive main­ten­ance tasks that require repla­cing crit­ic­al com­pon­ents before they reach the pre­dicted MTTFd.

Diagram of a standardized bathtub-shaped failure rate curve.
Figure 2 – Weibull Bathtub Curve [9]
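
If you want to see where the three regions of the bathtub curve come from, each one can be approximated with a Weibull hazard function, h(t) = (β/η)(t/η)^(β-1), where a shape parameter β below 1 gives a falling failure rate, β equal to 1 gives a constant rate, and β above 1 gives a rising rate. The sketch below is illustrative only; the 50-year scale parameter and the shape values are numbers I picked for the example, not data from any standard.

```python
def weibull_hazard(t_years: float, shape: float, scale_years: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    if t_years <= 0 or shape <= 0 or scale_years <= 0:
        raise ValueError("time, shape and scale must all be positive")
    return (shape / scale_years) * (t_years / scale_years) ** (shape - 1)


SCALE_YEARS = 50.0  # hypothetical characteristic life

regions = [
    ("infant mortality", 0.5),  # shape < 1: failure rate falls with time
    ("useful life", 1.0),       # shape = 1: constant failure rate
    ("wear-out", 3.0),          # shape > 1: failure rate rises with time
]

for label, shape in regions:
    rates = [weibull_hazard(t, shape, SCALE_YEARS) for t in (1.0, 10.0, 20.0)]
    print(label, [f"{r:.4f}" for r in rates])
```

The constant-failure-rate assumption in the standard corresponds to the middle case, where the rate is the same at 1, 10 and 20 years.
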
For sys­tems using Category 2 archi­tec­ture, the auto­mat­ic dia­gnost­ic test rate must be at least 100x the demand rate. Keep in mind that this test rate is nor­mally accom­plished auto­mat­ic­ally in the design of the con­trols, and is only related to the detect­able safe or dan­ger­ous faults. Undetectable faults must have a prob­ab­il­ity of less than once in 20 years, and should be detec­ted by the “proof test”. More on that a bit later.

Finally, the MTTFD of the test equipment (the diagnostic system) must be larger than half the MTTFD of the functional channel.
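
The two Category 2 assumptions quoted from [6, 4.5.4] are easy to check numerically. Here is a minimal sketch; the function name and the example figures (an e-stop pressed about once a month, an hourly automatic test) are my own assumptions for illustration.

```python
def category2_assumptions_met(
    demands_per_year: float,
    tests_per_year: float,
    mttfd_te_years: float,
    mttfd_l_years: float,
) -> bool:
    """Check the two Category 2 assumptions of the simplified procedure."""
    # Demand rate must not exceed 1/100 of the test rate.
    demand_ok = demands_per_year <= tests_per_year / 100
    # MTTFD of the test equipment must be larger than half that of the logic.
    mttfd_ok = mttfd_te_years > mttfd_l_years / 2
    return demand_ok and mttfd_ok


# Hypothetical example: 12 demands per year, automatic test once per hour.
print(category2_assumptions_met(12, 24 * 365, 40.0, 60.0))
# Same system, but tested only once a month: the test rate is far too low.
print(category2_assumptions_met(12, 12, 40.0, 60.0))
```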

Category 1 has no dia­gnostics, so there is no guid­ance in [6] to help us out with these sys­tems. Category 3 is single fault tol­er­ant, so as long as we don’t have mul­tiple undetec­ted faults we can count on the sys­tem to func­tion and to alert us when a single fault occurs; remem­ber that the auto­mat­ic tests may not be able to detect every fault. This is where the “proof test” comes in. What is a proof test? To find a defin­i­tion for the proof test, we have to look at IEC 61508 – 4 [10]:

3.8.5
proof test
peri­od­ic test per­formed to detect fail­ures in a safety-​related sys­tem so that, if neces­sary, the sys­tem can be restored to an “as new” con­di­tion or as close as prac­tic­al to this con­di­tion

NOTE – The effect­ive­ness of the proof test will be depend­ent upon how close to the “as new” con­di­tion the sys­tem is restored. For the proof test to be fully effect­ive, it will be neces­sary to detect 100% of all dan­ger­ous fail­ures. Although in prac­tice 100% is not eas­ily achieved for oth­er than low-​complexity E/​E/​PE safety-​related sys­tems, this should be the tar­get. As a min­im­um, all the safety func­tions which are executed are checked accord­ing to the E/​E/​PES safety require­ments spe­cific­a­tion. If sep­ar­ate chan­nels are used, these tests are done for each chan­nel sep­ar­ately.

The 20-​year life cycle assump­tion used in the stand­ards also applies to proof test­ing. Machine con­trols are assumed to get at least one proof test in their life­time. The proof test should be designed to detect faults that the auto­mat­ic dia­gnostics can­not detect. Proof tests are also con­duc­ted after major rebuilds and repairs to ensure that the sys­tem oper­ates cor­rectly.

If you know the architecture of the emergency stop control system, you can determine the test rate based on the demand rate. It would be considerably easier if the standards just gave us some minimum test rates for the various architectures. One standard, ISO 14119 [11] on interlocks, does just that. Admittedly, this standard does not include emergency stop functions within its scope, as its focus is on interlocks, but since interlocking systems are more critical than the complementary protective measures that back them up, it would be reasonable to apply these same rules. Looking at the clause on Assessment of Faults, [11, 8.2], we find this guidance:

For applic­a­tions using inter­lock­ing devices with auto­mat­ic mon­it­or­ing to achieve the neces­sary dia­gnost­ic cov­er­age for the required safety per­form­ance, a func­tion­al test (see IEC 60204 – 1:2005, 9.4.2.4) can be car­ried out every time the device changes its state, e.g. at every access. If, in such a case, there is only infre­quent access, the inter­lock­ing device shall be used with addi­tion­al meas­ures, because between con­sec­ut­ive func­tion­al tests the prob­ab­il­ity of occur­rence of an undetec­ted fault is increased.

When a manu­al func­tion­al test is neces­sary to detect a pos­sible accu­mu­la­tion of faults, it shall be made with­in the fol­low­ing test inter­vals:

  • at least every month for PLe with Category 3 or Category 4 (accord­ing to ISO 13849 – 1) or SIL 3 with HFT (hard­ware fault tol­er­ance) = 1 (accord­ing to IEC 62061);
  • at least every 12 months for PLd with Category 3 (accord­ing to ISO 13849 – 1) or SIL 2 with HFT (hard­ware fault tol­er­ance) = 1 (accord­ing to IEC 62061).

NOTE It is recom­men­ded that the con­trol sys­tem of a machine demands these tests at the required inter­vals e.g. by visu­al dis­play unit or sig­nal lamp. The con­trol sys­tem should mon­it­or the tests and stop the machine if the test is omit­ted or fails.

In the pre­ced­ing, HFT=1 is equi­val­ent to say­ing that the sys­tem is single-​fault tol­er­ant.
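
The intervals in the quoted passage, together with the NOTE about having the control system demand the test, can be sketched as a lookup plus a due-date check. This is only an illustration of the idea, not safety-rated code; the function names are my own, and clamping the due date to the 28th is a simplification I chose so the date is valid in every month.

```python
from datetime import date
from typing import Optional

# Maximum manual functional test intervals from the passage quoted above,
# keyed by (Performance Level, Category).
MAX_INTERVAL_MONTHS = {("e", 3): 1, ("e", 4): 1, ("d", 3): 12}


def max_interval_months(pl: str, category: int) -> Optional[int]:
    """Return the interval in months, or None where the passage gives none."""
    return MAX_INTERVAL_MONTHS.get((pl.lower(), category))


def due_date(last_test: date, interval_months: int) -> date:
    """Date by which the next manual test must be done."""
    month_index = last_test.month - 1 + interval_months
    year = last_test.year + month_index // 12
    month = month_index % 12 + 1
    day = min(last_test.day, 28)  # clamp so the date exists in every month
    return date(year, month, day)


def is_test_overdue(last_test: date, today: date, pl: str, category: int) -> bool:
    """True when the machine should be stopped until the test is repeated."""
    months = max_interval_months(pl, category)
    if months is None:
        raise ValueError("no interval given for this PL/Category combination")
    return today > due_date(last_test, months)


print(max_interval_months("e", 4))
print(is_test_overdue(date(2023, 6, 15), date(2024, 6, 16), "d", 3))
```

Notice that the lookup returns nothing for PLc, Category 1, which is exactly the gap discussed next.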

This leaves us then with recom­men­ded test fre­quen­cies for Category 2 and 3 archi­tec­tures in PLc, PLd, and PLe, or for SIL 2 and 3 with HFT=1. We still don’t have a test fre­quency for PLc, Category 1 sys­tems. There is no expli­cit guid­ance for these sys­tems in the stand­ards. How can we determ­ine a test rate for these sys­tems?

My approach would be to start by examining the MTTFd values for all of the subsystems and components. [6] requires that the system has HIGH MTTFd value, meaning 30 years <= MTTFd <= 100 years [6, Table 5]. If this is the case, then the once-in-20-years proof test is theoretically enough. If the system is constructed, for example, as shown in Figure 3 below, then each component would have to have an MTTFd > 120 years. See [6, Annex C] for this calculation.

Basic Stop/Start Circuit
Figure 3 – Basic Stop/Start Circuit

PB1 – Emergency Stop Button

PB2 – Power “ON” Button

MCR – Master Control Relay

MOV – Surge Suppressor on MCR Coil

M1 – Machine prime mover (motor)

Note that the fuses are not included, since they can only fail to safety, and assum­ing that they were spe­cified cor­rectly in the ori­gin­al design, are not sub­ject to the same cyc­lic­al aging effects as the oth­er com­pon­ents.

M1 is not included since it is the con­trolled por­tion of the machine and is not part of the con­trol sys­tem.
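
The 120-year figure mentioned earlier can be checked with a reciprocal-sum (parts count style) calculation, where the channel's 1/MTTFd is the sum of the components' 1/MTTFd values. The sketch assumes the channel is made up of the four listed components in series (PB1, PB2, MCR and MOV), which is my reading of the circuit rather than something stated explicitly.

```python
def series_mttfd(component_mttfd_years) -> float:
    """MTTFd of a single channel: reciprocal of the summed reciprocals."""
    return 1.0 / sum(1.0 / m for m in component_mttfd_years)


# Assumed: four series components, each with an MTTFd of exactly 120 years.
components = {"PB1": 120.0, "PB2": 120.0, "MCR": 120.0, "MOV": 120.0}
channel_mttfd = series_mttfd(components.values())

print(f"Channel MTTFd: {channel_mttfd:.1f} years")
```

With four equal components, the channel value is one quarter of the component value, so 120 years per component sits right at the 30-year threshold for HIGH MTTFd, which is why each component needs to exceed it.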

If a review of the com­pon­ents of the sys­tem shows that any single com­pon­ent falls below the tar­get MTTFD, then I would con­sider repla­cing the sys­tem with a high­er cat­egory design. Since most of these com­pon­ents will be unlikely to have MTTFD val­ues on the spec sheet, you will likely have to con­vert from total life val­ues (B10). This is out­side the scope of this art­icle, but you can find guid­ance in [6, Annex C]. More fre­quent test­ing, i.e., more than once in 20 years, is always accept­able.
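
For anyone who wants a starting point, here is a rough sketch of that conversion as I understand it from [6, Annex C]: the number of operations per year is estimated from the duty pattern, and MTTFd is B10d divided by one tenth of that number. Where only B10 is published, the annex suggests taking B10d as twice B10. The example figures (a button actuated once per 8-hour shift, 250 days a year, with a B10d of 100,000 cycles) are hypothetical, so verify the formulas and your data against the standard before using them.

```python
def operations_per_year(days_per_year: float, hours_per_day: float,
                        cycle_time_s: float) -> float:
    """n_op = d_op * h_op * 3600 s/h / t_cycle."""
    return days_per_year * hours_per_day * 3600 / cycle_time_s


def mttfd_from_b10d(b10d_cycles: float, n_op: float) -> float:
    """MTTFd in years = B10d / (0.1 * n_op)."""
    return b10d_cycles / (0.1 * n_op)


def b10d_from_b10(b10_cycles: float) -> float:
    """Estimate B10d as 2 x B10 when the dangerous fraction is unknown."""
    return 2 * b10_cycles


# Hypothetical duty: one actuation per 8-hour shift, 250 working days a year.
n_op = operations_per_year(250, 8, 8 * 3600)
print(f"n_op = {n_op:.0f} operations/year")
print(f"MTTFd = {mttfd_from_b10d(100_000, n_op):.0f} years")
```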

Where manu­al test­ing is required as part of the design for any cat­egory of sys­tem, and par­tic­u­larly in Category 1 or 2 sys­tems, the con­trol sys­tem should alert the user to the require­ment and not per­mit the machine to oper­ate until the test is com­pleted. This will help to ensure that the requis­ite tests are prop­erly com­pleted.

Need more inform­a­tion? Leave a com­ment below, or send me an email with the details of your applic­a­tion!

Definitions

3.1.9 [8]
func­tion­al safety
part of the over­all safety relat­ing to the EUC and the EUC con­trol sys­tem which depends on the cor­rect func­tion­ing of the E/​E/​PE safety-​related sys­tems, oth­er tech­no­logy safety-​related sys­tems and extern­al risk reduc­tion facil­it­ies
3.2.6 [8]
electrical/​electronic/​programmable elec­tron­ic (E/​E/​PE)
based on elec­tric­al (E) and/​or elec­tron­ic (E) and/​or pro­gram­mable elec­tron­ic (PE) tech­no­logy

NOTE – The term is inten­ded to cov­er any and all devices or sys­tems oper­at­ing on elec­tric­al prin­ciples.

EXAMPLE Electrical/​electronic/​programmable elec­tron­ic devices include

  • elec­tromech­an­ic­al devices (elec­tric­al);
  • solid-​state non-​programmable elec­tron­ic devices (elec­tron­ic);
  • elec­tron­ic devices based on com­puter tech­no­logy (pro­gram­mable elec­tron­ic); see 3.2.5
3.5.1 [8]
safety func­tion
func­tion to be imple­men­ted by an E/​E/​PE safety-​related sys­tem, oth­er tech­no­logy safety-​related sys­tem or extern­al risk reduc­tion facil­it­ies, which is inten­ded to achieve or main­tain a safe state for the EUC, in respect of a spe­cif­ic haz­ard­ous event (see 3.4.1)
3.5.2 [8]
safety integ­rity
prob­ab­il­ity of a safety-​related sys­tem sat­is­fact­or­ily per­form­ing the required safety func­tions under all the stated con­di­tions with­in a stated peri­od of time
NOTE 1 – The high­er the level of safety integ­rity of the safety-​related sys­tems, the lower the prob­ab­il­ity that the safety-​related sys­tems will fail to carry out the required safety func­tions.
NOTE 2 – There are four levels of safety integ­rity for sys­tems (see 3.5.6).
3.5.6 [10]
safety integ­rity level (SIL)
dis­crete level (one out of a pos­sible four) for spe­cify­ing the safety integ­rity require­ments of the safety func­tions to be alloc­ated to the E/​E/​PE safety-​related sys­tems, where safety integ­rity level 4 has the highest level of safety integ­rity and safety integ­rity level 1 has the low­est
NOTE – The target failure measures (see 3.5.13) for the four safety integrity levels are specified in tables 2 and 3 of IEC 61508-1.
3.6.3 [10]
fault tol­er­ance
abil­ity of a func­tion­al unit to con­tin­ue to per­form a required func­tion in the pres­ence of faults or errors
NOTE – The definition in IEV 191-15-05 refers only to sub-item faults. See the note for the term fault in 3.6.1.
[ISO/IEC 2382-14-04-061]
3.1.1 [6]
safety-related part of a control system (SRP/CS)
part of a con­trol sys­tem that responds to safety-​related input sig­nals and gen­er­ates safety-​related out­put sig­nals
NOTE 1 The com­bined safety-​related parts of a con­trol sys­tem start at the point where the safety-​related input sig­nals are ini­ti­ated (includ­ing, for example, the actu­at­ing cam and the roller of the pos­i­tion switch) and end at the out­put of the power con­trol ele­ments (includ­ing, for example, the main con­tacts of a con­tact­or).
NOTE 2 If mon­it­or­ing sys­tems are used for dia­gnostics, they are also con­sidered as SRP/​CS.
3.1.2 [6]
cat­egory
clas­si­fic­a­tion of the safety-​related parts of a con­trol sys­tem in respect of their res­ist­ance to faults and their sub­sequent beha­viour in the fault con­di­tion, and which is achieved by the struc­tur­al arrange­ment of the parts, fault detec­tion and/​or by their reli­ab­il­ity
3.1.3 [6]
fault
state of an item char­ac­ter­ized by the inab­il­ity to per­form a required func­tion, exclud­ing the inab­il­ity dur­ing pre­vent­ive main­ten­ance or oth­er planned actions, or due to lack of extern­al resources

NOTE 1 A fault is often the res­ult of a fail­ure of the item itself, but may exist without pri­or fail­ure.
[IEC 60050-191:1990, 05-01]

NOTE 2 In this part of ISO 13849, “fault” means ran­dom fault.

3.1.4 [6]
fail­ure
ter­min­a­tion of the abil­ity of an item to per­form a required func­tion

NOTE 1 After a fail­ure, the item has a fault.

NOTE 2 “Failure” is an event, as dis­tin­guished from “fault”, which is a state.

NOTE 3 The concept as defined does not apply to items con­sist­ing of soft­ware only.
[IEC 60050-191:1990, 04-01]

NOTE 4 Failures which only affect the avail­ab­il­ity of the pro­cess under con­trol are out­side of the scope of this part of ISO 13849.

3.1.5 [6]
dan­ger­ous fail­ure
fail­ure which has the poten­tial to put the SRP/​CS in a haz­ard­ous or fail-​to-​function state

NOTE 1 Whether or not the poten­tial is real­ized can depend on the chan­nel archi­tec­ture of the sys­tem; in redund­ant sys­tems, a dan­ger­ous hard­ware fail­ure is less likely to lead to the over­all dan­ger­ous or fail-​to-​function state.

NOTE 2 Adapted from IEC 61508-4:1998, definition 3.6.7.

3.1.20 [6]
safety func­tion
func­tion of the machine whose fail­ure can res­ult in an imme­di­ate increase of the risk(s)
[ISO 12100-1:2003, 3.28]
3.1.21 [6]
mon­it­or­ing
safety func­tion which ensures that a pro­tect­ive meas­ure is ini­ti­ated if the abil­ity of a com­pon­ent or an ele­ment to per­form its func­tion is dimin­ished or if the pro­cess con­di­tions are changed in such a way that a decrease of the amount of risk reduc­tion is gen­er­ated
3.1.22 [6]
pro­gram­mable elec­tron­ic sys­tem (PES)
sys­tem for con­trol, pro­tec­tion or mon­it­or­ing depend­ent for its oper­a­tion on one or more pro­gram­mable elec­tron­ic devices, includ­ing all ele­ments of the sys­tem such as power sup­plies, sensors and oth­er input devices, con­tact­ors and oth­er out­put devices

NOTE Adapted from IEC 61508-4:1998, definition 3.3.2.

3.1.23 [6]
per­form­ance level (PL)
dis­crete level used to spe­cify the abil­ity of safety-​related parts of con­trol sys­tems to per­form a safety func­tion under fore­see­able con­di­tions

NOTE See 4.5.1.

3.1.25 [6]
mean time to dan­ger­ous fail­ure (MTTFd)
expect­a­tion of the mean time to dan­ger­ous fail­ure

NOTE Adapted from IEC 62061:2005, defin­i­tion 3.2.34.

3.1.26 [6]
dia­gnost­ic cov­er­age (DC)
meas­ure of the effect­ive­ness of dia­gnostics, which may be determ­ined as the ratio between the fail­ure rate of detec­ted dan­ger­ous fail­ures and the fail­ure rate of total dan­ger­ous fail­ures

NOTE 1 Diagnostic cov­er­age can exist for the whole or parts of a safety-​related sys­tem. For example, dia­gnost­ic cov­er­age could exist for sensors and/​or logic sys­tem and/​or final ele­ments.

NOTE 2 Adapted from IEC 61508-4:1998, definition 3.8.6.

3.1.33 [6]
safety integ­rity level (SIL)
dis­crete level (one out of a pos­sible four) for spe­cify­ing the safety integ­rity require­ments of the safety func­tions to be alloc­ated to the E/​E/​PE safety-​related sys­tems, where safety integ­rity level 4 has the highest level of safety integ­rity and safety integ­rity level 1 has the low­est
[IEC 61508-4:1998, 3.5.6]
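
The diagnostic coverage definition (3.1.26) above is a simple ratio, and it can help to see it written out. The following is a minimal sketch only; the function name and the parameter names `lambda_dd` (detected dangerous failure rate) and `lambda_d` (total dangerous failure rate) are labels I have chosen for the example, and both rates are assumed to be in the same units.

```python
def diagnostic_coverage(lambda_dd: float, lambda_d: float) -> float:
    """Return DC as detected dangerous failure rate / total dangerous failure rate.

    Both arguments are assumed to be failure rates in the same units
    (for example, failures per hour).
    """
    if lambda_d <= 0:
        raise ValueError("total dangerous failure rate must be positive")
    if not 0 <= lambda_dd <= lambda_d:
        raise ValueError("detected rate must lie between 0 and the total rate")
    return lambda_dd / lambda_d
```

For example, if diagnostics detect nine out of every ten dangerous failures, the two rates are in a 9:10 ratio and the function returns a DC of about 0.9, or 90%.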

Acknowledgements

Thanks to my col­leagues Derek Jones and Jonathan Johnson, both from Rockwell Automation, and mem­bers of ISO TC199. Their sug­ges­tion to ref­er­ence ISO 14119 clause 8.2 was the seed for this art­icle.

I’d also like to acknowledge Ronald Sykes, Howard Touski, Mirela Moga, Michael Roland, and Grant Rider for asking the questions that led to this article.

References

[1] Safety of machinery – General principles for design – Risk assessment and risk reduction. ISO 12100. International Organization for Standardization (ISO). Geneva. 2010.

[2] Safeguarding of Machinery. CSA Z432. Canadian Standards Association. Toronto. 2004.

[3] Safety of machinery – Emergency stop – Principles for design. ISO 13850. International Organization for Standardization (ISO). Geneva. 2006.

[4] Electrical Standard for Industrial Machinery. NFPA 79. National Fire Protection Association (NFPA). Batterymarch Park. 2015.

[5] Safety of machinery – Electrical equipment of machines – Part 1: General requirements. IEC 60204-1. International Electrotechnical Commission (IEC). Geneva. 2009.

[6] Safety of machinery – Safety-related parts of control systems – Part 1: General principles for design. ISO 13849-1. International Organization for Standardization (ISO). Geneva. 2006.

[7] Safety of machinery – Risk assessment – Part 2: Practical guidance and examples of methods. ISO/TR 14121-2. International Organization for Standardization (ISO). Geneva. 2012.

[8] Safety of machinery – Functional safety of safety-related electrical, electronic and programmable electronic control systems. IEC 62061. International Electrotechnical Commission (IEC). Geneva. 2005.

[9] D. J. Wilkins (2002, November). “The Bathtub Curve and Product Failure Behavior. Part One – The Bathtub Curve, Infant Mortality and Burn-in”. Reliability HotWire [Online]. Available: http://www.weibull.com/hotwire/issue21/hottopics21.htm. [Accessed: 26-Apr-2015].

[10] Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations. IEC 61508-4. International Electrotechnical Commission (IEC). Geneva. 1998.

[11] Safety of machinery – Interlocking devices associated with guards – Principles for design and selection. ISO 14119. International Organization for Standardization (ISO). Geneva. 2013.

Sources for Standards

CANADA

The Canadian Standards Association (CSA) sells CSA, ISO and IEC standards to the Canadian market.

USA

ANSI offers stand­ards from most US Standards Development Organizations. They also sell ISO and IEC stand­ards into the US mar­ket.


International

International Organization for Standardization (ISO).

International Electrotechnical Commission (IEC).

Europe

Each EU member state has its own standards body. For reasons unknown to me, each standards body can set its own pricing for the documents it sells. All offer English language copies, in addition to copies in the official language(s) of the member state. My best advice is to shop around a bit. Prices can vary by as much as 10:1.

British Standards Institution (BSI) $$$

Danish Standards (DS) $

Estonian Standards (EVS) $

German stand­ards (DIN) – Beuth Verlag GmbH