Updated 2020-10-31. DN
I’ve had a number of questions from readers about testing emergency stop systems, and particularly about how often they should be tested. I addressed the types of tests that might be needed in another article covering Checking Emergency Stop Systems. This article will focus on the frequency of testing rather than the types of tests.
The Problem
Emergency stop systems are considered “complementary protective measures” in key machinery safety standards like ISO 12100 [1] and CSA Z432 [2]; this makes emergency stop systems the backup to the primary safeguards. Complementary protective measures are intended to facilitate “avoiding or limiting the harm” that may result from an emergent situation. By definition, this is a situation that has not been foreseen by the machine builder or is the result of another failure. This could be a failure of another safeguarding system or a failure in the machine that is not controlled by other means, e.g., a workpiece shatters due to a material flaw, and the broken pieces damage the machine, creating new, uncontrolled failure conditions in the machine.
Emergency stop systems are manually triggered and should be infrequently used. The lack of use means that functional testing of the system doesn’t happen in the normal course of operation of the machinery, unlike an interlocked guard door, for example. Some types of faults may occur and remain undetected until the system is used, e.g., contact blocks falling off the back of the operator device. Failure at that point may be catastrophic, since by implication the primary safeguards have already failed, and thus the failure of the backup eliminates the possibility of avoiding or limiting harm.
To understand the testing requirements, it’s important to understand the risk and reliability requirements that drive the design of emergency stop systems and then get into the test frequency question.
Requirements
In the past, there were no explicit requirements for emergency stop system reliability. Details like the colour of the operator device, or the way the stop function worked were defined in ISO 13850 [3], NFPA 79 [4], and IEC 60204-1 [5]. In the 3rd edition of ISO 13850, published in 2015, a new provision requiring emergency stop systems to meet at least PLc was added [6], guiding the designer to implement at least that Performance Level. To determine the requirements for any safety function, the key is to start with the risk assessment. The risk assessment process requires that the designer understand the stage in the life cycle of the machine, the task(s) that will be done, and the specific hazards that a worker may be exposed to while conducting the task. This can become quite complex when considering maintenance and service tasks, and also applies to foreseeable failure modes of the machinery or the process. The scoring or ranking of risk can be accomplished using any suitable risk scoring tool that meets the minimum requirements in [1]. There are some good examples given in ISO/TR 14121-2 [7] if you are looking for some guidance. There are many good engineering textbooks available as well. Have a look at our Book List for some suggestions if you want a deeper dive.
Reliability
Once the initial unmitigated risk is understood, risk control measures can be specified. Wherever the control system is used as part of the risk control measure, a safety function must be specified. Specification of the safety function includes the Performance Level (PL), structure category (B, 1-4), Mean Time to Dangerous Failure (MTTFD), Diagnostic Coverage (DCavg) [6], or Safety Integrity Level (SIL), and Hardware Fault Tolerance (HFT), as described in IEC 62061 [8], as a minimum. If you are unfamiliar with these terms, see the definitions at the end of the article.
Referring to Figure 1, the “Risk Graph” [6, Annex A], we can reasonably state that for most machinery, a failure mode or emergent condition is likely to create conditions where the severity of the injury is likely to require more than basic first aid, so selecting “S2” is the first step. In these situations, and particularly where the failure modes are not well understood, the highest level of severity of injury, S2, is selected because we don’t have enough information to expect that the injuries would only be minor. As soon as we make this selection, it is no longer possible to select any combination of Frequency or Probability parameters that will result in anything lower than PLc.
It’s important to understand that Figure 1 is not a risk assessment tool, but rather a decision tree used to select an appropriate PL based on the relevant risk parameters. Those parameters are:
Severity of injury | Frequency and/or exposure to hazard | Possibility of avoiding hazard or limiting harm |
---|---|---|
S1 – slight (normally reversible injury) | F1 – seldom-to-less-often and/or exposure time is short | P1 – possible under specific conditions |
S2 – serious (normally irreversible injury or death) | F2 – frequent-to-continuous and/or exposure time is long | P2 – scarcely possible |
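As a rough illustration, the decision tree can be written as a lookup table. The sketch below assumes the parameter-to-PL mapping of the risk graph in the 2006 edition of [6], Annex A; the names `RISK_GRAPH` and `required_pl` are hypothetical and used only for this example.

```python
# Illustrative lookup for the risk graph in ISO 13849-1:2006, Annex A.
# The mapping is an assumption based on that edition; verify it against
# the edition of the standard that applies to your machine.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}


def required_pl(severity: str, frequency: str, possibility: str) -> str:
    """Return the required Performance Level for one S/F/P combination."""
    key = (severity, frequency, possibility)
    if key not in RISK_GRAPH:
        raise ValueError(f"unknown risk parameters: {key}")
    return RISK_GRAPH[key]


def lowest_pl_for_severity(severity: str) -> str:
    """Lowest PL reachable once the severity parameter is fixed."""
    return min(pl for (s, _, _), pl in RISK_GRAPH.items() if s == severity)


if __name__ == "__main__":
    print(required_pl("S2", "F1", "P1"))
    print(lowest_pl_for_severity("S2"))
```

With S2 fixed, the lowest result the table can return is PLc, which is the point made in the text above.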
PLc can be accomplished using three structures: Category 1, 2, or 3. If you are unsure what these structures represent, look at the series covering this topic.
Category 1 is single-channel and does not include any diagnostics. A single fault can cause the loss of the safety function (i.e., the machine still runs even though the e-stop button is pressed). Using Category 1, the reliability of the design is based on the use of highly reliable components, basic and well-tried safety principles. This approach can fail to danger.
Category 2 adds some diagnostic capability to the basic single-channel configuration and does not require using “well-tried” components. This approach can also fail to danger due to the single functional channel.
Category 3 structure adds a redundant channel and includes diagnostic coverage (DCavg minimum = 65%). Category 3 is not subject to failure due to single faults and is called “single-fault tolerant”. This approach is less likely to fail to danger but still can in the presence of multiple, undetected faults and some common-cause failures.
A key concept in reliability is the “fault.” IEC defines the term “fault” in several ways (see electropedia.org), but the most fundamental definition is:
fault,
inability to perform as required, due to an internal state
Note 1 to entry: A fault of an item results from a failure, either of the item itself, or from a deficiency in an earlier stage of the life cycle, such as specification, design, manufacture or maintenance. See latent fault (192-04-08).
Note 2 to entry: Qualifiers, such as specification, design, manufacture, maintenance or misuse, may be used to indicate the cause of a fault.
Note 3 to entry: The type of fault may be associated with the type of associated failure, e.g. wear-out fault and wear-out failure.
Note 4 to entry: The adjective “faulty” designates an item having one or more faults.
IEC 60050, IEV 192-04-01
Faults can be any kind of hardware or software defect resulting in unwanted behaviour or failure. Faults are further broken down into dangerous and safe faults, meaning those that result in a dangerous outcome, and those that do not. Finally, each class is broken down into detectable and undetectable faults. I’m not going to get into the mathematical treatment of these classes, but my point is this: there are undetectable dangerous faults. These are faults that built-in diagnostics cannot detect. As designers, we try to design the control system so that undetectable dangerous faults are extremely rare; ideally, the probability should be much less than once in the machine’s lifetime.
What is the lifetime of the machine? The standards writers have settled on a default lifetime of 20 years; thus, the answer is that undetectable dangerous failures should happen much less than once in twenty years of 24/7/365 operation. So why does this matter? Each architectural category has different requirements for testing. The Demand Rate drives the test rate. The Demand Rate is defined in [6]. “SRP/CS” stands for “Safety Related Part of the Control System” in the definition:
3.1.30
demand rate (rd) – frequency of demands for a safety-related action of the SRP/CS
Each time the emergency stop button is pressed, a “demand” is put on the system. Looking at the “Simplified Procedure for estimating PL”, [6, 4.5.4], we find that the standard makes the following assumptions:
- mission time, 20 years (see Clause 10);
- constant failure rates within the mission time;
- for category 2, demand rate <= 1/100 test rate;
- for category 2, MTTFD,TE larger than half of MTTFD,L.
NOTE When blocks of each channel cannot be separated, the following can be applied: MTTFD of the summarized test channel (TE, OTE) larger than half MTTFD of the summarized functional channel (I, L, O).
So what does all that mean? The 20-year mission time is the assumed lifetime of the machinery. This number underpins the rest of the calculations in the standard and is based on the idea that few modern control systems last longer than 20 years without being replaced or rebuilt. The constant failure rate points to the idea that systems used in the field will have components and controls that are not subject to infant mortality, nor are they old enough to start to fail due to age, but rather that the system is operating in the flat portion of the standardized failure rate “bathtub curve,” [9]. See Figure 2. Components subject to infant mortality failed at the factory and were removed from the supply chain. Those failing from “wear-out” are expected to reach that point after 20 years. If this is not the case, then the maintenance instructions for the system should include preventative maintenance tasks that require replacing critical components before they reach the predicted MTTFd.
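To get a feel for what a constant failure rate implies, here is a minimal sketch. It assumes the usual exponential model for the flat part of the bathtub curve, with a dangerous failure rate of 1/MTTFD, and it looks at a single channel with no diagnostics. The function name is hypothetical and the model is a simplification, not the calculation method of [6].

```python
import math


def prob_dangerous_failure(mttfd_years: float, mission_years: float = 20.0) -> float:
    """Probability of at least one dangerous failure within the mission time.

    Assumes a constant failure rate (exponential model), i.e. the flat
    portion of the bathtub curve, with failure rate = 1 / MTTFD.
    """
    if mttfd_years <= 0 or mission_years < 0:
        raise ValueError("MTTFD must be positive and mission time non-negative")
    return 1.0 - math.exp(-mission_years / mttfd_years)


if __name__ == "__main__":
    for mttfd in (30.0, 100.0):
        p = prob_dangerous_failure(mttfd)
        print(f"MTTFD = {mttfd:5.0f} years -> P(failure in 20 years) = {p:.2f}")
```

Under these assumptions, an MTTFD of 30 years gives a little under a 50% chance of a dangerous failure in 20 years, and an MTTFD of 100 years gives roughly 18%. MTTFD is a statistical mean, not a guaranteed failure-free period.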
For systems using the Category 2 structure, the automatic diagnostic test rate must be at least 100x the demand rate. Keep in mind that this test rate is normally accomplished automatically in the design of the controls and is only related to the detectable safe or dangerous faults. Undetectable faults must have a probability of less than once in 20 years and should be detected by the “proof test.” More on that a bit later.
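Finally, the MTTFD of the test equipment (TE) must be greater than half the MTTFD of the logic (L) in the functional channel. In other words, the diagnostics cannot be much less reliable than the function they monitor.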
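The two Category 2 assumptions quoted from [6, 4.5.4] can be checked with a few lines of code. This is only a sketch: the function and parameter names are hypothetical, and both rates must be in the same unit (for example, events per year).

```python
def category2_assumptions_met(
    demand_rate: float,
    test_rate: float,
    mttfd_test_years: float,
    mttfd_logic_years: float,
) -> dict:
    """Check the two Category 2 assumptions of the simplified procedure.

    demand_rate and test_rate must use the same unit (e.g. per year).
    """
    # Assumption: demand rate <= 1/100 of the test rate.
    demand_ok = demand_rate <= test_rate / 100.0
    # Assumption: MTTFD of the test equipment is larger than half of the
    # MTTFD of the logic in the functional channel.
    mttfd_ok = mttfd_test_years > mttfd_logic_years / 2.0
    return {"demand_ok": demand_ok, "mttfd_ok": mttfd_ok, "all_ok": demand_ok and mttfd_ok}


if __name__ == "__main__":
    # Hypothetical numbers: e-stop pressed 12 times a year, automatic test
    # run once per shift on 250 working days a year.
    print(category2_assumptions_met(12, 250, 20, 30))
```

With these made-up numbers the demand-rate check fails (12 is more than 250/100), which shows why an automatic test that only runs at start-up is often not frequent enough for a Category 2 design.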
Category 1 has no diagnostics, so there is no guidance in [6] to help us with these systems. Category 3 is single fault tolerant, so as long as we don’t have multiple undetected faults, we can count on the system to function and alert us when a single fault occurs; remember that automatic tests may not be able to detect every fault. This is where the “proof test” comes in. What is a proof test? To find a definition for the proof test, we have to look at IEC 61508-4 [10]:
3.8.5
proof test
periodic test performed to detect failures in a safety-related system so that, if necessary, the system can be restored to an “as new” condition or as close as practical to this condition
NOTE – The effectiveness of the proof test will be dependent upon how close to the “as new” condition the system is restored. For the proof test to be fully effective, it will be necessary to detect 100% of all dangerous failures. Although in practice 100% is not easily achieved for other than low-complexity E/E/PE safety-related systems, this should be the target. As a minimum, all the safety functions which are executed are checked according to the E/E/PES safety requirements specification. If separate channels are used, these tests are done for each channel separately.
The 20-year life cycle assumption used in the standards also applies to proof testing. Machine controls are assumed to get at least one proof test in their lifetime. The proof test should be designed to detect faults that automatic diagnostics cannot. Proof tests are also conducted after major rebuilds and repairs to ensure the system operates correctly.
If you know the structure of the emergency stop control system, you can determine the test rate based on the demand rate. It would be considerably easier if the standards just gave us minimum test rates for the various architectures. One standard, ISO 14119 [11] on interlocks, does just that. This standard does not include emergency stop functions within its scope, as its focus is on interlocks. Still, since interlocking systems are more critical than the complementary protective measures that back them up, it would be reasonable to apply these same rules. Looking at the clause on Assessment of Faults, [11, 8.2], we find this guidance:
For applications using interlocking devices with automatic monitoring to achieve the necessary diagnostic coverage for the required safety performance, a functional test (see IEC 60204-1:2005, 9.4.2.4) can be carried out every time the device changes its state, e.g. at every access. If, in such a case, there is only infrequent access, the interlocking device shall be used with additional measures, because between consecutive functional tests the probability of occurrence of an undetected fault is increased.
When a manual functional test is necessary to detect a possible accumulation of faults, it shall be made within the following test intervals:
- at least every month for PLe with Category 3 or Category 4 (according to ISO 13849-1) or SIL 3 with HFT (hardware fault tolerance) = 1 (according to IEC 62061);
- at least every 12 months for PLd with Category 3 (according to ISO 13849-1) or SIL 2 with HFT (hardware fault tolerance) = 1 (according to IEC 62061).
NOTE It is recommended that the control system of a machine demands these tests at the required intervals e.g. by visual display unit or signal lamp. The control system should monitor the tests and stop the machine if the test is omitted or fails.
In the preceding, HFT=1 is equivalent to saying that the system is single-fault tolerant.
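The manual test intervals quoted above can be captured in a small lookup. This sketch covers only the PL and Category combinations named in the quoted text and returns `None` for anything else; the names are hypothetical.

```python
from typing import Optional

# Maximum manual functional test interval in months, keyed by
# (Performance Level, Category), taken from the ISO 14119 text quoted above.
MAX_TEST_INTERVAL_MONTHS = {
    ("e", 3): 1,
    ("e", 4): 1,
    ("d", 3): 12,
}


def max_manual_test_interval_months(pl: str, category: int) -> Optional[int]:
    """Return the longest allowed interval in months, or None if the
    quoted guidance does not cover this PL/Category combination."""
    return MAX_TEST_INTERVAL_MONTHS.get((pl.lower(), category))


if __name__ == "__main__":
    for pl, cat in (("e", 4), ("d", 3), ("c", 1)):
        print(pl, cat, max_manual_test_interval_months(pl, cat))
```

A `None` result means the quoted clause gives no explicit interval for that combination, not that testing is unnecessary.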
This leaves us with recommended test frequencies for Category 2 and 3 architectures in PLc, PLd, and PLe, or SIL 2 and 3 with HFT=1. We still don’t have a test frequency for PLc, Category 1 systems. There is no explicit guidance for these systems in the standards. How can we determine a test rate for these systems?
My approach would be to start by examining the MTTFD values for all of the subsystems and components. [6] requires that the system has a HIGH MTTFD value, meaning 30 years ≤ MTTFD ≤ 100 years [6, Table 5]. If this is the case, then the once-in-20-years proof test is theoretically enough. For example, if the system is constructed, as shown in Figure 2 below, each component would have to have an MTTFD > 120 years. See [6, Annex C] for this calculation.
PB1 – Emergency Stop Button
PB2 – Power “ON” Button
MCR – Master Control Relay
MOV – Surge Suppressor on MCR Coil
M1 – Machine prime mover (motor)
Note that the fuses are not included since they can only fail to safety, and assuming that they were specified correctly in the original design, are not subject to the same cyclical aging effects as the other components.
M1 is not included since it is the controlled portion of the machine and is not part of the control system.
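The 120-year figure can be reproduced with a short calculation. The sketch assumes the parts-count relationship for components in a single channel, where the reciprocals of the component MTTFD values add up to the reciprocal of the channel MTTFD. The function name is hypothetical, and the component values are the illustrative 120 years from the text, not manufacturer data.

```python
def channel_mttfd(component_mttfd_years: dict) -> float:
    """Combine component MTTFD values for one channel (parts-count style).

    1 / MTTFD_channel = sum(1 / MTTFD_i) over all components in the channel.
    """
    if not component_mttfd_years:
        raise ValueError("at least one component is required")
    if any(v <= 0 for v in component_mttfd_years.values()):
        raise ValueError("all MTTFD values must be positive")
    return 1.0 / sum(1.0 / v for v in component_mttfd_years.values())


if __name__ == "__main__":
    # The four components counted in the example circuit; fuses and M1
    # are left out for the reasons given in the text.
    components = {"PB1": 120.0, "PB2": 120.0, "MCR": 120.0, "MOV": 120.0}
    print(f"Channel MTTFD = {channel_mttfd(components):.1f} years")
```

Four components at 120 years each give a channel MTTFD of 30 years, which is the lower bound of the HIGH range. Any component below 120 years pulls the channel below that bound unless the others are correspondingly better.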
If a review of the components of the system shows that any single component falls below the target MTTFD, then I would consider replacing the system with a higher category design. Since most of these components are unlikely to have MTTFD values on the spec sheet, you will likely have to convert from total life values (B10). This is outside the scope of this article, but you can find guidance in [6, Annex C]. More frequent testing, i.e., more than once in 20 years, is always acceptable.
Where manual testing is required as part of the design for any system category, particularly in Category 1 or 2 systems, the control system should alert the user to the requirement and not permit the machine to operate until the test is completed. This will help to ensure that the requisite tests are properly completed.
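The idea of prompting for the test and blocking operation until it is done might look something like the sketch below. It is an illustration of the logic only: the class and method names are hypothetical, and on a real machine this would be implemented in the safety-related control system rather than in a script.

```python
from datetime import date, timedelta
from typing import Optional


class ManualTestMonitor:
    """Track a required manual test and decide whether running is permitted."""

    def __init__(self, interval: timedelta) -> None:
        self.interval = interval
        self.last_passed: Optional[date] = None  # no passed test on record yet

    def record_test(self, when: date, passed: bool) -> None:
        # A failed test clears the record so the machine stays locked out.
        self.last_passed = when if passed else None

    def test_due(self, today: date) -> bool:
        if self.last_passed is None:
            return True
        return today - self.last_passed > self.interval

    def run_permitted(self, today: date) -> bool:
        return not self.test_due(today)


if __name__ == "__main__":
    monitor = ManualTestMonitor(timedelta(days=365))
    print(monitor.run_permitted(date(2024, 1, 1)))  # no test on record
    monitor.record_test(date(2024, 1, 1), passed=True)
    print(monitor.run_permitted(date(2024, 6, 1)))  # within the interval
```

Treating a missing or failed test the same way as an overdue one keeps the behaviour on the safe side: the machine only runs when a passed test is on record and still current.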
Need more information? Leave a comment below, or email me the details of your application!
Definitions
- 3.1.9 [10]
functional safety - part of the overall safety relating to the EUC and the EUC control system which depends on the correct functioning of the E/E/PE safety-related systems, other technology safety-related systems and external risk reduction facilities
- 3.2.6 [10]
electrical/electronic/programmable electronic (E/E/PE) - based on electrical (E) and/or electronic (E) and/or programmable electronic (PE) technology
NOTE – The term is intended to cover any and all devices or systems operating on electrical principles.
EXAMPLE Electrical/electronic/programmable electronic devices include
- electromechanical devices (electrical);
- solid-state non-programmable electronic devices (electronic);
- electronic devices based on computer technology (programmable electronic); see 3.2.5
- 3.5.1 [10]
safety function - function to be implemented by an E/E/PE safety-related system, other technology safety-related system or external risk reduction facilities, which is intended to achieve or maintain a safe state for the EUC, in respect of a specific hazardous event (see 3.4.1)
- 3.5.2 [10]
safety integrity - probability of a safety-related system satisfactorily performing the required safety functions under all the stated conditions within a stated period of time
NOTE 1 – The higher the level of safety integrity of the safety-related systems, the lower the probability that the safety-related systems will fail to carry out the required safety functions.
NOTE 2 – There are four levels of safety integrity for systems (see 3.5.6).
- 3.5.6 [10]
safety integrity level (SIL) - discrete level (one out of a possible four) for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest
NOTE – The target failure measures (see 3.5.13) for the four safety integrity levels are specified in tables 2 and 3 of IEC 61508-1.
- 3.6.3 [10]
fault tolerance - ability of a functional unit to continue to perform a required function in the presence of faults or errors
NOTE – The definition in IEV 191-15-05 refers only to sub-item faults. See the note for the term fault in 3.6.1.
[ISO/IEC 2382-14-04-061]
- 3.1.1 [6]
safety related part of a control system (SRP/CS) - part of a control system that responds to safety-related input signals and generates safety-related output signals
- NOTE 1 The combined safety-related parts of a control system start at the point where the safety-related input signals are initiated (including, for example, the actuating cam and the roller of the position switch) and end at the output of the power control elements (including, for example, the main contacts of a contactor).
- NOTE 2 If monitoring systems are used for diagnostics, they are also considered as SRP/CS.
- 3.1.2 [6]
category - classification of the safety-related parts of a control system in respect of their resistance to faults and their subsequent behaviour in the fault condition, and which is achieved by the structural arrangement of the parts, fault detection and/or by their reliability
- 3.1.3 [6]
fault - state of an item characterized by the inability to perform a required function, excluding the inability during preventive maintenance or other planned actions, or due to lack of external resources
- NOTE 1 A fault is often the result of a failure of the item itself, but may exist without prior failure.
[IEC 60050-191:1990, 05-01]
- NOTE 2 In this part of ISO 13849, “fault” means random fault.
- 3.1.4 [6]
failure - termination of the ability of an item to perform a required function
- NOTE 1 After a failure, the item has a fault.
- NOTE 2 “Failure” is an event, as distinguished from “fault,” which is a state.
- NOTE 3 The concept as defined does not apply to items consisting of software only.
[IEC 60050-191:1990, 04-01]
- NOTE 4 Failures which only affect the availability of the process under control are outside of the scope of this part of ISO 13849.
- 3.1.5 [6]
dangerous failure - failure which has the potential to put the SRP/CS in a hazardous or fail-to-function state
- NOTE 1 Whether or not the potential is realized can depend on the channel architecture of the system; in redundant systems, a dangerous hardware failure is less likely to lead to the overall dangerous or fail-to-function state.
- NOTE 2 Adapted from IEC 61508-4:1998, definition 3.6.7.
- 3.1.20 [6]
safety function - function of the machine whose failure can result in an immediate increase of the risk(s)
[ISO 12100-1:2003, 3.28]
- 3.1.21 [6]
monitoring - safety function which ensures that a protective measure is initiated if the ability of a component or an element to perform its function is diminished or if the process conditions are changed in such a way that a decrease of the amount of risk reduction is generated
- 3.1.22 [6]
programmable electronic system (PES) - system for control, protection or monitoring dependent for its operation on one or more programmable electronic devices, including all elements of the system such as power supplies, sensors and other input devices, contactors and other output devices
- NOTE Adapted from IEC 61508-4:1998, definition 3.3.2.
- 3.1.23 [6]
performance level (PL) - discrete level used to specify the ability of safety-related parts of control systems to perform a safety function under foreseeable conditions
- NOTE See 4.5.1.
- 3.1.25 [6]
mean time to dangerous failure (MTTFd) - expectation of the mean time to dangerous failure
- NOTE Adapted from IEC 62061:2005, definition 3.2.34.
- 3.1.26 [6]
diagnostic coverage (DC) - measure of the effectiveness of diagnostics, which may be determined as the ratio between the failure rate of detected dangerous failures and the failure rate of total dangerous failures
- NOTE 1 Diagnostic coverage can exist for the whole or parts of a safety-related system. For example, diagnostic coverage could exist for sensors and/or logic system and/or final elements.
- NOTE 2 Adapted from IEC 61508-4:1998, definition 3.8.6.
- 3.1.33 [6]
safety integrity level (SIL) - discrete level (one out of a possible four) for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest
[IEC 61508-4:1998, 3.5.6]
Acknowledgements
Thanks to my colleagues Derek Jones and Jonathan Johnson, both from Rockwell Automation and members of ISO TC199. Their suggestion to reference ISO 14119 clause 8.2 was the seed for this article.
I’d also like to acknowledge Ronald Sykes, Howard Touski, Mirela Moga, Michael Roland, and Grant Rider for asking the questions that led to this article.
References
[1] Safety of machinery — General principles for design — Risk assessment and risk reduction, ISO 12100. International Organization for Standardization (ISO). Geneva 2010.
[2] Safeguarding of Machinery, CSA Z432. Canadian Standards Association. Toronto. 2004.
[3] Safety of machinery — Emergency stop — Principles for design, ISO 13850. International Organization for Standardization (ISO). Geneva 2006.
[4] Electrical Standard for Industrial Machinery, NFPA 79. National Fire Protection Association (NFPA). Batterymarch Park. 2015.
[5] Safety of machinery — Electrical equipment of machines — Part 1: General requirements, IEC 60204-1. International Electrotechnical Commission (IEC). Geneva. 2009.
[6] Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design. ISO 13849-1. International Organization for Standardization (ISO). Geneva. 2006.
[7] Safety of machinery — Risk assessment — Part 2: Practical guidance and examples of methods, ISO/TR 14121-2. International Organization for Standardization (ISO). Geneva. 2012.
[8] Safety of machinery — Functional safety of safety-related electrical, electronic and programmable electronic control systems, IEC 62061. International Electrotechnical Commission (IEC). Geneva. 2005.
[9] D. J. Wilkins (2002, November). “The Bathtub Curve and Product Failure Behavior. Part One – The Bathtub Curve, Infant Mortality and Burn-in”. Reliability Hotline [Online]. Available: http://www.weibull.com/hotwire/issue21/hottopics21.htm. [Accessed: 26-Apr-2015].
[10] Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations. IEC 61508-4. International Electrotechnical Commission (IEC). Geneva. 1998.
[11] Safety of machinery — Interlocking devices associated with guards — Principles for design and selection, ISO 14119. International Organization for Standardization (ISO). Geneva. 2013.
Sources for Standards
CANADA
Canadian Standards Association sells CSA, ISO and IEC standards to the Canadian Market.
USA
ANSI offers standards from most US Standards Development Organizations. They also sell ISO and IEC standards in the US market.
International
International Organization for Standardization (ISO).
International Electrotechnical Commission (IEC).
Europe
Each EU member state has its own standards body. For unknown reasons, each standards body can set its pricing for the documents they sell. All offer English language copies and copies in the member state’s official language(s). My best advice is to shop around a bit. Prices can vary by as much as 10:1.
British Standards Institute (BSi) $$$
German standards (DIN) – Beuth Verlag GmbH $$
© 2015 – 2022, Compliance inSight Consulting Inc.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
It’s good to know that emergency stop systems are safeguards for emergency situations. My uncle is thinking about hiring a consultant to help the reliability of some of his machines to prevent any accidents. I’ll have to send him this as he looks for a consultant that has a lot of experience with his machines.
Hi Taylor, I’m glad my articles are helpful to you. Emergency stops are complementary protective measures, not safeguards. This is an important, if fine, point. Safeguards function automatically, without the need for any intentional human action. Complementary protective measures, such as emergency stops, are backup systems that require an intentional human action to activate. They are only there for use when the primary safeguards fail. This is why an emergency stop cannot be the only risk reduction measure used on a machine. There must be some other primary safeguarding.
Good point out there!