Testing Emergency Stop Systems

I’ve had a number of questions from readers about testing emergency stop systems, and particularly about how often they should be tested. I addressed the types of tests that might be needed in another article, Checking Emergency Stop Systems. This article focuses on the frequency of testing rather than the types of tests.

The Problem

Emergency stop systems are considered to be “complementary protective measures” in key machinery safety standards like ISO 12100 [1] and CSA Z432 [2]; this makes emergency stop systems the backup to the primary safeguards. Complementary protective measures are intended to permit “avoiding or limiting the harm” that may result from an emergent situation. By definition, this is a situation that has not been foreseen by the machine builder, or is the result of another failure. This could be a failure of another safeguarding system, or a failure in the machine that is not controlled by other means, e.g., a workpiece shatters due to a material flaw, and the broken pieces damage the machine, creating new, uncontrolled failure conditions in the machine.

Emergency stop systems are manually triggered, and usually infrequently used. The lack of use means that functional testing of the system doesn’t happen in the normal course of operation of the machinery. Some types of faults, e.g., contact blocks falling off the back of the operator device, may occur and remain undetected until the system is actually used. Failure at that point may be catastrophic, since by implication the primary safeguards have already failed, and thus the failure of the backup eliminates the possibility of avoiding or limiting harm.

To understand the testing requirements, it’s important to understand the risk and reliability requirements that drive the design of emergency stop systems, and then get into the test frequency question.


In the past, there were no explicit requirements for emergency stop system reliability. Details like the colour of the operator device, or the way the stop function worked, were defined in ISO 13850 [3], NFPA 79 [4], and IEC 60204-1 [5]. In the soon-to-be-published 3rd edition of ISO 13850, a new provision requiring emergency stop systems to meet at least PLc will be added [6], but until publication, it is up to the designer to determine the required safety integrity level, either PL or SIL.

To determine the requirements for any safety function, the key is to start at the risk assessment. The risk assessment process requires that the designer understand the stage in the life cycle of the machine, the task(s) that will be done, and the specific hazards that a worker may be exposed to while conducting the task. This can become quite complex when considering maintenance and service tasks, and also applies to foreseeable failure modes of the machinery or the process. The scoring or ranking of risk can be accomplished using any suitable risk scoring tool that meets the minimum requirements in [1]. There are some good examples given in ISO/TR 14121-2 [7] if you are looking for some guidance. There are many good engineering textbooks available as well. Have a look at our Book List for some suggestions if you want a deeper dive.


Once the initial unmitigated risk is understood, risk control measures can be specified. Wherever the control system is used as part of the risk control measure, a safety function must be specified. Specification of the safety function includes the Performance Level (PL), architectural category (B, 1-4), Mean Time to Dangerous Failure (MTTFd), and Diagnostic Coverage (DC) [6], or Safety Integrity Level (SIL), and Hardware Fault Tolerance (HFT), as described in IEC 62061 [8], as a minimum. If you are unfamiliar with these terms, see the definitions at the end of the article.

Referring to Figure 1, the “Risk Graph” [6, Annex A], we can reasonably state that for most machinery, a failure mode or emergent condition is likely to create conditions where the severity of injury is likely to require more than basic first aid, so selecting “S2” is the first step. In these situations, and particularly where the failure modes are not well understood, the highest level of severity of injury, S2, is selected because we don’t have enough information to expect that the injuries would only be minor. As soon as we make this selection, it is no longer possible to select any combination of Frequency or Probability parameters that will result in anything lower than PLc.

It’s important to understand that Figure 1 is not a risk assessment tool, but rather a decision tree used to select an appropriate PL based on the relevant risk parameters. Those parameters are:

Table 1 – Risk Parameters

Severity of injury | Frequency and/or exposure to hazard | Possibility of avoiding hazard or limiting harm
S1 – slight (normally reversible injury) | F1 – seldom-to-less-often and/or exposure time is short | P1 – possible under specific conditions
S2 – serious (normally irreversible injury or death) | F2 – frequent-to-continuous and/or exposure time is long | P2 – scarcely possible

Decision tree used to determine PL based on risk parameters.
Figure 1 – “Risk Graph” for determining PL
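Since the risk graph is only a decision tree, it can be written down as a small lookup table. The sketch below is my own transcription of the graph as it is usually drawn in [6, Annex A]; the names `RISK_GRAPH`, `required_pl` and `lowest_pl_for_severity` are made up for this illustration, so check the mapping against your copy of the standard before relying on it.

```python
# Required performance level (PLr) lookup, transcribed from the usual form of
# the risk graph. Keys are (severity, frequency, possibility) combinations.
# ASSUMPTION: this mapping is a transcription for illustration only.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}


def required_pl(severity, frequency, possibility):
    """Return the required PL letter for one combination of risk parameters."""
    key = (severity, frequency, possibility)
    if key not in RISK_GRAPH:
        raise ValueError(f"unknown risk parameters: {key}")
    return RISK_GRAPH[key]


def lowest_pl_for_severity(severity):
    """Return the lowest PL that can be reached once severity is fixed."""
    return min(pl for (s, _, _), pl in RISK_GRAPH.items() if s == severity)


if __name__ == "__main__":
    print(required_pl("S2", "F1", "P1"))
    print(lowest_pl_for_severity("S2"))
```

The second function shows the point made above: once S2 is selected, no choice of F or P gives a result below PLc.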

PLc can be accomplished using any of three architectures: Category 1, 2, or 3. If you are unsure about what these architectures represent, have a look at my series covering this topic.

Category 1 is single channel, and does not include any diagnostics. A single fault can cause the loss of the safety function (i.e., the machine still runs even though the e-stop button is pressed). Using Category 1, the reliability of the design is based on the use of highly reliable components and well-tried safety principles. This approach can fail to danger.

Category 2 adds some diagnostic capability to the basic single channel configuration, and does not require the use of “well-tried” components. This approach can also fail to danger.

Category 3 architecture adds a redundant channel, and includes diagnostic coverage. Category 3 is not subject to failure due to single faults and is called “single-fault tolerant”. This approach is less likely to fail to danger, but still can in the presence of multiple, undetected, faults.

A key concept in reliability is the “fault”. This can be any kind of defect in hardware or software that results in unwanted behaviour or a failure. Faults are further broken down into dangerous and safe faults, meaning those that result in a dangerous outcome, and those that do not. Finally, each of these classes is broken down into detectable and undetectable faults. I’m not going to get into the mathematical treatment of these classes, but my point is this: there are undetectable dangerous faults. These are faults that cannot be detected by built-in diagnostics. As designers, we try to design the control system so that the undetectable dangerous faults are extremely rare; ideally, they should occur much less than once in the lifetime of the machine.

What is the lifetime of the machine? The standards writers have settled on a default lifetime of 20 years, thus the answer is that undetectable dangerous failures should happen much less than once in twenty years of 24/7/365 operation. So why does this matter? Each architectural category has different requirements for testing. The test rates are driven by the “Demand Rate”. The Demand Rate is defined in [6]. “SRP/CS” stands for “Safety Related Part of the Control System” in the definition:

demand rate (rd) – frequency of demands for a safety-related action of the SRP/CS

Each time the emergency stop button is pressed, a “demand” is put on the system. Looking at the “Simplified Procedure for estimating PL”, [6, 4.5.4], we find that the standard makes the following assumptions:

  • mission time, 20 years (see Clause 10);
  • constant failure rates within the mission time;
  • for category 2, demand rate <= 1/100 test rate;
  • for category 2, MTTFd,TE larger than half of MTTFd,L.

NOTE When blocks of each channel cannot be separated, the following can be applied: MTTFd of the summarized test channel (TE, OTE) larger than half MTTFd of the summarized functional channel (I, L, O).

So what does all that mean? The 20-year mission time is the assumed lifetime of the machinery. This number underpins the rest of the calculations in the standard, and is based on the idea that few modern control systems last longer than 20 years without being replaced or rebuilt. The constant failure rate points at the idea that systems used in the field will have components and controls that are not subject to infant mortality, nor are they old enough to start to fail due to age, but rather that the system is operating in the flat portion of the standardized failure rate “bathtub curve”, [9]. See Figure 2. Components that are subject to infant mortality failed at the factory and were removed from the supply chain. Those failing from “wear-out” are expected to reach that point after 20 years. If this is not the case, then the maintenance instructions for the system should include preventative maintenance tasks that require replacing critical components before they reach the predicted MTTFd.
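To make the constant failure rate assumption a bit more concrete, here is a minimal sketch for a single channel with no redundancy and no diagnostics. It assumes the textbook exponential model that goes with a constant failure rate (failure rate = 1 / MTTFd); the function name and the example MTTFd values are mine, not taken from [6].

```python
import math

MISSION_TIME_YEARS = 20.0  # default mission time assumed in the standard


def probability_of_dangerous_failure(mttfd_years, time_years=MISSION_TIME_YEARS):
    """Probability of at least one dangerous failure by time_years.

    ASSUMPTION: constant failure rate lambda_d = 1 / MTTFd, so the chance of
    surviving to time t is exp(-lambda_d * t).
    """
    if mttfd_years <= 0:
        raise ValueError("MTTFd must be positive")
    failure_rate = 1.0 / mttfd_years
    return 1.0 - math.exp(-failure_rate * time_years)


if __name__ == "__main__":
    for mttfd in (30.0, 100.0):
        p = probability_of_dangerous_failure(mttfd)
        print(f"MTTFd = {mttfd:.0f} years: P(dangerous failure in 20 years) = {p:.3f}")
```

The point of the sketch is only that MTTFd is a statistical figure under this model, not a guaranteed service life.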

Diagram of a standardized bathtub-shaped failure rate curve.
Figure 2 – Weibull Bathtub Curve [9]

For systems using Category 2 architecture, the automatic diagnostic test rate must be at least 100x the demand rate. Keep in mind that this test rate is normally accomplished automatically in the design of the controls, and is only related to the detectable safe or dangerous faults. Undetectable faults should occur less than once in 20 years, and should be detected by the “proof test”. More on that a bit later.

Finally, the MTTFd of the test equipment (the diagnostic channel) must be more than half the MTTFd of the functional channel.
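The two Category 2 assumptions from the list above are simple inequalities, so they are easy to check for a given design. This is a rough sketch only; the function names, the per-year units and the example numbers (one demand per week) are assumptions I have made for illustration.

```python
def minimum_test_rate(demand_rate_per_year):
    """Smallest automatic test rate that keeps demand rate <= 1/100 test rate."""
    return 100.0 * demand_rate_per_year


def category2_assumptions_met(demand_rate_per_year, test_rate_per_year,
                              mttfd_test_years, mttfd_logic_years):
    """Check both Category 2 assumptions of the simplified procedure.

    1. demand rate <= 1/100 of the test rate
    2. MTTFd of the test equipment (TE) > half the MTTFd of the logic (L)
    """
    rate_ok = demand_rate_per_year <= test_rate_per_year / 100.0
    mttfd_ok = mttfd_test_years > mttfd_logic_years / 2.0
    return rate_ok and mttfd_ok


if __name__ == "__main__":
    # Hypothetical example: the e-stop is pressed about once a week.
    demands_per_year = 52
    print(minimum_test_rate(demands_per_year))
    print(category2_assumptions_met(demands_per_year, 5200, 20.0, 30.0))
```

With one demand per week, the automatic diagnostics would need to run at least 5200 times per year, which is why this test has to be built into the control system rather than done by hand.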

Category 1 has no diagnostics, so there is no guidance in [6] to help us out with these systems. Category 3 is single fault tolerant, so as long as we don’t have multiple undetected faults we can count on the system to function and to alert us when a single fault occurs; remember that the automatic tests may not be able to detect every fault. This is where the “proof test” comes in. What is a proof test? To find a definition for proof test, we have to look at IEC 61508-4 [10]:

proof test
periodic test performed to detect failures in a safety-related system so that, if necessary, the system can be restored to an “as new” condition or as close as practical to this condition

NOTE – The effectiveness of the proof test will be dependent upon how close to the “as new” condition the system is restored. For the proof test to be fully effective, it will be necessary to detect 100 % of all dangerous failures. Although in practice 100 % is not easily achieved for other than low-complexity E/E/PE safety-related systems, this should be the target. As a minimum, all the safety functions which are executed are checked according to the E/E/PES safety requirements specification. If separate channels are used, these tests are done for each channel separately.

The 20-year life cycle assumption used in the standards also applies to proof testing. Machine controls are assumed to get at least one proof test in their lifetime. The proof test should be designed to detect faults that the automatic diagnostics cannot detect. Proof tests are also conducted after major rebuilds and repairs to ensure that the system operates correctly.

If you know the architecture of the emergency stop control system, you can determine the test rate based on the demand rate. It would be considerably easier if the standards just gave us some minimum test rates for the various architectures. One standard, ISO 14119 [11] on interlocks, does just that. Admittedly, this standard does not include emergency stop functions within its scope, as its focus is on interlocks, but since interlocking systems are more critical than the complementary protective measures that back them up, it would be reasonable to apply these same rules. Looking at the clause on Assessment of Faults, [11, 8.2], we find this guidance:

For applications using interlocking devices with automatic monitoring to achieve the necessary diagnostic coverage for the required safety performance, a functional test (see IEC 60204-1:2005) can be carried out every time the device changes its state, e.g. at every access. If, in such a case, there is only infrequent access, the interlocking device shall be used with additional measures, because between consecutive functional tests the probability of occurrence of an undetected fault is increased.

When a manual functional test is necessary to detect a possible accumulation of faults, it shall be made within the following test intervals:

  • at least every month for PL e with Category 3 or Category 4 (according to ISO 13849-1) or SIL 3 with HFT (hardware fault tolerance) = 1 (according to IEC 62061);
  • at least every 12 months for PL d with Category 3 (according to ISO 13849-1) or SIL 2 with HFT (hardware fault tolerance) = 1 (according to IEC 62061).

NOTE It is recommended that the control system of a machine demands these tests at the required intervals e.g. by visual display unit or signal lamp. The control system should monitor the tests and stop the machine if the test is omitted or fails.

In the preceding, HFT=1 is equivalent to saying that the system is single-fault tolerant.
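The two intervals quoted above fit in a very small lookup. A sketch, assuming PL and Category are used as the keys; the table and function names are hypothetical, and the SIL/HFT equivalents are left out to keep it short.

```python
# Longest manual functional test interval, in months, from the quoted
# guidance. Keys are (performance level, category).
MAX_MANUAL_TEST_INTERVAL_MONTHS = {
    ("e", 3): 1,
    ("e", 4): 1,
    ("d", 3): 12,
}


def max_manual_test_interval_months(pl, category):
    """Return the longest allowed interval in months.

    Returns None when the quoted guidance gives no figure for the
    combination, for example PLc with Category 1.
    """
    return MAX_MANUAL_TEST_INTERVAL_MONTHS.get((pl.lower(), category))


if __name__ == "__main__":
    print(max_manual_test_interval_months("e", 4))
    print(max_manual_test_interval_months("d", 3))
    print(max_manual_test_interval_months("c", 1))
```

A result of `None` means the quoted text is silent on that combination, not that no testing is needed.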

This leaves us then with recommended test frequencies for Category 2 and 3 architectures in PLc, PLd, and PLe, or for SIL 2 and 3 with HFT=1. We still don’t have a test frequency for PLc, Category 1 systems. There is no explicit guidance for these systems in the standards. How can we determine a test rate for these systems?

My approach would be to start by examining the MTTFd values for all of the subsystems and components. [6] requires that the system have a HIGH MTTFd value, meaning 30 years <= MTTFd <= 100 years [6, Table 5]. If this is the case, then the once-in-20-years proof test is theoretically enough. If the system is constructed, for example, as shown in Figure 3 below, then each of the four components would have to have an MTTFd of at least 120 years for the channel to reach 30 years. See [6, Annex D] for this calculation.

Figure 3 – Basic Stop/Start Circuit

PB1 – Emergency Stop Button

PB2 – Power “ON” Button

MCR – Master Control Relay

MOV – Surge Suppressor on MCR Coil

M1 – Machine prime-mover (motor)

Note that the fuses are not included, since they can only fail to safety, and assuming that they were specified correctly in the original design, are not subject to the same cyclical aging effects as the other components.

M1 is not included, since it is the controlled portion of the machine and is not part of the control system.
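The 120-year figure comes from adding up the failure rates of the components in the channel. Here is a minimal sketch of that parts count style of estimate, assuming the four components listed above (PB1, PB2, MCR, MOV) each count once and all have the same MTTFd. The function names and the component values are hypothetical, not data from any manufacturer.

```python
def channel_mttfd(component_mttfd_years):
    """Estimate channel MTTFd from its components: 1/MTTFd = sum(1/MTTFd_i)."""
    values = list(component_mttfd_years)
    if not values:
        raise ValueError("at least one component is required")
    return 1.0 / sum(1.0 / m for m in values)


def required_component_mttfd(target_channel_mttfd_years, n_components):
    """MTTFd each of n identical components needs to reach the channel target."""
    return target_channel_mttfd_years * n_components


if __name__ == "__main__":
    # ASSUMPTION: illustrative values only.
    components = {"PB1": 120.0, "PB2": 120.0, "MCR": 120.0, "MOV": 120.0}
    print(required_component_mttfd(30.0, len(components)))
    print(channel_mttfd(components.values()))
```

If even one component is well below the others, the channel value drops quickly, which is why it is worth checking each component separately.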

If a review of the components in the system shows that any single component falls below the target MTTFd, then I would consider replacing the system with a higher category design. Since most of these components will be unlikely to have MTTFd values on the spec sheet, you will likely have to convert from total life values (B10). This is outside the scope of this article, but you can find guidance in [6, Annex C]. More frequent testing, i.e., more than once in 20 years, is always acceptable.

Where manual testing is required as part of the design for any category of system, and particularly in Category 1 or 2 systems, the control system should alert the user to the requirement and not permit the machine to operate until the test is completed. This will help to ensure that the requisite tests are properly completed.
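The logic behind such a test reminder is simple. The sketch below only illustrates the decision; on a real machine this would be implemented in the machine's control system, not in a Python script. The class name, method names and the 30-day interval are all hypothetical.

```python
from datetime import datetime, timedelta


class ManualTestInterlock:
    """Blocks machine start until a manual e-stop test has been passed."""

    def __init__(self, interval, last_passed=None):
        self.interval = interval        # maximum time allowed between tests
        self.last_passed = last_passed  # time of the last passed test, if any

    def record_test(self, passed, when):
        """Store the test result; a failed test clears the last pass."""
        self.last_passed = when if passed else None

    def test_due(self, now):
        """True if no test has passed yet or the interval has elapsed."""
        if self.last_passed is None:
            return True
        return now - self.last_passed >= self.interval

    def run_permitted(self, now):
        """The machine may run only while a valid test is on record."""
        return not self.test_due(now)


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    interlock = ManualTestInterlock(timedelta(days=30))
    print(interlock.run_permitted(start))                       # no test yet
    interlock.record_test(True, start)
    print(interlock.run_permitted(start + timedelta(days=10)))  # within interval
    print(interlock.run_permitted(start + timedelta(days=30)))  # test is due
```

Treating a failed test the same as a missing one keeps the machine stopped until the fault is repaired and the test is repeated.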



Definitions

3.1.9 [10]
functional safety

part of the overall safety relating to the EUC and the EUC control system which depends on the correct functioning of the E/E/PE safety-related systems, other technology safety-related systems and external risk reduction facilities

3.2.6 [10]
electrical/electronic/programmable electronic (E/E/PE)
based on electrical (E) and/or electronic (E) and/or programmable electronic (PE) technology

NOTE – The term is intended to cover any and all devices or systems operating on electrical principles.
EXAMPLE Electrical/electronic/programmable electronic devices include

  • electromechanical devices (electrical);
  • solid-state non-programmable electronic devices (electronic);
  • electronic devices based on computer technology (programmable electronic); see 3.2.5

3.5.1 [10]
safety function
function to be implemented by an E/E/PE safety-related system, other technology safety related system or external risk reduction facilities, which is intended to achieve or maintain a safe state for the EUC, in respect of a specific hazardous event (see 3.4.1)

3.5.2 [10]
safety integrity
probability of a safety-related system satisfactorily performing the required safety functions under all the stated conditions within a stated period of time

NOTE 1 – The higher the level of safety integrity of the safety-related systems, the lower the probability that the safety-related systems will fail to carry out the required safety functions.
NOTE 2 – There are four levels of safety integrity for systems (see 3.5.6).

3.5.6 [10]
safety integrity level (SIL)
discrete level (one out of a possible four) for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest

NOTE – The target failure measures (see 3.5.13) for the four safety integrity levels are specified in tables 2 and 3 of IEC 61508-1.

3.6.3 [10]
fault tolerance
ability of a functional unit to continue to perform a required function in the presence of faults or errors

NOTE – The definition in IEV 191-15-05 refers only to sub-item faults. See the note for the term fault in 3.6.1.
[ISO/IEC 2382-14-04-061]

3.1.1 [6]
safety-related part of a control system (SRP/CS)
part of a control system that responds to safety-related input signals and generates safety-related output signals

NOTE 1 The combined safety-related parts of a control system start at the point where the safety-related input signals are initiated (including, for example, the actuating cam and the roller of the position switch) and end at the output of the power control elements (including, for example, the main contacts of a contactor).
NOTE 2 If monitoring systems are used for diagnostics, they are also considered as SRP/CS.

3.1.2 [6]
category
classification of the safety-related parts of a control system in respect of their resistance to faults and their subsequent behaviour in the fault condition, and which is achieved by the structural arrangement of the parts, fault detection and/or by their reliability

3.1.3 [6]
fault
state of an item characterized by the inability to perform a required function, excluding the inability during preventive maintenance or other planned actions, or due to lack of external resources

NOTE 1 A fault is often the result of a failure of the item itself, but may exist without prior failure.
[IEC 60050-191:1990, 05-01]
NOTE 2 In this part of ISO 13849, “fault” means random fault.

3.1.4 [6]
failure
termination of the ability of an item to perform a required function

NOTE 1 After a failure, the item has a fault.
NOTE 2 “Failure” is an event, as distinguished from “fault”, which is a state.
NOTE 3 The concept as defined does not apply to items consisting of software only.
[IEC 60050–191:1990, 04-01]
NOTE 4 Failures which only affect the availability of the process under control are outside of the scope of this part of ISO 13849.

3.1.5 [6]
dangerous failure
failure which has the potential to put the SRP/CS in a hazardous or fail-to-function state

NOTE 1 Whether or not the potential is realized can depend on the channel architecture of the system; in redundant systems a dangerous hardware failure is less likely to lead to the overall dangerous or fail-to-function state.
NOTE 2 Adapted from IEC 61508-4:1998, definition 3.6.7.

3.1.20 [6]
safety function
function of the machine whose failure can result in an immediate increase of the risk(s)
[ISO 12100-1:2003, 3.28]

3.1.21 [6]
monitoring
safety function which ensures that a protective measure is initiated if the ability of a component or an element to perform its function is diminished or if the process conditions are changed in such a way that a decrease of the amount of risk reduction is generated

3.1.22 [6]
programmable electronic system (PES)
system for control, protection or monitoring dependent for its operation on one or more programmable electronic devices, including all elements of the system such as power supplies, sensors and other input devices, contactors and other output devices

NOTE Adapted from IEC 61508-4:1998, definition 3.3.2.

3.1.23 [6]
performance level (PL)
discrete level used to specify the ability of safety-related parts of control systems to perform a safety function under foreseeable conditions

NOTE See 4.5.1.

3.1.25 [6]
mean time to dangerous failure (MTTFd)
expectation of the mean time to dangerous failure

NOTE Adapted from IEC 62061:2005, definition 3.2.34.

3.1.26 [6]
diagnostic coverage (DC)
measure of the effectiveness of diagnostics, which may be determined as the ratio between the failure rate of detected dangerous failures and the failure rate of total dangerous failures

NOTE 1 Diagnostic coverage can exist for the whole or parts of a safety-related system. For example, diagnostic coverage could exist for sensors and/or logic system and/or final elements.
NOTE 2 Adapted from IEC 61508-4:1998, definition 3.8.6.

3.1.33 [6]
safety integrity level (SIL)
discrete level (one out of a possible four) for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest

[IEC 61508-4:1998, 3.5.6]


Thanks to my colleagues Derek Jones and Jonathan Johnson, both from Rockwell Automation, and members of ISO TC199. Their suggestion to reference ISO 14119 clause 8.2 was the seed for this article.

I’d also like to acknowledge Ronald Sykes, Howard Touski, Mirela Moga, Michael Roland, and Grant Rider for asking the questions that led to this article.


[1] Safety of machinery – General principles for design – Risk assessment and risk reduction. ISO 12100. International Organization for Standardization (ISO). Geneva. 2010.

[2] Safeguarding of Machinery. CSA Z432. Canadian Standards Association. Toronto. 2004.

[3] Safety of machinery – Emergency stop – Principles for design. ISO 13850. International Organization for Standardization (ISO). Geneva. 2006.

[4] Electrical Standard for Industrial Machinery. NFPA 79. National Fire Protection Association (NFPA). Batterymarch Park. 2015.

[5] Safety of machinery – Electrical equipment of machines – Part 1: General requirements. IEC 60204-1. International Electrotechnical Commission (IEC). Geneva. 2009.

[6] Safety of machinery – Safety-related parts of control systems – Part 1: General principles for design. ISO 13849-1. International Organization for Standardization (ISO). Geneva. 2006.

[7] Safety of machinery – Risk assessment – Part 2: Practical guidance and examples of methods. ISO/TR 14121-2. International Organization for Standardization (ISO). Geneva. 2012.

[8] Safety of machinery – Functional safety of safety-related electrical, electronic and programmable electronic control systems. IEC 62061. International Electrotechnical Commission (IEC). Geneva. 2005.

[9] D. J. Wilkins (2002, November). “The Bathtub Curve and Product Failure Behavior. Part One – The Bathtub Curve, Infant Mortality and Burn-in”. Reliability Hotline [Online]. Available: http://www.weibull.com/hotwire/issue21/hottopics21.htm. [Accessed: 26-Apr-2015].

[10] Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations. IEC 61508-4. International Electrotechnical Commission (IEC). Geneva. 1998.

[11] Safety of machinery – Interlocking devices associated with guards – Principles for design and selection. ISO 14119. International Organization for Standardization (ISO). Geneva. 2013.

Sources for Standards


Canadian Standards Association sells CSA, ISO and IEC standards to the Canadian Market.


NSSN: National Standards Search Engine powered by ANSI offers standards from most US Standards Development Organizations. They also sell ISO and IEC standards into the US market.


International Organization for Standardization (ISO).

International Electrotechnical Commission (IEC).