ISO 13849-1 Analysis — Part 6: CCF — Common Cause Failures

This entry is part 6 of 6 in the series How to do a 13849-1 analysis

What is a Common Cause Failure?

There are two similar-sounding terms that people often get confused: Common Cause Failure (CCF) and Common Mode Failure. While these two types of failures sound similar, they are different. A Common Cause Failure is a failure in a system where two or more portions of the system fail at the same time from a single common cause. An example could be a lightning strike that causes a contactor to weld and simultaneously takes out the safety relay processor that controls the contactor. Common cause failures are therefore two different manners of failure in two different components, but with a single cause.

Common Mode Failure is where two components or portions of a system fail in the same way, at the same time. For example, two interposing relays both fail with welded contacts at the same time. The failures could stem from a single cause or from different causes, but the way the components fail is the same.

Common-cause failure includes common mode failure, since a common cause can result in a common manner of failure in identical devices used in a system.

Here are the formal definitions of these terms:

3.1.6 common cause failure CCF

failures of different items, resulting from a single event, where these failures are not consequences of each other

Note 1 to entry: Common cause failures should not be confused with common mode failures (see ISO 12100:2010, 3.36). [SOURCE: IEC 60050-191-am1:1999, 04-23.] [1]

3.36 common mode failures

failures of items characterized by the same fault mode

NOTE Common mode failures should not be confused with common cause failures, as the common mode failures can result from different causes. [IEV 191-04-24] [3]

The “common mode” failure definition uses the phrase “fault mode”, so let’s look at that as well:

failure mode
DEPRECATED: fault mode
manner in which failure occurs

Note 1 to entry: A failure mode may be defined by the function lost or other state transition that occurred. [IEV 192-03-17] [17]

As you can see, “fault mode” is no longer used, in favour of the more common “failure mode”, so it is possible to re-write the common-mode failure definition to read, “failures of items characterised by the same manner of failure.”

Random, Systematic and Common Cause Failures

Why do we need to care about this? There are three manners in which failures occur: random failures, systematic failures, and common cause failures. When developing safety related controls, we need to consider all three and mitigate them as much as possible.

Random failures do not follow any pattern, occurring randomly over time, and are often brought on by over-stressing the component, or from manufacturing flaws. Random failures can increase due to environmental or process-related stresses, like corrosion, EMI, normal wear-and-tear, or other over-stressing of the component or subsystem. Random failures are often mitigated through selection of high-reliability components [18].

Systematic failures include common-cause failures, and occur because some human behaviour occurred that was not caught by procedural means. These failures are due to design, specification, operating, maintenance, and installation errors. When we look at systematic errors, we are looking for things like training of the system designers, or quality assurance procedures used to validate the way the system operates. Systematic failures are non-random and complex, making them difficult to analyse statistically. Systematic errors are a significant source of common-cause failures because they can affect redundant devices, and because they are often deterministic, occurring whenever a set of circumstances exist.

Systematic failures include many types of errors, such as:

  • Manufacturing defects, e.g., software and hardware errors built into the device by the manufacturer.
  • Specification mistakes, e.g., incorrect design basis and inaccurate software specification.
  • Implementation errors, e.g., improper installation, incorrect programming, interface problems, and not following the safety manual for the devices used to realise the safety function.
  • Operation and maintenance, e.g., poor inspection, incomplete testing and improper bypassing [18].

Diverse redundancy is commonly used to mitigate systematic failures, since differences in component or subsystem design tend to create non-overlapping systematic failures, reducing the likelihood of a common error creating a common-mode failure. Errors in specification, implementation, operation and maintenance are not affected by diversity.

Figure 1 below shows the results of a small study done by the UK’s Health and Safety Executive in 1994 [19] that supports the idea that systematic failures are a significant contributor to safety system failures. The study included only 34 systems (n=34), so the results cannot be considered conclusive. However, there were some startling results. As you can see, errors in the specification of the safety functions (Safety Requirement Specification) resulted in about 44% of the system failures in the study. Based on this small sample, systematic failures appear to be a significant source of failures.

Pie chart illustrating the proportion of failures in each phase of the life cycle of a machine, based on data taken from HSE Report HSG238.
Figure 1 – HSG 238 Primary Causes of Failure by Life Cycle Stage

Handling CCF in ISO 13849-1

Now that we understand WHAT Common-Cause Failure is, and WHY it’s important, we can talk about HOW it is handled in ISO 13849-1. Since ISO 13849-1 is intended to be a simplified functional safety standard, CCF analysis is limited to a checklist in Annex F, Table F.1. Note that Annex F is informative, meaning that it is guidance material to help you apply the standard. Since this is the case, you could use any other means suitable for assessing CCF mitigation, like those in IEC 61508, or in other standards.

Table F.1 is set up with a series of mitigation measures which are grouped together in related categories. Each group is provided with a score that can be claimed if you have implemented the mitigations in that group. ALL OF THE MEASURES in each group must be fulfilled in order to claim the points for that category. Here’s an example:

A portion of ISO 13849-1 Table F.1.
ISO 13849-1:2015, Table F.1 Excerpt

In order to claim the 20 points available for the use of separation or segregation in the system design, there must be a separation between the signal paths. Several examples of this are given for clarity.

Table F.1 lists six groups of mitigation measures. In order to claim adequate CCF mitigation, a minimum score of 65 points must be achieved. Only Category 2, 3 and 4 architectures are required to meet the CCF requirements in order to claim the PL, but without meeting the CCF requirement you cannot claim the PL, regardless of whether the design meets the other criteria or not.
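The scoring rule described above (all-or-nothing points per group, a 65-point minimum, and applicability to Category 2, 3 and 4 only) is simple enough to sketch in a few lines of Python. This is a minimal sketch, not a copy of Table F.1: the group names, the number of measures per group, and every point value other than the 20 points for separation mentioned above are placeholders that I have assumed for illustration. The function names ccf_score and ccf_requirement_met are also hypothetical.

```python
# Minimal sketch of the scoring logic only. The group names and most point
# values below are hypothetical placeholders, NOT a copy of Table F.1.

CCF_MINIMUM_SCORE = 65                        # minimum total stated in the text
CATEGORIES_REQUIRING_CCF = {"2", "3", "4"}    # Category B and 1 are exempt


def ccf_score(groups):
    """Add up the points of every group whose measures are ALL fulfilled.

    groups maps a group name to (points, [True/False per measure]).
    """
    total = 0
    for points, measures in groups.values():
        if measures and all(measures):
            total += points
    return total


def ccf_requirement_met(category, groups):
    """Return True if the CCF requirement does not block the PL claim."""
    if category not in CATEGORIES_REQUIRING_CCF:
        return True
    return ccf_score(groups) >= CCF_MINIMUM_SCORE


# Hypothetical assessment of a Category 3 design.
example = {
    "separation/segregation": (20, [True]),
    "diversity": (20, [True]),
    "design/application/experience": (20, [True, False]),  # one measure missing
    "assessment/analysis": (5, [True]),
    "competence/training": (5, [True]),
    "environmental": (30, [True, True]),
}

print(ccf_score(example))                 # 80
print(ccf_requirement_met("3", example))  # True
```

Because a group scores nothing unless every measure in it is fulfilled, a single missing measure in a high-value group can drop an otherwise sound design below the 65-point minimum. Always take the actual measures and scores from the standard itself.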

One final note on CCF: If you are trying to review an existing control system, say in an existing machine, or in a machine designed by a third party where you have no way to determine the experience and training of the designers or the capability of the company’s change management process, then you cannot adequately assess CCF [8]. This fact is recognised in CSA Z432-16 [20], chapter 8, which allows the reviewer to simply verify that the architectural requirements, exclusive of any probabilistic requirements, have been met. This is particularly useful for engineers reviewing machinery under Ontario’s Pre-Start Health and Safety requirements [21], who are frequently working with less-than-complete design documentation.

In case you missed the first part of the series, you can read it here. In the next article in this series, I’m going to review the process flow for system analysis as currently outlined in ISO 13849-1. Watch for it!

Book List

Here are some books that I think you may find helpful on this journey:

[0]     B. Main, Risk Assessment: Basics and Benchmarks, 1st ed. Ann Arbor, MI USA: DSE, 2004.

[0.1]  D. Smith and K. Simpson, Safety critical systems handbook. Amsterdam: Elsevier/Butterworth-Heinemann, 2011.

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

[0.3]  Overview of techniques and measures related to EMC for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2013.

References

Note: This reference list starts in Part 1 of the series, so “missing” references may show in other parts of the series. The complete reference list is included in the last post of the series.

[1]     Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design. 3rd Edition. ISO Standard 13849-1. 2015.

[2]     Safety of machinery — Safety-related parts of control systems — Part 2: Validation. 2nd Edition. ISO Standard 13849-2. 2012.

[3]      Safety of machinery — General principles for design — Risk assessment and risk reduction. ISO Standard 12100. 2010.

[8]     S. Jocelyn, J. Baudoin, Y. Chinniah, and P. Charpentier, “Feasibility study and uncertainties in the validation of an existing safety-related control circuit with the ISO 13849-1:2006 design standard,” Reliab. Eng. Syst. Saf., vol. 121, pp. 104–112, Jan. 2014.

[17]      “failure mode”, 192-03-17, International Electrotechnical Vocabulary. IEC International Electrotechnical Commission, Geneva, 2015.

[18]      M. Gentile and A. E. Summers, “Common Cause Failure: How Do You Manage Them?,” Process Saf. Prog., vol. 25, no. 4, pp. 331–338, 2006.

[19]     Out of Control—Why control systems go wrong and how to prevent failure, 2nd ed. Richmond, Surrey, UK: HSE Health and Safety Executive, 2003.

[20]     Safeguarding of Machinery. 3rd Edition. CSA Standard Z432. 2016.

[21]     O. Reg. 851, INDUSTRIAL ESTABLISHMENTS. Ontario, Canada, 1990.

Testing Emergency Stop Systems

This entry is part 11 of 11 in the series Emergency Stop

Emergency Stop on machine console

I’ve had a number of questions from readers about testing emergency stop systems, and particularly about the frequency of testing. I addressed the types of tests that might be needed in another article covering Checking Emergency Stop Systems. This article will focus on the frequency of testing rather than the types of tests.

The Problem

Emergency stop systems are considered to be “complementary protective measures” in key machinery safety standards like ISO 12100 [1], and CSA Z432 [2]; this makes emergency stop systems the backup to the primary safeguards. Complementary protective measures are intended to permit “avoiding or limiting the harm” that may result from an emergent situation. By definition, this is a situation that has not been foreseen by the machine builder, or is the result of another failure. This could be a failure of another safeguarding system, or a failure in the machine that is not controlled by other means, e.g., a workpiece shatters due to a material flaw, and the broken pieces damage the machine, creating new, uncontrolled failure conditions in the machine.

Emergency stop systems are manually triggered, and usually infrequently used. The lack of use means that functional testing of the system doesn’t happen in the normal course of operation of the machinery. Some types of faults may occur and remain undetected until the system is actually used, e.g., contact blocks falling off the back of the operator device. Failure at that point may be catastrophic, since by implication the primary safeguards have already failed, and thus the failure of the backup eliminates the possibility of avoiding or limiting harm.

To understand the testing requirements, it’s important to understand the risk and reliability requirements that drive the design of emergency stop systems, and then get into the test frequency question.

Requirements

In the past, there were no explicit requirements for emergency stop system reliability. Details like the colour of the operator device, or the way the stop function worked were defined in ISO 13850 [3], NFPA 79 [4], and IEC 60204-1 [5]. In the soon-to-be published 3rd edition of ISO 13850, a new provision requiring emergency stop systems to meet at least PLc will be added [6], but until publication, it is up to the designer to determine the safety integrity level, either PL or SIL, required.

To determine the requirements for any safety function, the key is to start at the risk assessment. The risk assessment process requires that the designer understand the stage in the life cycle of the machine, the task(s) that will be done, and the specific hazards that a worker may be exposed to while conducting the task. This can become quite complex when considering maintenance and service tasks, and also applies to foreseeable failure modes of the machinery or the process. The scoring or ranking of risk can be accomplished using any suitable risk scoring tool that meets the minimum requirements in [1]. There are some good examples given in ISO/TR 14121-2 [7] if you are looking for some guidance. There are many good engineering textbooks available as well. Have a look at our Book List for some suggestions if you want a deeper dive.

Reliability

Once the initial unmitigated risk is understood, risk control measures can be specified. Wherever the control system is used as part of the risk control measure, a safety function must be specified. Specification of the safety function includes the Performance Level (PL), architectural category (B, 1-4), Mean Time to Dangerous Failure (MTTFd), and Diagnostic Coverage (DC) [6], or Safety Integrity Level (SIL), and Hardware Fault Tolerance (HFT), as described in IEC 62061 [8], as a minimum. If you are unfamiliar with these terms, see the definitions at the end of the article.

Referring to Figure 1, the “Risk Graph” [6, Annex A], we can reasonably state that for most machinery, a failure mode or emergent condition is likely to create conditions where the severity of injury is likely to require more than basic first aid, so selecting “S2” is the first step. In these situations, and particularly where the failure modes are not well understood, the highest level of severity of injury, S2, is selected because we don’t have enough information to expect that the injuries would only be minor. As soon as we make this selection, it is no longer possible to select any combination of Frequency or Probability parameters that will result in anything lower than PLc.

It’s important to understand that Figure 1 is not a risk assessment tool, but rather a decision tree used to select an appropriate PL based on the relevant risk parameters. Those parameters are:

Table 1 – Risk Parameters
Severity of injury | Frequency and/or exposure to hazard | Possibility of avoiding hazard or limiting harm
S1 – slight (normally reversible injury) | F1 – seldom-to-less-often and/or exposure time is short | P1 – possible under specific conditions
S2 – serious (normally irreversible injury or death) | F2 – frequent-to-continuous and/or exposure time is long | P2 – scarcely possible
Decision tree used to determine PL based on risk parameters.
Figure 1 – “Risk Graph” for determining PL
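
Because the risk graph is a decision tree, it maps naturally onto a lookup table. The sketch below encodes the eight paths as I understand them from the risk graph in [6, Annex A]; treat the mapping as an assumption to verify against the standard before relying on it. The names RISK_GRAPH and required_pl are hypothetical.

```python
# Sketch of the risk graph as a lookup table. The eight entries reflect my
# reading of the risk graph in ISO 13849-1 Annex A; verify before use.
RISK_GRAPH = {
    ("S1", "F1", "P1"): "a",
    ("S1", "F1", "P2"): "b",
    ("S1", "F2", "P1"): "b",
    ("S1", "F2", "P2"): "c",
    ("S2", "F1", "P1"): "c",
    ("S2", "F1", "P2"): "d",
    ("S2", "F2", "P1"): "d",
    ("S2", "F2", "P2"): "e",
}


def required_pl(severity, frequency, avoidance):
    """Return the required performance level for one path through the graph."""
    key = (severity, frequency, avoidance)
    if key not in RISK_GRAPH:
        raise ValueError(f"unknown risk parameters: {key}")
    return RISK_GRAPH[key]


# Every path that starts with S2 ends at PL c or higher.
s2_levels = sorted({pl for (s, _, _), pl in RISK_GRAPH.items() if s == "S2"})
print(s2_levels)                        # ['c', 'd', 'e']
print(required_pl("S2", "F1", "P1"))    # c
```

The last two lines show the point made above: once S2 is selected, no combination of the F and P parameters gives a result below PLc.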

PLc can be accomplished using any of three architectures: Category 1, 2, or 3. If you are unsure about what these architectures represent, have a look at my series covering this topic.

Category 1 is single channel, and does not include any diagnostics. A single fault can cause the loss of the safety function (i.e., the machine still runs even though the e-stop button is pressed). Using Category 1, the reliability of the design is based on the use of highly reliable components and well-tried safety principles. This approach can fail to danger.

Category 2 adds some diagnostic capability to the basic single channel configuration, and does not require the use of “well-tried” components. This approach can also fail to danger.

Category 3 architecture adds a redundant channel, and includes diagnostic coverage. Category 3 is not subject to failure due to single faults and is called “single-fault tolerant”. This approach is less likely to fail to danger, but still can in the presence of multiple, undetected, faults.

A key concept in reliability is the “fault”. This can be any kind of defect in hardware or software that results in unwanted behaviour or a failure. Faults are further broken down into dangerous and safe faults, meaning those that result in a dangerous outcome, and those that do not. Finally, each of these classes is broken down into detectable and undetectable faults. I’m not going to get into the mathematical treatment of these classes, but my point is this: there are undetectable dangerous faults. These are faults that cannot be detected by built-in diagnostics. As designers, we try to design the control system so that undetectable dangerous faults are extremely rare; ideally, they should occur much less than once in the lifetime of the machine.

What is the lifetime of the machine? The standards writers have settled on a default lifetime of 20 years, thus the answer is that undetectable dangerous failures should happen much less than once in twenty years of 24/7/365 operation. So why does this matter? Each architectural category has different requirements for testing. The test rates are driven by the “Demand Rate”. The Demand Rate is defined in [6]. “SRP/CS” stands for “Safety Related Part of the Control System” in the definition:

3.1.30
demand rate (rd) – frequency of demands for a safety-related action of the SRP/CS

Each time the emergency stop button is pressed, a “demand” is put on the system. Looking at the “Simplified Procedure for estimating PL”, [6, 4.5.4], we find that the standard makes the following assumptions:

  • mission time, 20 years (see Clause 10);
  • constant failure rates within the mission time;
  • for category 2, demand rate <= 1/100 test rate;
  • for category 2, MTTFd,TE larger than half of MTTFd,L.

NOTE When blocks of each channel cannot be separated, the following can be applied: MTTFd of the summarized test channel (TE, OTE) larger than half MTTFd of the summarized functional channel (I, L, O).
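
The two Category 2 conditions in the list above can be checked mechanically. Here is a minimal sketch, assuming both rates are expressed per year and both MTTFd values in years; the function name and the example numbers are hypothetical and are not taken from the standard.

```python
def category2_assumptions(demand_rate, test_rate, mttfd_te, mttfd_l):
    """Check the two Category 2 assumptions quoted above.

    demand_rate, test_rate: events per year (any common unit works).
    mttfd_te, mttfd_l: MTTFd of the test equipment and of the logic, in years.
    Returns (demand_ok, test_equipment_ok).
    """
    demand_ok = demand_rate <= test_rate / 100      # demand rate <= 1/100 test rate
    test_equipment_ok = mttfd_te > mttfd_l / 2      # MTTFd,TE > half of MTTFd,L
    return demand_ok, test_equipment_ok


# Hypothetical example: e-stop pressed about once a month, diagnostics run
# once an hour, test equipment MTTFd of 40 years, logic MTTFd of 60 years.
print(category2_assumptions(12, 24 * 365, 40, 60))   # (True, True)

# Same system, but the diagnostic test only runs once a day.
print(category2_assumptions(12, 365, 40, 60))        # (False, True)
```

In the second case the test rate is only about 30 times the demand rate, so the first assumption fails and the simplified procedure would not apply as-is.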

So what does all that mean? The 20-year mission time is the assumed lifetime of the machinery. This number underpins the rest of the calculations in the standard, and is based on the idea that few modern control systems last longer than 20 years without being replaced or rebuilt. The constant failure rate points at the idea that systems used in the field will have components and controls that are not subject to infant mortality, nor are they old enough to start to fail due to age, but rather that the system is operating in the flat portion of the standardized failure rate “bathtub curve”, [9]. See Figure 2. Components that are subject to infant mortality failed at the factory and were removed from the supply chain. Those failing from “wear-out” are expected to reach that point after 20 years. If this is not the case, then the maintenance instructions for the system should include preventative maintenance tasks that require replacing critical components before they reach the predicted MTTFd.
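
To get a feel for what a constant failure rate implies over the 20-year mission time, here is a minimal sketch using the standard exponential model, in which the failure rate is 1/MTTFd. This is general reliability arithmetic rather than a calculation taken from [6], it treats a single component or channel with no diagnostics, and the function name is hypothetical.

```python
import math

MISSION_TIME_YEARS = 20  # mission time assumed in the text


def prob_dangerous_failure(mttfd_years, mission_years=MISSION_TIME_YEARS):
    """Probability of at least one dangerous failure within the mission time,
    assuming a constant failure rate of 1/MTTFd (exponential model)."""
    failure_rate = 1.0 / mttfd_years
    return 1.0 - math.exp(-failure_rate * mission_years)


for mttfd in (30, 100):
    print(mttfd, round(prob_dangerous_failure(mttfd), 3))
```

With an MTTFd of 30 years the result is a little under 0.5, and with 100 years it is a little under 0.2. In other words, MTTFd is a statistical mean, not a guaranteed lifetime.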

Diagram of a standardized bathtub-shaped failure rate curve.
Figure 2 – Weibull Bathtub Curve [9]

For systems using Category 2 architecture, the automatic diagnostic test rate must be at least 100x the demand rate. Keep in mind that this test rate is normally accomplished automatically in the design of the controls, and is only related to the detectable safe or dangerous faults. Undetectable faults must have a probability of less than once in 20 years, and should be detected by the “proof test”. More on that a bit later.

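Finally, the MTTFd of the test equipment (TE) must be more than half the MTTFd of the logic (L) in the functional channel, so the diagnostics cannot be much less reliable than the function they monitor.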

Category 1 has no diagnostics, so there is no guidance in [6] to help us out with these systems. Category 3 is single fault tolerant, so as long as we don’t have multiple undetected faults we can count on the system to function and to alert us when a single fault occurs; remember that the automatic tests may not be able to detect every fault. This is where the “proof test” comes in. What is a proof test? To find a definition for proof test, we have to look at IEC 61508-4 [10]:

3.8.5
proof test
periodic test performed to detect failures in a safety-related system so that, if necessary, the system can be restored to an “as new” condition or as close as practical to this condition

NOTE – The effectiveness of the proof test will be dependent upon how close to the “as new” condition the system is restored. For the proof test to be fully effective, it will be necessary to detect 100 % of all dangerous failures. Although in practice 100 % is not easily achieved for other than low-complexity E/E/PE safety-related systems, this should be the target. As a minimum, all the safety functions which are executed are checked according to the E/E/PES safety requirements specification. If separate channels are used, these tests are done for each channel separately.

The 20-year life cycle assumption used in the standards also applies to proof testing. Machine controls are assumed to get at least one proof test in their life time. The proof test should be designed to detect faults that the automatic diagnostics cannot detect. Proof tests are also conducted after major rebuilds and repairs to ensure that the system operates correctly.

If you know the architecture of the emergency stop control system, you can determine the test rate based on the demand rate. It would be considerably easier if the standards just gave us some minimum test rates for the various architectures. One standard, ISO 14119 [11] on interlocks, does just that. Admittedly, this standard does not include emergency stop functions within its scope, as its focus is on interlocks, but since interlocking systems are more critical than the complementary protective measures that back them up, it would be reasonable to apply these same rules. Looking at the clause on Assessment of Faults, [11, 8.2], we find this guidance:

For applications using interlocking devices with automatic monitoring to achieve the necessary diagnostic coverage for the required safety performance, a functional test (see IEC 60204-1:2005, 9.4.2.4) can be carried out every time the device changes its state, e.g. at every access. If, in such a case, there is only infrequent access, the interlocking device shall be used with additional measures, because between consecutive functional tests the probability of occurrence of an undetected fault is increased.

When a manual functional test is necessary to detect a possible accumulation of faults, it shall be made within the following test intervals:

  • at least every month for PL e with Category 3 or Category 4 (according to ISO 13849-1) or SIL 3 with HFT (hardware fault tolerance) = 1 (according to IEC 62061);
  • at least every 12 months for PL d with Category 3 (according to ISO 13849-1) or SIL 2 with HFT (hardware fault tolerance) = 1 (according to IEC 62061).

NOTE It is recommended that the control system of a machine demands these tests at the required intervals e.g. by visual display unit or signal lamp. The control system should monitor the tests and stop the machine if the test is omitted or fails.

In the preceding, HFT=1 is equivalent to saying that the system is single-fault tolerant.
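
The two intervals quoted above reduce to a small lookup. This is a minimal sketch that encodes only the two cases given in the quoted clause and returns None for anything else; the function name and its parameters are hypothetical.

```python
def max_manual_test_interval_months(pl=None, category=None, sil=None, hft=None):
    """Return the longest allowed interval between manual functional tests,
    in months, for the two cases quoted from ISO 14119, 8.2.

    Returns None where the quoted clause gives no explicit interval.
    """
    if (pl == "e" and category in (3, 4)) or (sil == 3 and hft == 1):
        return 1      # at least every month
    if (pl == "d" and category == 3) or (sil == 2 and hft == 1):
        return 12     # at least every 12 months
    return None       # e.g. PL c, Category 1: no explicit guidance


print(max_manual_test_interval_months(pl="e", category=4))   # 1
print(max_manual_test_interval_months(sil=2, hft=1))         # 12
print(max_manual_test_interval_months(pl="c", category=1))   # None
```

Returning None rather than a default value is deliberate: where the clause is silent, the interval has to come from an engineering judgement, not from the lookup.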

This leaves us then with recommended test frequencies for Category 2 and 3 architectures in PLc, PLd, and PLe, or for SIL 2 and 3 with HFT=1. We still don’t have a test frequency for PLc, Category 1 systems. There is no explicit guidance for these systems in the standards. How can we determine a test rate for these systems?

My approach would be to start by examining the MTTFd values for all of the subsystems and components. [6] requires that the system have a HIGH MTTFd value, meaning 30 years <= MTTFd <= 100 years [6, Table 5]. If this is the case, then the once-in-20-years proof test is theoretically enough. If the system is constructed, for example, as shown in Figure 3 below, with four components in series in the safety function, then each component would have to have an MTTFd of at least 120 years. See [6, Annex D] for this calculation.

Basic Stop/Start Circuit
Figure 3 – Basic Stop/Start Circuit

PB1 – Emergency Stop Button

PB2 – Power “ON” Button

MCR – Master Control Relay

MOV – Surge Suppressor on MCR Coil

M1 – Machine prime-mover (motor)

Note that the fuses are not included, since they can only fail to safety, and assuming that they were specified correctly in the original design, are not subject to the same cyclical aging effects as the other components.

M1 is not included, since it is the controlled portion of the machine and is not part of the control system.
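
The 120-year figure follows from the parts-count relationship for components in series, where the reciprocal of the channel MTTFd is the sum of the reciprocals of the component values. Here is a minimal sketch, assuming the four components listed above (PB1, PB2, MCR and MOV) all contribute to the channel; the MTTFd values are hypothetical and are not taken from any data sheet.

```python
def channel_mttfd(component_mttfd_years):
    """Parts-count estimate for components in series:
    1 / MTTFd_channel = sum(1 / MTTFd_i)."""
    return 1.0 / sum(1.0 / m for m in component_mttfd_years.values())


# Hypothetical values: four components at 120 years each.
components = {"PB1": 120.0, "PB2": 120.0, "MCR": 120.0, "MOV": 120.0}
print(round(channel_mttfd(components), 1))   # 30.0

# One weaker component pulls the whole channel below the 30-year threshold.
weaker = dict(components, MCR=60.0)
print(round(channel_mttfd(weaker), 1))       # 24.0
```

The second case shows why every component matters: a single relay at 60 years drops the channel to 24 years, which no longer qualifies as HIGH.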

If a review of the components in the system shows that any single component falls below the target MTTFd, then I would consider replacing the system with a higher category design. Since most of these components will be unlikely to have MTTFd values on the spec sheet, you will likely have to convert from total life values (B10). This is outside the scope of this article, but you can find guidance in [6, Annex C]. More frequent testing, i.e., more than once in 20 years, is always acceptable.

Where manual testing is required as part of the design for any category of system, and particularly in Category 1 or 2 systems, the control system should alert the user to the requirement and not permit the machine to operate until the test is completed. This will help to ensure that the requisite tests are properly completed.

Need more information? Leave a comment below, or send me an email with the details of your application!

Definitions

3.1.9 [10]
functional safety

part of the overall safety relating to the EUC and the EUC control system which depends on the correct functioning of the E/E/PE safety-related systems, other technology safety-related systems and external risk reduction facilities

3.2.6 [10]
electrical/electronic/programmable electronic (E/E/PE)
based on electrical (E) and/or electronic (E) and/or programmable electronic (PE) technology

NOTE – The term is intended to cover any and all devices or systems operating on electrical principles.
EXAMPLE Electrical/electronic/programmable electronic devices include

  • electromechanical devices (electrical);
  • solid-state non-programmable electronic devices (electronic);
  • electronic devices based on computer technology (programmable electronic); see 3.2.5

3.5.1 [10]
safety function
function to be implemented by an E/E/PE safety-related system, other technology safety related system or external risk reduction facilities, which is intended to achieve or maintain a safe state for the EUC, in respect of a specific hazardous event (see 3.4.1)

3.5.2 [10]
safety integrity
probability of a safety-related system satisfactorily performing the required safety functions under all the stated conditions within a stated period of time

NOTE 1 – The higher the level of safety integrity of the safety-related systems, the lower the probability that the safety-related systems will fail to carry out the required safety functions.
NOTE 2 – There are four levels of safety integrity for systems (see 3.5.6).

3.5.6 [10]
safety integrity level (SIL)
discrete level (one out of a possible four) for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest

NOTE – The target failure measures (see 3.5.13) for the four safety integrity levels are specified in tables 2 and 3 of IEC 61508-1.

3.6.3 [10]
fault tolerance
ability of a functional unit to continue to perform a required function in the presence of faults or errors

NOTE – The definition in IEV 191-15-05 refers only to sub-item faults. See the note for the term fault in 3.6.1.
[ISO/IEC 2382-14-04-061]

3.1.1 [6]
safety-related part of a control system (SRP/CS)
part of a control system that responds to safety-related input signals and generates safety-related output signals

NOTE 1 The combined safety-related parts of a control system start at the point where the safety-related input signals are initiated (including, for example, the actuating cam and the roller of the position switch) and end at the output of the power control elements (including, for example, the main contacts of a contactor).
NOTE 2 If monitoring systems are used for diagnostics, they are also considered as SRP/CS.

3.1.2 [6]
category
classification of the safety-related parts of a control system in respect of their resistance to faults and their subsequent behaviour in the fault condition, and which is achieved by the structural arrangement of the parts, fault detection and/or by their reliability

3.1.3 [6]
fault
state of an item characterized by the inability to perform a required function, excluding the inability during preventive maintenance or other planned actions, or due to lack of external resources

NOTE 1 A fault is often the result of a failure of the item itself, but may exist without prior failure.
[IEC 60050-191:1990, 05-01]
NOTE 2 In this part of ISO 13849, “fault” means random fault.

3.1.4 [6]
failure
termination of the ability of an item to perform a required function

NOTE 1 After a failure, the item has a fault.
NOTE 2 “Failure” is an event, as distinguished from “fault”, which is a state.
NOTE 3 The concept as defined does not apply to items consisting of software only.
[IEC 60050-191:1990, 04-01]
NOTE 4 Failures which only affect the availability of the process under control are outside of the scope of this part of ISO 13849.

3.1.5 [6]
dangerous failure
failure which has the potential to put the SRP/CS in a hazardous or fail-to-function state

NOTE 1 Whether or not the potential is realized can depend on the channel architecture of the system; in redundant systems a dangerous hardware failure is less likely to lead to the overall dangerous or fail-to-function state.
NOTE 2 Adapted from IEC 61508-4:1998, definition 3.6.7.

3.1.20 [6]
safety function
function of the machine whose failure can result in an immediate increase of the risk(s)
[ISO 12100-1:2003, 3.28]

3.1.21 [6]
monitoring
safety function which ensures that a protective measure is initiated if the ability of a component or an element to perform its function is diminished or if the process conditions are changed in such a way that a decrease of the amount of risk reduction is generated

3.1.22 [6]
programmable electronic system (PES)
system for control, protection or monitoring dependent for its operation on one or more programmable electronic devices, including all elements of the system such as power supplies, sensors and other input devices, contactors and other output devices

NOTE Adapted from IEC 61508-4:1998, definition 3.3.2.

3.1.23 [6]
performance level (PL)
discrete level used to specify the ability of safety-related parts of control systems to perform a safety function under foreseeable conditions

NOTE See 4.5.1.

3.1.25 [6]
mean time to dangerous failure (MTTFd)
expectation of the mean time to dangerous failure

NOTE Adapted from IEC 62061:2005, definition 3.2.34.

3.1.26 [6]
diagnostic coverage (DC)
measure of the effectiveness of diagnostics, which may be determined as the ratio between the failure rate of detected dangerous failures and the failure rate of total dangerous failures

NOTE 1 Diagnostic coverage can exist for the whole or parts of a safety-related system. For example, diagnostic coverage could exist for sensors and/or logic system and/or final elements.
NOTE 2 Adapted from IEC 61508-4:1998, definition 3.8.6.

3.1.33 [6]
safety integrity level (SIL)
discrete level (one out of a possible four) for specifying the safety integrity requirements of the safety functions to be allocated to the E/E/PE safety-related systems, where safety integrity level 4 has the highest level of safety integrity and safety integrity level 1 has the lowest

[IEC 61508-4:1998, 3.5.6]

Acknowledgements

Thanks to my colleagues Derek Jones and Jonathan Johnson, both from Rockwell Automation, and members of ISO TC199. Their suggestion to reference ISO 14119 clause 8.2 was the seed for this article.

I’d also like to acknowledge Ronald Sykes, Howard Touski, Mirela Moga, Michael Roland, and Grant Rider for asking the questions that led to this article.

References

[1]     Safety of machinery — General principles for design — Risk assessment and risk reduction. ISO 12100. International Organization for Standardization (ISO). Geneva 2010.

[2]    Safeguarding of Machinery. CSA Z432. Canadian Standards Association. Toronto. 2004.

[3]    Safety of machinery – Emergency stop – Principles for design. ISO 13850. International Organization for Standardization (ISO). Geneva. 2006.

[4]    Electrical Standard for Industrial Machinery. NFPA 79. National Fire Protection Association (NFPA). Batterymarch Park. 2015.

[5]    Safety of machinery – Electrical equipment of machines – Part 1: General requirements. IEC 60204-1. International Electrotechnical Commission (IEC). Geneva. 2009.

[6]    Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design.  ISO 13849-1. International Organization for Standardization (ISO). Geneva. 2006.

[7]    Safety of machinery — Risk assessment — Part 2: Practical guidance and examples of methods. ISO/TR 14121-2. International Organization for Standardization (ISO). Geneva. 2012.

[8]   Safety of machinery – Functional safety of safety-related electrical, electronic and programmable electronic control systems. IEC 62061. International Electrotechnical Commission (IEC). Geneva. 2005.

[9]    D. J. Wilkins (2002, November). “The Bathtub Curve and Product Failure Behavior. Part One – The Bathtub Curve, Infant Mortality and Burn-in”. Reliability Hotline [Online]. Available: http://www.weibull.com/hotwire/issue21/hottopics21.htm. [Accessed: 26-Apr-2015].

[10] Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations. IEC 61508-4. International Electrotechnical Commission (IEC). Geneva. 1998.

[11] Safety of machinery — Interlocking devices associated with guards — Principles for design and selection. ISO 14119. International Organization for Standardization (ISO). Geneva. 2013.

Sources for Standards

CANADA

Canadian Standards Association sells CSA, ISO and IEC standards to the Canadian Market.

USA

NSSN: National Standards Search Engine powered by ANSI offers standards from most US Standards Development Organizations. They also sell ISO and IEC standards into the US market.

International

International Organization for Standardization (ISO).

International Electrotechnical Commission (IEC).

Interlock Architectures – Pt. 4: Category 3 – Control Reliable

This entry is part 4 of 8 in the series Circuit Architectures Explored

Category 3 system architecture is the first category that could be considered similar to “Control Reliable” circuits or systems as defined in the North American standards. It is not the same as Control Reliable, but we’ll get to that in a subsequent post. If you haven’t read the first three posts in this series, you may want to go back and review them, as the concepts in those articles are the basis for the discussion in this post.

So what is “Control Reliable” anyway? This term was coined by the ANSI RIA R15.06 technical committee when they were developing their definitions for control system reliability, first published in the 1999 edition of the standard. No mention of the concept of control reliability appears in the 1994 edition of CSA Z434 or the preceding edition of RIA R15.06.

Essentially, the term “Control Reliable” means that the control system is designed with some degree of fault tolerance. Depending on the definitions that you read, this could be single- or multiple-fault-tolerance.

There are a number of design techniques that can be used to increase the fault tolerance of a control system. The older approaches, such as those given in ANSI RIA R15.06-1999, CSA Z434-03 or EN 954-1:95, rely primarily on the structure or architecture of the circuit, and the characteristics of the components selected for use. ISO 13849-1 uses the same basic architectures defined by EN 954-1:95, and extends them to include diagnostic coverage, common cause failure resistance and an understanding of the failure rate of the components to determine the degree of fault tolerance and reliability provided by the design.

OK, enough background for now! Let’s look at the definition for Category 3 systems. Remember that “SRP/CS” means “Safety Related Parts of the Control System”.

Definition

6.2.6 Category 3

For category 3, the same requirements as those according to 6.2.3 for category B shall apply. “Well-tried safety principles” according to 6.2.4 shall also be followed. In addition, the following applies. SRP/CS of category 3 shall be designed so that a single fault in any of these parts does not lead to the loss of the safety function. Whenever reasonably practicable, the single fault shall be detected at or before the next demand upon the safety function.

The diagnostic coverage (DCavg) of the total SRP/CS including fault-detection shall be low. The MTTFd of each of the redundant channels shall be low-to-high, depending on the PLr. Measures against CCF shall be applied (see Annex F).

NOTE 1 The requirement of single-fault detection does not mean that all faults will be detected. Consequently, the accumulation of undetected faults can lead to an unintended output and a hazardous situation at the machine. Typical examples of practicable measures for fault detection are use of the feedback of mechanically guided relay contacts and monitoring of redundant electrical outputs.

NOTE 2 If necessary because of technology and application, type-C standard makers need to give further details on the detection of faults.

NOTE 3 Category 3 system behaviour allows that

  • when the single fault occurs the safety function is always performed,
  • some but not all faults will be detected,
  • accumulation of undetected faults can lead to the loss of the safety function.

NOTE 4 The technology used will influence the possibilities for the implementation of fault detection.

Breaking it down

Let’s take the definition apart and look at the components that make it up.

For category 3, the same requirements as those according to 6.2.3 for category B shall apply. “Well-tried safety principles” according to 6.2.4 shall also be followed.

The first couple of lines remind the designer of two key points:

  • The components selected must be suitable for the application, i.e. correctly specified for voltage, current, environmental conditions, etc.; and
  • “well-tried safety principles” must be used in the design.

It’s important to note here that we are talking about “well-tried safety principles” and NOT “well-tried components”. The requirement to use components designed for safety applications comes from other standards, like EN 1088 and ISO 13850. Requirements from these standards, such as the use of “direct-drive” contacts, improve the fault tolerance of the component, and so benefit the design in the end. These improvements are generally reflected in the B10d or MTTFd of the component. They are also points that inspectors will commonly look for because they are easy to spot in the field: “safety-rated components” often use red or yellow caps to identify them clearly in the control panel.

In addition, the following applies. SRP/CS of category 3 shall be designed so that a single fault in any of these parts does not lead to the loss of the safety function.

This sentence makes the requirement for single-fault tolerance. This means that the failure of any single component in the functional channel cannot result in the loss of the safety function. To meet this requirement, redundancy is needed. With redundant systems, one complete channel can fail without losing the ability to stop the machinery. It is possible to lose the function of the monitoring system from a single component failure, but as long as the system continues to provide the safety function this may be acceptable. The system should not permit itself to be reset if the monitoring system is not working.

One more “gotcha” from this sentence: In order to meet the requirement that any single component failure can be detected, the design will require two separate sensors to detect the position of a gate, for example. This permits the system to detect a failure in either sensor, including mechanical failures like broken keys or attempts to defeat the safety system. You can clearly see this in both the block diagram, which does not show any monitoring connection to the input devices, and in the circuit diagram. Both of these diagrams are shown later in this post. The only way out of the requirement to have redundant sensors is to select a gate switch that is robust enough that mechanical faults can reasonably be excluded. I’ll get into fault exclusions later in this article.

Whenever reasonably practicable, the single fault shall be detected at or before the next demand upon the safety function.

This sentence can be a bit sticky. The phrase “Whenever reasonably practicable” means that your design needs to be able to detect single faults unless it would be “unreasonable” to do so. What constitutes an unreasonable degree of effort? This is for you to decide. I will say that if there is a commercial off-the-shelf (COTS) component available that will do the job, and you choose not to use it, you will have a difficult time convincing a court that you took every reasonably practicable means to detect the fault.

Following the comma, the rest of the sentence provides the designer with the basic requirement for the test system: it must be able to detect a single component failure at the moment of demand (this is usually how it’s done, since this is typically the simplest way) or before it occurs, which can happen if your test equipment has a means to detect a change in some critical characteristic of the monitored component(s).

 The diagnostic coverage (DCavg) of the total SRP/CS including fault-detection shall be low.

This sentence tells you that your design must meet the requirements for LOW Diagnostic Coverage. To get to LOW DCavg, we need to look first at Table 6:

ISO 13849-1:06 Table 6

Diagnostic Coverage (DC)

Denotation    Range
None          DC < 60%
Low           60% ≤ DC < 90%
Medium        90% ≤ DC < 99%
High          99% ≤ DC

NOTE 1 For SRP/CS consisting of several parts an average value DCavg for DC is used in Figure 5, Clause 6 and E.2.

NOTE 2 The choice of the DC ranges is based on the key values 60 %, 90 % and 99 % also established in other standards (e.g. IEC 61508) dealing with diagnostic coverage of tests. Investigations show that (1 – DC) rather than DC itself is a characteristic measure for the effectiveness of the test. (1 – DC) for the key values 60 %, 90 % and 99 % forms a kind of logarithmic scale fitting to the logarithmic PL-scale. A DC-value less than 60 % has only slight effect on the reliability of the tested system and is therefore called “none”. A DC-value greater than 99 % for complex systems is very hard to achieve. To be practicable, the number of ranges was restricted to four. The indicated borders of this table are assumed within an accuracy of 5 %.

Based on Table 6, the DCavg must be at least 60% and less than 90%, all components considered. To score this, go to Annex E and look at Table E.1. Using the DC values given in Table E.1, score each part of the design and work out the average. If you end up in the desired range, 60% ≤ DCavg < 90%, you can move on. If not, the design will require modification to bring it into this range.
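To make the scoring step a little more concrete, here is a minimal Python sketch of the averaging formula from Annex E, where each part's DC is weighted by the reciprocal of its MTTFd, followed by a lookup against the Table 6 ranges. The part list and its values are made up purely for illustration, and the function names are my own; none of this comes from the standard or from any library.

```python
def dc_avg(components):
    """Average diagnostic coverage, weighted by 1/MTTFd.

    components: iterable of (dc, mttfd_years) pairs, with dc as a fraction 0..1.
    """
    components = list(components)
    weighted = sum(dc / mttfd for dc, mttfd in components)
    total = sum(1.0 / mttfd for _, mttfd in components)
    return weighted / total


def dc_denotation(dc):
    """Map a DC value (as a fraction) onto the Table 6 denotations."""
    if dc < 0.60:
        return "None"
    if dc < 0.90:
        return "Low"
    if dc < 0.99:
        return "Medium"
    return "High"


# Hypothetical parts: (DC, MTTFd in years). Illustrative values only.
parts = [
    (0.99, 50.0),   # e.g. an output contactor with monitored feedback
    (0.60, 100.0),  # e.g. an input device with limited diagnostics
    (0.90, 25.0),   # e.g. a logic element
]

avg = dc_avg(parts)
print(f"DCavg = {avg:.1%} ({dc_denotation(avg)})")
```

Notice that parts with a short MTTFd pull the average toward their own DC, so improving the diagnostics on the least reliable part usually buys you the most.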

The MTTFd of each of the redundant channels shall be low-to-high, depending on the PLr.

This sentence reminds you that your component selections matter. Depending on the PLr you are trying to achieve, you will need to choose components with suitable MTTFd ratings. Remember that just because you are using a Category 3 architecture, you have not automatically achieved the highest levels of reliability. If you refer to Figure 5 in the standard, you can see that a Category 3 architecture can meet a range of PLs, all the way from PLa through PLe!

ISO 13849-1 Figure 5

If you want, or need, to know the numeric boundaries of each of the bands in the diagram above, look at Annex K of the standard. The full numeric representation of Figure 5 is provided in that Annex.
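Each PL band corresponds to a range of average probability of a dangerous failure per hour (PFHd). Here is a minimal sketch of that lookup in Python, assuming the band limits as I read them from the 2006 edition of the standard; check them against your own copy before relying on them. The names `PL_BANDS` and `pl_from_pfhd` are my own.

```python
# (PL, lower limit inclusive, upper limit exclusive), in failures per hour.
# Band limits are an assumption taken from my reading of the 2006 edition.
PL_BANDS = [
    ("a", 1e-5, 1e-4),
    ("b", 3e-6, 1e-5),
    ("c", 1e-6, 3e-6),
    ("d", 1e-7, 1e-6),
    ("e", 1e-8, 1e-7),
]


def pl_from_pfhd(pfhd):
    """Return the PL letter for a PFHd value, or None if it is out of range."""
    for pl, lower, upper in PL_BANDS:
        if lower <= pfhd < upper:
            return pl
    return None


for value in (5e-5, 5e-6, 2e-6, 5e-7, 5e-8):
    print(f"PFHd = {value:.1e} per hour gives PL {pl_from_pfhd(value)}")
```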

Measures against CCF shall be applied (see Annex F).

In order for your design to meet Category 3, CCF measures are required. I’ve discussed Common Cause Failures elsewhere on the blog, but as a reminder, a Common Cause Failure is one where a single event, like a lightning strike on the power line or a cable being cut, results in the failure of more than one part of the system. This is not the same as a Common Mode Failure, where similar or different components fail in the same way. For instance, if both output contactors were to weld closed, either simultaneously or at different times, because they were undersized and overloaded, this could be considered a Common Mode Failure. If they both weld closed due to a lightning strike, that is a Common Cause Failure.

Annex F provides a checklist that is used to score the CCF resistance of the design. The design must score at least 65 points to be considered to meet the minimum level of CCF protection, and more is better of course! Score your design and see where you come out. Less than 65 and you need to do more. 65 or more and you are good to go.
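If you like to keep your scoring in a script rather than on paper, the checklist can be sketched like this. The point values below are how I read Table F.1 in the 2006 edition, and the measure names are my own shorthand, so verify both against your copy of the standard. I am also assuming each measure is counted all-or-nothing, with no partial credit.

```python
# Points per measure, as I read them from Annex F (Table F.1), 2006 edition.
# The keys are my own shorthand labels, not wording from the standard.
CCF_MEASURES = {
    "separation_segregation": 15,
    "diversity": 20,
    "overvoltage_overcurrent_protection": 15,
    "well_tried_components": 5,
    "assessment_analysis": 5,
    "competence_training": 5,
    "emc_contamination": 25,
    "other_environmental_influences": 10,
}

CCF_PASS_SCORE = 65  # minimum score quoted in the article


def ccf_score(measures_met):
    """Sum the points for the measures that are fully implemented."""
    met = set(measures_met)
    unknown = met - set(CCF_MEASURES)
    if unknown:
        raise ValueError(f"Unknown CCF measure(s): {sorted(unknown)}")
    return sum(CCF_MEASURES[name] for name in met)


def ccf_ok(measures_met):
    """True if the design reaches the minimum CCF score."""
    return ccf_score(measures_met) >= CCF_PASS_SCORE


# Hypothetical design: the measures claimed here are for illustration only.
design = [
    "separation_segregation",
    "overvoltage_overcurrent_protection",
    "emc_contamination",
    "other_environmental_influences",
]
print(ccf_score(design), "points, pass" if ccf_ok(design) else "points, fail")
```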

The Notes

The notes given in the definition are also important. Note 1 reminds the designer that not all faults will be detected, and an accumulation of undetected faults can lead to the loss of the safety function. Be aware that it is up to you as the designer to minimize the kinds of failures that can accumulate undetected.

Note 2 speaks to the possibility that a Type-C product standard, like EN 201 for injection moulding machines for example, may impose a minimum PLr on the design. Make sure that you get a copy of any Type-C standard that is relevant for your product and market. Note that the designation “Type-C” comes from ISO. If you go looking for this terminology in ANSI or CSA standards, you won’t find it used because the concept doesn’t exist in the same way in these National standards.

Note 3 gives you the basic performance parameters for the design. If your design can do these things, then you’re halfway there.

Finally, Note 4 is a reminder that different kinds of technology have greater or lesser capability to detect failures. More sophisticated technology may be required to achieve the PL you need.

The Block Diagram

Let’s have a look at the functional block diagram for this Category.

ISO 13849-1 Figure 11

By looking at the diagram you can see clearly the two independent channels and the cross-monitoring connection between the channels. Input devices are not monitored, but output devices are monitored. This is another significant reason requiring the use of two physically separate input devices to sense the guard position or whatever other safeguarding device is integrated into the system. The only way that a failure in the input devices can be detected is if one channel changes state and one does not.
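The cross-monitoring idea can be sketched in a few lines of Python. This is a toy logic model to show the principle, not how any particular safety relay works internally, and it is certainly not something to run as safety-related software. The names `InputResult`, `evaluate_inputs` and `outputs_enabled` are my own, and I have ignored the short discrepancy time that real devices typically allow between the two channels changing state.

```python
from enum import Enum


class InputResult(Enum):
    RUN_PERMITTED = "run permitted"
    STOP = "stop demanded"
    FAULT = "fault: channels disagree"


def evaluate_inputs(ch1_closed: bool, ch2_closed: bool) -> InputResult:
    """Compare the two input channels against each other."""
    if ch1_closed and ch2_closed:
        return InputResult.RUN_PERMITTED
    if not ch1_closed and not ch2_closed:
        return InputResult.STOP
    # One channel changed state and the other did not.
    return InputResult.FAULT


def outputs_enabled(result: InputResult) -> bool:
    """Outputs are energized only when both channels agree the guard is closed."""
    return result is InputResult.RUN_PERMITTED


# Gate opened, but the channel 1 switch has failed closed (single fault):
result = evaluate_inputs(ch1_closed=True, ch2_closed=False)
print(result.value, "| outputs enabled:", outputs_enabled(result))
```

The single fault does not defeat the safety function, because the healthy channel still removes the run permission, and the disagreement between the channels is what exposes the fault.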

If you want to learn more about applying the block diagramming method to your design, there is a good explanation of the method in the SISTEMA Cookbook 1, published by the IFA in Germany. You can download the English version from the link above, or get the document directly from the IFA web site.

Circuit Diagram

By now you probably get the idea that there are as many ways to configure a Category 3 circuit as there are applications. Below is a typical circuit diagram borrowed from Rockwell Allen-Bradley, showing the application of typical safety relays in a complete system that includes the emergency stop system, a gate interlock and a safety mat. You can meet the requirements for Category 3 architecture in other ways, so don’t feel that you must use a COTS safety relay. It just may be the most straightforward way in many cases.

This is not a plug for A-B products. Neither Machinery Safety 101, nor I, have any relationship with Rockwell Allen-Bradley.

From Rockwell Automation publication SAFETY-WD001A-EN-P – June 2011, p.6.

If you’re interested in obtaining the source document containing this diagram, you can download it directly from the Rockwell Automation web site.

Emergency Stop Subsystem

The emergency stop circuit uses the 440R-S12R2 relay on the left side of the diagram. This particular system uses Category 3 architecture in the e-stop system, which may be more than is required. A risk assessment and a start-stop analysis are required to determine what performance level is needed for this subsystem.

Gate Interlock Subsystem

The gate interlock circuit is located in the center of the diagram, and uses the 440R-D22R2 relay. As you can see, there are two physically separate gate interlock switches. Only one contact from each switch is used, so one switch is connected to Channel 1, and the other to Channel 2. Notice that there is no other monitoring of these devices (i.e. no second connection to either switch). The secondary contacts on these switches could be connected to the PLC for annunciation purposes. This would allow the PLC to display the open/closed status of the gate on the machine HMI.

The output contactors, K3 and K4, are monitored by the reset loop connected to S34 and the +V rail.
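Here is a rough Python model of what that reset-loop monitoring achieves. I am assuming the usual arrangement, where a normally-closed, mechanically linked auxiliary contact from each contactor is wired in series in the reset loop; that is my assumption about the drawing, so check the source document. The function names are my own, and this is an illustration of the logic only.

```python
def reset_loop_closed(k3_dropped_out: bool, k4_dropped_out: bool) -> bool:
    """Series NC feedback contacts: the loop is closed only if BOTH
    contactors have actually dropped out."""
    return k3_dropped_out and k4_dropped_out


def can_reset(inputs_ok: bool, reset_pressed: bool,
              k3_dropped_out: bool, k4_dropped_out: bool) -> bool:
    """The relay accepts a reset only with healthy inputs and a closed loop."""
    return (inputs_ok and reset_pressed
            and reset_loop_closed(k3_dropped_out, k4_dropped_out))


# Normal case: both contactors released after the last stop.
print(can_reset(True, True, k3_dropped_out=True, k4_dropped_out=True))

# K3 has welded closed: K4 still stopped the machine, but the open
# feedback loop blocks the next reset, which reveals the fault.
print(can_reset(True, True, k3_dropped_out=False, k4_dropped_out=True))
```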

One more interesting point – did you notice that there is a “zone e-stop” included in the gate interlock? If you look immediately below the central safety relay and a little to the left you will find an emergency stop device. This device is wired in series with the gate interlock, so activating it will drop out K3 and K4 but not disturb the operation of the rest of the machine. The safety relay can’t distinguish between the e-stop button and the gate interlocks, so if annunciation is needed, you may want to use a third contact on the e-stop device to connect to a PLC input for this purpose.

Safety Mat Subsystem

The safety mat subsystem is located on the right side of the diagram and uses a second 440R-D22R2 relay. Safety mats can be either single or dual channel in design. The mat shown in this drawing is a dual-channel type. Stepping on the mat causes the conductive layers in the mat to touch, shorting Channel 1 to Channel 2. This creates an input fault that will be detected by the 440R relay. The fault condition will cause the output of the relay to open, stopping the machine.

Safety mats can be damaged reasonably easily, and the circuit design shown will detect shorts or opens within the mat and will prevent the hazardous motion from starting or continuing.

The output contactors, K5 and K6, are monitored by the relay reset loop connected to S34 and the +V rail.

This circuit also includes a conventional start-stop circuit that doesn’t rely on the safety relay.

One more thing – just like the gate interlock circuit, this circuit also includes a “zone e-stop”. Look below and to the left of the safety mat relay. As with the gate interlock, pressing this button will drop out K5 and K6, stopping the same motions protected by the safety mat. Since the relay can’t tell the difference between the e-stop button and the mat being activated, you may want to use the same approach and add a third contact to the e-stop button, connecting it to the PLC for annunciation.

Component Selection

The components used in the circuit are critical to the final PL rating of the design. The final PL of the design depends on the MTTFd of the components used in each channel. No knowledge of the internal construction of the safety relays is needed, because the relays come with a PL rating from the manufacturer. They can be treated as a subsystem unto themselves. The selection of the input and output devices is then the significant factor. Component data sheets can be downloaded from the Rockwell site if you want to dig a bit deeper.
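For electromechanical input and output devices, the manufacturer usually gives a B10d value rather than an MTTFd, so you have to convert it using the expected number of operations per year. Here is a minimal sketch of that arithmetic in Python, using the conversion and the parts-count summation as I understand them from Annexes C and D of the standard. The B10d values and the duty cycle are hypothetical numbers chosen for illustration, not data for any real product, and the function names are my own.

```python
SECONDS_PER_HOUR = 3600


def ops_per_year(days_per_year, hours_per_day, cycle_time_s):
    """Mean number of operations per year (n_op)."""
    return days_per_year * hours_per_day * SECONDS_PER_HOUR / cycle_time_s


def mttfd_from_b10d(b10d_cycles, n_op):
    """MTTFd in years from B10d and the yearly number of operations."""
    return b10d_cycles / (0.1 * n_op)


def channel_mttfd(part_mttfds):
    """Parts-count method: the failure rates of the parts in a channel add."""
    return 1.0 / sum(1.0 / m for m in part_mttfds)


def mttfd_denotation(years):
    """Classify a channel MTTFd into the low/medium/high ranges."""
    if years < 3:
        return "Not acceptable"
    if years < 10:
        return "Low"
    if years < 30:
        return "Medium"
    return "High"


# Hypothetical duty: 240 days/year, 16 hours/day, one operation every 120 s.
n_op = ops_per_year(240, 16, 120)

# Hypothetical B10d values, illustration only.
gate_switch = mttfd_from_b10d(2_000_000, n_op)
contactor = mttfd_from_b10d(1_000_000, n_op)

channel = channel_mttfd([gate_switch, contactor])
print(f"n_op = {n_op:.0f} operations/year")
print(f"Channel MTTFd = {channel:.1f} years ({mttfd_denotation(channel)})")
```

Keep in mind that the standard caps the MTTFd you can claim for a channel, so a very large calculated value does not buy you anything beyond “high”.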

What did you think about this article? What questions came to mind that weren’t answered for you? I look forward to hearing your thoughts and questions!

Copyright secured by Digiprove © 2011-2014
Acknowledgements: ISO for excerpts from ISO 13849-1 and more...
Some Rights Reserved