ISO 13849-1 Analysis — Part 6: CCF — Common Cause Failures

This entry is part 6 of 6 in the series How to do a 13849-1 analysis

What is a Common Cause Failure?

There are two similar-sounding terms that people often get confused: Common Cause Failure (CCF) and Common Mode Failure. While these two types of failures sound similar, they are different. A Common Cause Failure is a failure in a system where two or more portions of the system fail at the same time from a single common cause. An example could be a lightning strike that causes a contactor to weld and simultaneously takes out the safety relay processor that controls the contactor. Common cause failures are therefore two different manners of failure in two different components, but with a single cause.

Common Mode Failure is where two components or portions of a system fail in the same way, at the same time. For example, two interposing relays both fail with welded contacts at the same time. The failures could result from the same cause or from different causes, but the way the components fail is the same.

Common-cause failure includes common mode failure, since a common cause can result in a common manner of failure in identical devices used in a system.
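The distinction between the two terms can be sketched in a few lines of Python. This is only an illustration of the informal descriptions above: the `Failure` record, the function names, and the example causes for the relays ("overload", "contact wear") are assumptions made for the sketch, and the check does not model the formal requirement that the failures are not consequences of each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    component: str  # item that failed
    mode: str       # manner in which it failed (failure mode)
    cause: str      # initiating event


def _distinct_items(failures):
    # Both terms describe failures of two or more different items.
    return len({f.component for f in failures}) >= 2


def is_common_cause(failures):
    # Different items fail as the result of a single shared cause.
    return _distinct_items(failures) and len({f.cause for f in failures}) == 1


def is_common_mode(failures):
    # Different items fail in the same manner, whatever the cause(s).
    return _distinct_items(failures) and len({f.mode for f in failures}) == 1


# Lightning example from the text: one cause, two different failure modes.
lightning = [
    Failure("contactor", "welded contacts", "lightning strike"),
    Failure("safety relay processor", "processor destroyed", "lightning strike"),
]

# Relay example from the text; the two causes are hypothetical.
relays = [
    Failure("interposing relay 1", "welded contacts", "overload"),
    Failure("interposing relay 2", "welded contacts", "contact wear"),
]

print(is_common_cause(lightning), is_common_mode(lightning))  # True False
print(is_common_cause(relays), is_common_mode(relays))        # False True
```

If both relays had welded because of the same overload, both functions would return True, which is the overlap between the two terms described above.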

Here are the formal definitions of these terms:

3.1.6 common cause failure CCF

failures of different items, resulting from a single event, where these failures are not consequences of each other

Note 1 to entry: Common cause failures should not be confused with common mode failures (see ISO 12100:2010, 3.36). [SOURCE: IEC 60050-191-am1:1999, 04-23.] [1]

 

3.36 common mode failures

failures of items characterized by the same fault mode

NOTE Common mode failures should not be confused with common cause failures, as the common mode failures can result from different causes. [IEV 191-04-24] [3]

The “common mode” failure definition uses the phrase “fault mode”, so let’s look at that as well:

failure mode
DEPRECATED: fault mode
manner in which failure occurs

Note 1 to entry: A failure mode may be defined by the function lost or other state transition that occurred. [IEV 192-03-17] [17]

As you can see, “fault mode” is no longer used, in favour of the more common “failure mode”, so it is possible to re-write the common-mode failure definition to read, “failures of items characterised by the same manner of failure.”

Random, Systematic and Common Cause Failures

Why do we need to care about this? There are three manners in which failures occur: random failures, systematic failures, and common cause failures. When developing safety related controls, we need to consider all three and mitigate them as much as possible.

Random failures do not follow any pattern, occurring randomly over time, and are often brought on by over-stressing the component, or from manufacturing flaws. Random failures can increase due to environmental or process-related stresses, like corrosion, EMI, normal wear-and-tear, or other over-stressing of the component or subsystem. Random failures are often mitigated through selection of high-reliability components [18].

Systematic failures include common-cause failures, and occur because some human behaviour occurred that was not caught by procedural means. These failures are due to design, specification, operating, maintenance, and installation errors. When we look at systematic errors, we are looking for things like training of the system designers, or quality assurance procedures used to validate the way the system operates. Systematic failures are non-random and complex, making them difficult to analyse statistically. Systematic errors are a significant source of common-cause failures because they can affect redundant devices, and because they are often deterministic, occurring whenever a set of circumstances exist.

Systematic failures include many types of errors, such as:

  • Manufacturing defects, e.g., software and hardware errors built into the device by the manufacturer.
  • Specification mistakes, e.g. incorrect design basis and inaccurate software specification.
  • Implementation errors, e.g., improper installation, incorrect programming, interface problems, and not following the safety manual for the devices used to realise the safety function.
  • Operation and maintenance, e.g., poor inspection, incomplete testing and improper bypassing [18].

Diverse redundancy is commonly used to mitigate systematic failures, since differences in component or subsystem design tend to create non-overlapping systematic failures, reducing the likelihood of a common error creating a common-mode failure. Errors in specification, implementation, operation and maintenance are not affected by diversity.

Fig 1 below shows the results of a small study done by the UK’s Health and Safety Executive in 1994 [19] that supports the idea that systematic failures are a significant contributor to safety system failures. The study included only 34 systems (n=34), so the results cannot be considered conclusive. However, there were some startling results. As you can see, errors in the specification of the safety functions (Safety Requirement Specification) resulted in about 44% of the system failures in the study. Based on this small sample, systematic failures appear to be a significant source of failures.

Pie chart illustrating the proportion of failures in each phase of the life cycle of a machine, based on data taken from HSE Report HSG238.
Figure 1 – HSG 238 Primary Causes of Failure by Life Cycle Stage

Handling CCF in ISO 13849-1

Now that we understand WHAT Common-Cause Failure is, and WHY it’s important, we can talk about HOW it is handled in ISO 13849-1. Since ISO 13849-1 is intended to be a simplified functional safety standard, CCF analysis is limited to a checklist in Annex F, Table F.1. Note that Annex F is informative, meaning that it is guidance material to help you apply the standard. Since this is the case, you could use any other means suitable for assessing CCF mitigation, like those in IEC 61508, or in other standards.

Table F.1 is set up with a series of mitigation measures which are grouped together in related categories. Each group is provided with a score that can be claimed if you have implemented the mitigations in that group. ALL OF THE MEASURES in each group must be fulfilled in order to claim the points for that category. Here’s an example:

A portion of ISO 13849-1 Table F.1.
ISO 13849-1:2015, Table F.1 Excerpt

In order to claim the 20 points available for the use of separation or segregation in the system design, there must be a separation between the signal paths. Several examples of this are given for clarity.

Table F.1 lists six groups of mitigation measures. In order to claim adequate CCF mitigation, a minimum score of 65 points must be achieved. Only Category 2, 3 and 4 architectures are required to meet the CCF requirements in order to claim the PL, but without meeting the CCF requirement you cannot claim the PL, regardless of whether the design meets the other criteria or not.
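The scoring rules described above (all-or-nothing points per group, a 65-point minimum, and applicability only to Category 2, 3 and 4) can be sketched as a small Python checklist. This is a minimal sketch, not the standard's method: the class and function names are hypothetical, and the group names, measures, and point values below are placeholders that are NOT taken from Table F.1. Use the actual table when scoring a real design.

```python
from dataclasses import dataclass

CCF_MIN_SCORE = 65                        # minimum total stated in the text
CCF_REQUIRED_CATEGORIES = {"2", "3", "4"}  # categories that must meet CCF


@dataclass
class MeasureGroup:
    name: str
    points: int
    measures: dict  # measure description -> True if fulfilled

    def claimed_points(self):
        # Points may be claimed only if ALL measures in the group are fulfilled.
        fulfilled = bool(self.measures) and all(self.measures.values())
        return self.points if fulfilled else 0


def ccf_score(groups):
    return sum(g.claimed_points() for g in groups)


def ccf_requirement_met(category, groups):
    if category not in CCF_REQUIRED_CATEGORIES:
        return True  # Category B and 1: CCF is not relevant
    return ccf_score(groups) >= CCF_MIN_SCORE


# Six placeholder groups; names and points are illustrative only.
groups = [
    MeasureGroup("group 1", 10, {"measure 1a": True}),
    MeasureGroup("group 2", 20, {"measure 2a": True, "measure 2b": True}),
    MeasureGroup("group 3", 20, {"measure 3a": True}),
    MeasureGroup("group 4", 10, {"measure 4a": False}),
    MeasureGroup("group 5", 10, {"measure 5a": True}),
    MeasureGroup("group 6", 30, {"measure 6a": True, "measure 6b": False}),
]

print(ccf_score(groups))                 # 60
print(ccf_requirement_met("3", groups))  # False
print(ccf_requirement_met("B", groups))  # True
```

Note how group 6 contributes nothing even though one of its two measures is fulfilled. That all-or-nothing rule is what drops this example below the 65-point minimum.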

One final note on CCF: If you are trying to review an existing control system, say in an existing machine, or in a machine designed by a third party where you have no way to determine the experience and training of the designers or the capability of the company’s change management process, then you cannot adequately assess CCF [8]. This fact is recognised in CSA Z432-16 [20], chapter 8, which allows the reviewer to simply verify that the architectural requirements, exclusive of any probabilistic requirements, have been met. This is particularly useful for engineers reviewing machinery under Ontario’s Pre-Start Health and Safety requirements [21], who are frequently working with less-than-complete design documentation.

In case you missed the first part of the series, you can read it here. In the next article in this series, I’m going to review the process flow for system analysis as currently outlined in ISO 13849-1. Watch for it!

Book List

Here are some books that I think you may find helpful on this journey:

[0]     B. Main, Risk Assessment: Basics and Benchmarks, 1st ed. Ann Arbor, MI USA: DSE, 2004.

[0.1]  D. Smith and K. Simpson, Safety critical systems handbook. Amsterdam: Elsevier/Butterworth-Heinemann, 2011.

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

[0.3]  Overview of techniques and measures related to EMC for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2013.

References

Note: This reference list starts in Part 1 of the series, so “missing” references may show in other parts of the series. The complete reference list is included in the last post of the series.

[1]     Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design. 3rd Edition. ISO Standard 13849-1. 2015.

[2]     Safety of machinery — Safety-related parts of control systems — Part 2: Validation. 2nd Edition. ISO Standard 13849-2. 2012.

[3]      Safety of machinery — General principles for design — Risk assessment and risk reduction. ISO Standard 12100. 2010.

[8]     S. Jocelyn, J. Baudoin, Y. Chinniah, and P. Charpentier, “Feasibility study and uncertainties in the validation of an existing safety-related control circuit with the ISO 13849-1:2006 design standard,” Reliab. Eng. Syst. Saf., vol. 121, pp. 104–112, Jan. 2014.

[17]      “failure mode”, 192-03-17, International Electrotechnical Vocabulary. IEC International Electrotechnical Commission, Geneva, 2015.

[18]      M. Gentile and A. E. Summers, “Common Cause Failure: How Do You Manage Them?,” Process Saf. Prog., vol. 25, no. 4, pp. 331–338, 2006.

[19]     Out of Control—Why control systems go wrong and how to prevent failure, 2nd ed. Richmond, Surrey, UK: HSE Health and Safety Executive, 2003.

[20]     Safeguarding of Machinery. 3rd Edition. CSA Standard Z432. 2016.

[21]     O. Reg. 851, INDUSTRIAL ESTABLISHMENTS. Ontario, Canada, 1990.

31-Dec-2011 – Are YOU ready?

This entry is part 8 of 8 in the series Circuit Architectures Explored

31-December-2011 marks a key milestone for machine builders marketing their products in the European Union, the EEA and many of the Candidate States. Functional Safety takes a positive step forward with the mandatory application of EN ISO 13849-1 and -2. As of 1-January-2012, the safety-related parts of the control systems on all machinery bearing a CE Mark will be required to meet these standards.

This change started six years ago, when these standards were first harmonized under the Machinery Directive. After the original mandatory implementation date of 31-Dec-08 met with much opposition, the EC Machinery Committee gave machine builders an additional three years to make the transition to these standards.

If you aren’t aware of these standards, or if you aren’t familiar with the concept of functional safety, you need to get up to speed, and fast.

Under EN 954-1:1995 and the 1st Edition of ISO 13849-1, published in 1999, a designer needed to select a design Category, or architecture, that would provide the degree of fault tolerance and reliability needed based on the outcome of the risk assessment for the machinery. The Categories, B and 1 to 4, remain unchanged in the 2nd Edition. I’ve talked about the Categories in detail in other posts, so I won’t spend any time on them here.

The 2nd Edition brings Mean Time to Failure into the picture, along with Diagnostic Coverage and Common Cause Failures. These new concepts require designers to use more analytical techniques in developing their designs, and also require additional documentation (as usual!).

One of the main failings with EN 954-1 was Validation. This topic was supposed to have been covered by EN 954-2, but this standard was never published. This has led machine builders to make design decisions without keeping the necessary design documentation trail, and furthermore, to skip the Validation step entirely in many cases.

The missing Validation standard was finally published in 2003 as ISO 13849-2:2003, and subsequently adopted and harmonized in 2009 as EN ISO 13849-2:2003. While no mandatory implementation date for this standard is given in the current list of standards harmonized under 2006/42/EC-Machinery, use of Part 1 of the standard mandates use of Part 2, so this standard is effectively mandatory at the same time.

Part 2 brings a number of key annexes that are necessary for the implementation of Part 1, and also outlines the complete documentation trail needed for validation, and coincidentally, audit. Notified bodies will be looking for this information when evaluating the content of Technical Files used in CE Marking.

From a North American perspective, these two standards gain access through ANSI’s adoption of ISO 10218 for Industrial Robots. Part 1 of this standard, covering the robot itself, was adopted last year. Part 2 of the standard will be adopted in 2012, and RIA R15.06 will be withdrawn. At the same time, CSA will be adopting the ISO standards and withdrawing CSA Z434.

These changes will finally bring North America, the International Community and the EU onto the same footing when it comes to Functional Safety in industrial machinery applications. The days of “SIMPLE, SINGLE CHANNEL, SINGLE CHANNEL-MONITORED and CONTROL RELIABLE” are numbered.

Are you ready?

Compliance InSight Consulting will be offering a series of training events in 2012 on this topic. For more information, contact Doug Nix.

Interlock Architectures Pt. 6 – Comparing North American and International Systems

This entry is part 6 of 8 in the series Circuit Architectures Explored

I’ve now written six posts, including this one, on the topic of circuit architectures for the safety-related parts of control systems. In this post, we’ll compare the International and North American systems. This comparison is not intended to draw conclusions about which is “better”, but rather to compare and contrast the two systems so that designers can clearly see where the overlaps and the gaps in the systems exist.

Since we’ve spent a lot of time talking about ISO 13849-1 [1] in the previous five posts in this series, I think we should begin there by looking at Table 10 from the standard.

Table 10 — Summary of requirements for categories

Category B (see 6.2.3)
Summary of requirements: SRP/CS and/or their protective equipment, as well as their components, shall be designed, constructed, selected, assembled and combined in accordance with relevant standards so that they can withstand the expected influence. Basic safety principles shall be used.
System behaviour: The occurrence of a fault can lead to the loss of the safety function.
Principle used to achieve safety: Mainly characterized by selection of components
MTTFd of each channel: Low to medium
DCavg: None
CCF: Not relevant

Category 1 (see 6.2.4)
Summary of requirements: Requirements of B shall apply. Well-tried components and well-tried safety principles shall be used.
System behaviour: The occurrence of a fault can lead to the loss of the safety function but the probability of occurrence is lower than for category B.
Principle used to achieve safety: Mainly characterized by selection of components
MTTFd of each channel: High
DCavg: None
CCF: Not relevant

Category 2 (see 6.2.5)
Summary of requirements: Requirements of B and the use of well-tried safety principles shall apply. Safety function shall be checked at suitable intervals by the machine control system.
System behaviour: The occurrence of a fault can lead to the loss of the safety function between the checks. The loss of safety function is detected by the check.
Principle used to achieve safety: Mainly characterized by structure
MTTFd of each channel: Low to high
DCavg: Low to medium
CCF: See Annex F

Category 3 (see 6.2.6)
Summary of requirements: Requirements of B and the use of well-tried safety principles shall apply. Safety-related parts shall be designed, so that
  • a single fault in any of these parts does not lead to the loss of the safety function, and
  • whenever reasonably practicable, the single fault is detected.
System behaviour: When a single fault occurs, the safety function is always performed. Some, but not all, faults will be detected. Accumulation of undetected faults can lead to the loss of the safety function.
Principle used to achieve safety: Mainly characterized by structure
MTTFd of each channel: Low to high
DCavg: Low to medium
CCF: See Annex F

Category 4 (see 6.2.7)
Summary of requirements: Requirements of B and the use of well-tried safety principles shall apply. Safety-related parts shall be designed, so that
  • a single fault in any of these parts does not lead to a loss of the safety function, and
  • the single fault is detected at or before the next demand upon the safety function, but that if this detection is not possible, an accumulation of undetected faults shall not lead to the loss of the safety function.
System behaviour: When a single fault occurs the safety function is always performed. Detection of accumulated faults reduces the probability of the loss of the safety function (high DC). The faults will be detected in time to prevent the loss of the safety function.
Principle used to achieve safety: Mainly characterized by structure
MTTFd of each channel: High
DCavg: High including accumulation of faults
CCF: See Annex F

NOTE For full requirements, see Clause 6.
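
For designers who want to check a proposed architecture against Table 10 programmatically, the last three columns and the single-fault behaviour can be captured as a simple lookup. This is a rough sketch in Python: the dictionary layout, key names, and helper functions are assumptions made for illustration, and the values are transcribed from the table as reproduced in this post, not from the standard itself.

```python
# Table 10 columns as reproduced above; key names are hypothetical.
TABLE_10 = {
    "B": {"mttfd": "Low to medium", "dc_avg": "None",
          "ccf": "Not relevant", "single_fault_tolerant": False},
    "1": {"mttfd": "High", "dc_avg": "None",
          "ccf": "Not relevant", "single_fault_tolerant": False},
    "2": {"mttfd": "Low to high", "dc_avg": "Low to medium",
          "ccf": "See Annex F", "single_fault_tolerant": False},
    "3": {"mttfd": "Low to high", "dc_avg": "Low to medium",
          "ccf": "See Annex F", "single_fault_tolerant": True},
    "4": {"mttfd": "High", "dc_avg": "High including accumulation of faults",
          "ccf": "See Annex F", "single_fault_tolerant": True},
}


def ccf_assessment_required(category):
    # CCF is "Not relevant" for B and 1, and refers to Annex F otherwise.
    return TABLE_10[category]["ccf"] != "Not relevant"


def categories_tolerating_single_fault():
    # Categories whose safety function is still performed after one fault.
    return [c for c, row in TABLE_10.items() if row["single_fault_tolerant"]]


print(ccf_assessment_required("1"))          # False
print(ccf_assessment_required("2"))          # True
print(categories_tolerating_single_fault())  # ['3', '4']
```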

Table 10 summarizes all the key requirements for the five categories of architecture, giving the fundamental mechanism for achieving safety, the required MTTFd, DC and CCF. Note that fault exclusion can be used in Categories 3 and 4. There is no similar table available for CSA Z432 [2] or RIA R 15.06 [3], so I have constructed one following a similar format to Table 10.

Summary of requirements for CSA Z432 / Z434 and RIA R15.06
Standards compared: CSA Z432-04 / Z434-03 and RIA R15.06 1999

All categories
CSA, summary of requirements: Safety control systems (electric, hydraulic, pneumatic) shall meet one of the performance criteria listed in Clauses 4.5.2 to 4.5.5.
RIA, summary of requirements: Safety circuits (electric, hydraulic, pneumatic) shall meet one of the performance criteria listed in 4.5.1 through 4.5.4. (See footnote 2.)
Footnote 2: These performance criteria are not to be confused with the European categories B to 3 as described in ISO/IEC DIS 13849-1, Safety of machinery – Safety-related parts of control systems – Part 1: General principles for design (in correlation with EN 954-1.) They are different. The committee believes that the criteria in 4.5.1-4.5.4 exceed the criteria of B – 3 respectively, and further believe the reverse is not true.

SIMPLE
CSA, summary of requirements: Simple safety control systems shall be designed and constructed using accepted single channel circuitry. Such systems may be programmable.
Note: This type of system should be used for signalling and annunciation purposes only.
System behaviour: The occurrence of a fault can lead to the loss of the safety function.
Principle used to achieve safety: Mainly characterized by component selection.
RIA, summary of requirements: Simple safety circuits shall be designed and constructed using accepted single channel circuitry, and may be programmable.

SINGLE CHANNEL
CSA, summary of requirements: Single channel safety control systems shall
a) be hardware based or comply with Clause 6.5;
b) include components that should be safety rated; and
c) be used in accordance with manufacturers’ recommendations and proven circuit designs (e.g., a single channel electromechanical positive break device that signals a stop in a de-energized state).
Note: In this type of system a single component failure can lead to the loss of the safety function.
System behaviour: The occurrence of a fault can lead to the loss of the safety function.
Principle used to achieve safety: Mainly characterized by component selection.
RIA, summary of requirements: Single channel safety circuits shall be hardware based or comply with 6.4, include components which should be safety rated, be used in compliance with manufacturers’ recommendations and proven circuit designs (e.g. a single channel electro-mechanical positive break device which signals a stop in a de-energized state.)

SINGLE CHANNEL WITH MONITORING
CSA, summary of requirements: Single channel safety control systems with monitoring shall include the requirements for single channel, be safety rated, and be checked (preferably automatically) at suitable intervals in accordance with the following:
a) The check of the safety function(s) shall be performed
  i) at machine start-up; and
  ii) periodically during operation (preferably at each change in state).
b) The check shall either
  i) allow operation if no faults have been detected; or
  ii) generate a stop if a fault is detected. A warning shall be provided if a hazard remains after cessation of motion.
c) The check itself shall not cause a hazardous situation.
d) Following detection of a fault, a safe state shall be maintained until the fault is cleared.
Note: In this type of circuit a single component failure can also lead to the loss of the safety function.
System behaviour: The occurrence of a fault can lead to the loss of the safety function.
Principle used to achieve safety: Characterized by both component selection and structure.
RIA, summary of requirements: Single channel with monitoring safety circuits shall include the requirements for single channel, shall be safety rated, and shall be checked (preferably automatically) at suitable intervals.
a) The check of the safety function(s) shall be performed
  1) at machine start-up, and
  2) periodically during operation;
b) The check shall either:
  1) allow operation if no faults have been detected, or
  2) generate a stop signal if a fault is detected. A warning shall be provided if a hazard remains after cessation of motion;
c) The check itself shall not cause a hazardous situation;
d) Following detection of a fault, a safe state shall be maintained until the fault is cleared.

CONTROL RELIABLE
CSA, summary of requirements: Control reliable safety control systems shall be dual channel with monitoring and shall be designed, constructed, and applied such that any single component failure, including monitoring, shall not prevent the stopping action of the robot. These safety control systems shall be hardware based or in accordance with Clause 6.5. The systems shall include automatic monitoring at the system level conforming to the following:
a) The monitoring shall generate a stop if a fault is detected. A warning shall be provided if a hazard remains after cessation of motion.
b) Following detection of a fault, a safe state shall be maintained until the fault is cleared.
c) Common mode failures shall be taken into account when the probability of such a failure occurring is significant.
d) The single fault should be detected at time of failure. If not practicable, the failure shall be detected at the next demand upon the safety function.
e) These safety control systems shall be independent of the normal program control (function) and shall be designed to be not easily defeated or not easily bypassed without detection.
System behaviour: When a single fault occurs, the safety function is always performed. Some, but not all, faults will be detected. Accumulation of undetected faults can lead to the loss of the safety function.
Principle used to achieve safety: Characterized primarily by structure.
RIA, summary of requirements: Control reliable safety circuitry shall be designed, constructed and applied such that any single component failure shall not prevent the stopping action of the robot. These circuits shall be hardware based or comply with 6.4, and include automatic monitoring at the system level.
a) The monitoring shall generate a stop signal if a fault is detected. A warning shall be provided if a hazard remains after cessation of motion;
b) Following detection of a fault, a safe state shall be maintained until the fault is cleared.
c) Common mode failures shall be taken into account when the probability of such a failure occurring is significant.
d) The single fault should be detected at time of failure. If not practicable, the failure shall be detected at the next demand upon the safety function.

CSA Z434 vs. RIA R15.06

Before we dig into the comparison between North America and the International standards, we need to look at the differences between CSA and ANSI/RIA. There are some subtle differences here that can trip you up and cost significant money to correct after the fact. The following statements are based on my personal experience and on discussions that I have had with people on both the CSA and RIA technical committees tasked with writing these standards. One more note – ANSI RIA R15.06 has been revised and ALL OF SECTION 4 has been replaced with ANSI/RIA/ISO 10218-1 [7]. This is very significant, but we need to deal with this old discussion first.

Systems vs. Circuits

The CSA standard uses the term “control system(s)” throughout the definitions of the categories, while the ANSI/RIA standard uses the term “circuit(s)”. This is really the crux of the discussion between these two standards. While the difference between the terms may seem insignificant at first, you need to understand the background to get the difference.

The CSA term requires two separate sensing devices on the gate or other guard, just as the Category 3 and 4 definitions do, and for the same reason. The CSA committee felt that it was important to be able to detect all single faults, including mechanical ones. Also, the use of two interlocking devices on the guard makes it more difficult to bypass the interlock.

The RIA term requires redundant electrical connections to the interlocking device, but implicitly allows for a single interlocking device because it only explicitly refers to “circuits”.

The explanation I’ve been given for the discrepancy is rooted in the early days of industrial robotics. Many early robot cells had NO interlocks on the guarding because the hazards related to the robot motion were not well understood. There were a number of incidents resulting in fatalities that drove robot users to begin to seek better ways to protect workers. The RIA R15.06 committee decided that interlocks were needed, but there was a recognition that many users would balk at installing expensive interlock devices, so they compromised and allowed that ANY kind of interlocking device was better than none. This was amended in the 1999 edition to require that components be “safety rated”, effectively eliminating the use of conventional proximity switches and non-safety-rated limit switches.

The recent revision of ANSI/RIA R15.06 to include ANSI/ISO 10218-1 as a replacement for Section 4 is significant for a couple of reasons: 1) It now means that the robot itself need only meet the ISO standard, instead of both the ISO and the RIA standards; and 2) It brings in the ISO 13849-1 definitions of reliability categories. This means that the US has now officially dropped the “SIMPLE, SINGLE-CHANNEL,” etc. definitions and now uses “Category B, 1, etc.” However, they have only adopted the Edition 1 version of the standard, so none of the PL, MTTFd, etc. calculations have been adopted. This means that the RIA standard is now harmonized to the 1995 edition of EN 954-1. The updates introduced in the 2006 edition may come in subsequent editions of R15.06.

CSA has chosen to reaffirm the 2003 edition of CSA Z434, so the Canadian National Standard continues to refer to the old definitions.

North America vs International Standards

In the description of single-channel systems / circuits under the North American standards you will notice that particular attention is paid to including descriptions of the use of “proven designs” and “positive-break devices”. What the technical committees were referring to are the same “well-tried safety principles” and “well-tried components” as referred to in the International standards, only with less description of what those might be. The only major addition to the definitions is the recommendation to use “safety-rated devices”, which is not included in the International standard. (N.B. The use of the word “should” in the definitions should be understood as a strong recommendation, but not necessarily a mandatory requirement.) Under EN 954-1 [4] and EN 1088 [5] (in the referenced editions, in any case) it was possible to use standard limit switches arranged in a redundant manner and activated using combined positive and non-positive-mode activation. In later editions this changed, and there is now a preference for devices intended for use in safety applications.

Also worth noting is that there is NO allowance for fault exclusion under the CSA standard or the 1999 edition of the ANSI standard.

As far as the RIA committee’s assertion that their definitions are not equivalent to the International standard, and may be superior, I think that there are too many missing qualities in the ANSI standard for that to stand. In any case, this is now moot, since ANSI has adopted EN ISO 13849-1:2006 as a reference to EN ISO 10218-1 [6], replacing Section 4 of ANSI/RIA R15.06-1999.

References

[1] “Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design”, ISO 13849-1, Edition 2, International Organization for Standardization (ISO), Geneva, 2006.

[2] “Safeguarding of machinery”, CSA Z432, Canadian Standards Association (CSA), Toronto, 2004.

[3] “American National Standard for Industrial Robots and Robot Systems — Safety Requirements”, ANSI/RIA R15.06, American National Standards Institute, Inc. (ANSI), Ann Arbor, 1999.

[4] “Safety of machinery — Safety related parts of control systems — Part 1. General principles for design”, EN 954-1, European Committee for Standardization (CEN), Brussels, 1996.

[5] “Safety of machinery — Interlocking devices associated with guards — Principles for design and selection”, EN 1088, CEN, Brussels, 1995.

[6] “Robots and robotic devices — Safety requirements for industrial robots — Part 1: Robots”, EN ISO 10218-1, European Committee for Standardization (CEN), Brussels, 2011.

[7] “Robots for Industrial Environment – Safety Requirements – Part 1 – Robot”, ANSI/RIA/ISO 10218-1, American National Standards Institute, Inc. (ANSI), Ann Arbor, 2007.
