ISO 13849-1 Analysis — Part 6: CCF — Common Cause Failures

This entry is part 6 of 6 in the series How to do a 13849-1 analysis

What is a Common Cause Failure?

There are two similar-sounding terms that people often get confused: Common Cause Failure (CCF) and Common Mode Failure. While these two types of failures sound similar, they are different. A Common Cause Failure is a failure in a system where two or more portions of the system fail at the same time from a single common cause. An example could be a lightning strike that causes a contactor to weld and simultaneously takes out the safety relay processor that controls the contactor. In this example, two different components fail in two different manners, but from a single cause.

Common Mode Failure is where two components or portions of a system fail in the same way, at the same time. For example, two interposing relays both fail with welded contacts at the same time. The failures could result from the same cause or from different causes, but the way the components fail is the same.

Common-cause failure includes common mode failure, since a common cause can result in a common manner of failure in identical devices used in a system.

Here are the formal definitions of these terms:

3.1.6 common cause failure CCF

failures of different items, resulting from a single event, where these failures are not consequences of each other

Note 1 to entry: Common cause failures should not be confused with common mode failures (see ISO 12100:2010, 3.36). [SOURCE: IEC 60050-191-am1:1999, 04-23.] [1]

3.36 common mode failures

failures of items characterized by the same fault mode

NOTE Common mode failures should not be confused with common cause failures, as the common mode failures can result from different causes. [IEV 191-04-24] [3]

The “common mode” failure definition uses the phrase “fault mode”, so let’s look at that as well:

failure mode
DEPRECATED: fault mode
manner in which failure occurs

Note 1 to entry: A failure mode may be defined by the function lost or other state transition that occurred. [IEV 192-03-17] [17]

As you can see, “fault mode” is no longer used, in favour of the more common “failure mode”, so it is possible to re-write the common-mode failure definition to read, “failures of items characterised by the same manner of failure.”
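The distinction between the two terms can be sketched in a few lines of Python. This is only an illustration of the definitions above, not anything taken from the standards: the `Failure` record, its field names, and the cause labels are all hypothetical, and the check ignores the condition that the failures must not be consequences of each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    component: str  # which item failed
    cause: str      # the initiating event (hypothetical label)
    mode: str       # manner in which the failure occurred


def is_common_cause(failures):
    """True when two or more different items fail from one single event."""
    components = {f.component for f in failures}
    causes = {f.cause for f in failures}
    return len(components) >= 2 and len(causes) == 1


def is_common_mode(failures):
    """True when two or more different items fail in the same manner."""
    components = {f.component for f in failures}
    modes = {f.mode for f in failures}
    return len(components) >= 2 and len(modes) == 1


# Lightning strike example: one cause, two different failure modes.
lightning = [
    Failure("contactor", "lightning strike", "welded contacts"),
    Failure("safety relay processor", "lightning strike", "processor destroyed"),
]

# Interposing relay example: same failure mode, causes may differ.
relays = [
    Failure("relay K1", "overcurrent", "welded contacts"),
    Failure("relay K2", "contact wear", "welded contacts"),
]

print(is_common_cause(lightning), is_common_mode(lightning))
print(is_common_cause(relays), is_common_mode(relays))
```

The lightning example comes out as common cause but not common mode, and the relay example as common mode but not common cause. Two identical relays welded by the same overcurrent event would satisfy both checks, which is the overlap described earlier.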

Random, Systematic and Common Cause Failures

Why do we need to care about this? There are three manners in which failures occur: random failures, systematic failures, and common cause failures. When developing safety-related controls, we need to consider all three and mitigate them as much as possible.

Random failures do not follow any pattern, occurring randomly over time, and are often brought on by over-stressing the component or by manufacturing flaws. Random failures can increase due to environmental or process-related stresses, like corrosion, EMI, normal wear-and-tear, or other over-stressing of the component or subsystem. Random failures are often mitigated through selection of high-reliability components [18].

Systematic failures include common-cause failures, and occur because some human behaviour occurred that was not caught by procedural means. These failures are due to design, specification, operating, maintenance, and installation errors. When we look at systematic errors, we are looking for things like training of the system designers, or quality assurance procedures used to validate the way the system operates. Systematic failures are non-random and complex, making them difficult to analyse statistically. Systematic errors are a significant source of common-cause failures because they can affect redundant devices, and because they are often deterministic, occurring whenever a set of circumstances exist.

Systematic failures include many types of errors, such as:

  • Manufacturing defects, e.g., software and hardware errors built into the device by the manufacturer.
  • Specification mistakes, e.g. incorrect design basis and inaccurate software specification.
  • Implementation errors, e.g., improper installation, incorrect programming, interface problems, and not following the safety manual for the devices used to realise the safety function.
  • Operation and maintenance, e.g., poor inspection, incomplete testing and improper bypassing [18].

Diverse redundancy is commonly used to mitigate systematic failures, since differences in component or subsystem design tend to create non-overlapping systematic failures, reducing the likelihood of a common error creating a common-mode failure. Errors in specification, implementation, operation and maintenance are not affected by diversity.

Figure 1 below shows the results of a small study done by the UK’s Health and Safety Executive in 1994 [19] that supports the idea that systematic failures are a significant contributor to safety system failures. The study included only 34 systems (n=34), so the results cannot be considered conclusive. However, there were some startling results. As you can see, errors in the specification of the safety functions (Safety Requirement Specification) accounted for about 44% of the system failures in the study. Based on this small sample, systematic failures appear to be a significant source of failures.

Pie chart illustrating the proportion of failures in each phase of the life cycle of a machine, based on data taken from HSE Report HSG238.
Figure 1 – HSG 238 Primary Causes of Failure by Life Cycle Stage
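To get a feel for why a sample of 34 systems is not conclusive, here is a rough sketch of the uncertainty around the 44% figure. It assumes that 44% of 34 corresponds to about 15 systems (that count is inferred from the percentage, not quoted from the report), and it uses a Wilson score interval, which is my choice of method and not something the HSE study applied.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 44% of 34 systems is roughly 15 systems.
low, high = wilson_interval(15, 34)
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 29% to 61%, so the true share of specification-related failures could plausibly be anywhere from under a third to well over half.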

Handling CCF in ISO 13849-1

Now that we understand WHAT Common-Cause Failure is, and WHY it’s important, we can talk about HOW it is handled in ISO 13849-1. Since ISO 13849-1 is intended to be a simplified functional safety standard, CCF analysis is limited to a checklist in Annex F, Table F.1. Note that Annex F is informative, meaning that it is guidance material to help you apply the standard. Since this is the case, you could use any other means suitable for assessing CCF mitigation, like those in IEC 61508, or in other standards.

Table F.1 is set up with a series of mitigation measures which are grouped together in related categories. Each group is provided with a score that can be claimed if you have implemented the mitigations in that group. ALL OF THE MEASURES in each group must be fulfilled in order to claim the points for that category. Here’s an example:

A portion of ISO 13849-1 Table F.1.
ISO 13849-1:2015, Table F.1 Excerpt

In order to claim the 15 points available for the use of separation or segregation in the system design, there must be a separation between the signal paths. Several examples of this are given for clarity.

Table F.1 lists six groups of mitigation measures. In order to claim adequate CCF mitigation, a minimum score of 65 points must be achieved. Only Category 2, 3 and 4 architectures are required to meet the CCF requirements. For those architectures, you cannot claim the PL without meeting the CCF requirement, regardless of whether the design meets the other criteria or not.
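The scoring process described above can be sketched as a short Python helper. The point values below are recalled from Table F.1 and should be treated as assumptions to check against your own copy of the standard. Groups 3 and 6 are split into two separately scored items each, which is why eight entries appear for six groups. The item names, function names, and the sample design are hypothetical.

```python
# Assumed point values, recalled from ISO 13849-1 Table F.1; verify them
# against your own copy of the standard before relying on them.
TABLE_F1_POINTS = {
    "1 Separation/segregation": 15,
    "2 Diversity": 20,
    "3.1 Overvoltage/overpressure/overcurrent protection": 15,
    "3.2 Well-tried components": 5,
    "4 Assessment/analysis": 5,
    "5 Competence/training": 5,
    "6.1 EMC/contamination": 25,
    "6.2 Other environmental influences": 10,
}

CCF_MINIMUM_SCORE = 65  # minimum score stated in the text above


def ccf_score(measures_met):
    """Sum the points for every item whose measures are ALL fulfilled.

    measures_met maps an item name to a list of booleans, one per measure
    in that item. Partial fulfilment earns no points.
    """
    total = 0
    for item, points in TABLE_F1_POINTS.items():
        checks = measures_met.get(item, [])
        if checks and all(checks):
            total += points
    return total


def ccf_requirement_met(category, score):
    """Categories B and 1 have no CCF requirement; 2, 3 and 4 need >= 65."""
    if category in ("B", "1"):
        return True
    return score >= CCF_MINIMUM_SCORE


# Hypothetical design: diversity is only partly implemented, so it scores 0.
design = {
    "1 Separation/segregation": [True],
    "2 Diversity": [True, False],
    "3.1 Overvoltage/overpressure/overcurrent protection": [True],
    "6.1 EMC/contamination": [True, True],
    "6.2 Other environmental influences": [True],
}

score = ccf_score(design)
print(score, ccf_requirement_met("3", score))
```

With these assumed values the sample design scores 15 + 15 + 25 + 10 = 65 points, which just meets the minimum for a Category 3 architecture. Losing any one of the claimed items would drop it below the threshold.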

One final note on CCF: If you are trying to review an existing control system, say in an existing machine, or in a machine designed by a third party where you have no way to determine the experience and training of the designers or the capability of the company’s change management process, then you cannot adequately assess CCF [8]. This fact is recognised in CSA Z432-16 [20], chapter 8, which allows the reviewer to simply verify that the architectural requirements, exclusive of any probabilistic requirements, have been met. This is particularly useful for engineers reviewing machinery under Ontario’s Pre-Start Health and Safety requirements [21], who are frequently working with less-than-complete design documentation.

In case you missed the first part of the series, you can read it here. In the next article in this series, I’m going to review the process flow for system analysis as currently outlined in ISO 13849-1. Watch for it!

Book List

Here are some books that I think you may find helpful on this journey:

[0]     B. Main, Risk Assessment: Basics and Benchmarks, 1st ed. Ann Arbor, MI USA: DSE, 2004.

[0.1]  D. Smith and K. Simpson, Safety critical systems handbook. Amsterdam: Elsevier/Butterworth-Heinemann, 2011.

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

[0.3]  Overview of techniques and measures related to EMC for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2013.

References

Note: This reference list starts in Part 1 of the series, so “missing” references may show in other parts of the series. The complete reference list is included in the last post of the series.

[1]     Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design. 3rd Edition. ISO Standard 13849-1. 2015.

[2]     Safety of machinery — Safety-related parts of control systems — Part 2: Validation. 2nd Edition. ISO Standard 13849-2. 2012.

[3]      Safety of machinery — General principles for design — Risk assessment and risk reduction. ISO Standard 12100. 2010.

[8]     S. Jocelyn, J. Baudoin, Y. Chinniah, and P. Charpentier, “Feasibility study and uncertainties in the validation of an existing safety-related control circuit with the ISO 13849-1:2006 design standard,” Reliab. Eng. Syst. Saf., vol. 121, pp. 104–112, Jan. 2014.

[17]      “failure mode”, 192-03-17, International Electrotechnical Vocabulary. IEC International Electrotechnical Commission, Geneva, 2015.

[18]      M. Gentile and A. E. Summers, “Common Cause Failure: How Do You Manage Them?,” Process Saf. Prog., vol. 25, no. 4, pp. 331–338, 2006.

[19]     Out of Control—Why control systems go wrong and how to prevent failure, 2nd ed. Richmond, Surrey, UK: HSE Health and Safety Executive, 2003.

[20]     Safeguarding of Machinery. 3rd Edition. CSA Standard Z432. 2016.

[21]     O. Reg. 851, INDUSTRIAL ESTABLISHMENTS. Ontario, Canada, 1990.

31-Dec-2011 – Are YOU ready?

This entry is part 8 of 8 in the series Circuit Architectures Explored

31-December-2011 marks a key milestone for machine builders marketing their products in the European Union, the EEA and many of the Candidate States. Functional Safety takes a positive step forward with the mandatory application of EN ISO 13849-1 and -2. As of 1-January-2012, the safety-related parts of the control systems on all machinery bearing a CE Mark will be required to meet these standards.

This change started six years ago, when these standards were first harmonized under the Machinery Directive. After much opposition to the original mandatory implementation date of 31-Dec-08, the EC Machinery Committee gave machine builders an additional three years to make the transition to these standards.

If you aren’t aware of these standards, or if you aren’t familiar with the concept of functional safety, you need to get up to speed, and fast.

Under EN 954-1:1995 and the 1st Edition of ISO 13849-1, published in 1999, a designer needed to select a design Category, or architecture, that would provide the degree of fault tolerance and reliability needed based on the outcome of the risk assessment for the machinery. The Categories, B, 1-4, remain unchanged in the 2nd Edition. I’ve talked about the Categories in detail in other posts, so I won’t spend any time on them here.

The 2nd Edition brings Mean Time to Failure into the picture, along with Diagnostic Coverage and Common Cause Failures. These new concepts require designers to use more analytical techniques in developing their designs, and also require additional documentation (as usual!).

One of the main failings with EN 954-1 was Validation. This topic was supposed to have been covered by EN 954-2, but this standard was never published. This has led machine builders to make design decisions without keeping the necessary design documentation trail, and furthermore, to skip the Validation step entirely in many cases.

The missing Validation standard was finally published in 2003 as ISO 13849-2:2003, and subsequently adopted and harmonized in 2009 as EN ISO 13849-2:2003. While no mandatory implementation date for this standard is given in the current list of standards harmonized under 2006/42/EC-Machinery, use of Part 1 of the standard mandates use of Part 2, so this standard is effectively mandatory at the same time.

Part 2 brings a number of key annexes that are necessary for the implementation of Part 1, and also outlines the complete documentation trail needed for validation, and coincidentally, audit. Notified bodies will be looking for this information when evaluating the content of Technical Files used in CE Marking.

From a North American perspective, these two standards gain access through ANSI’s adoption of ISO 10218 for Industrial Robots. Part 1 of this standard, covering the robot itself, was adopted last year. Part 2 of the standard will be adopted in 2012, and RIA R15.06 will be withdrawn. At the same time, CSA will be adopting the ISO standards and withdrawing CSA Z434.

These changes will finally bring North America, the International Community and the EU onto the same footing when it comes to Functional Safety in industrial machinery applications. The days of “SIMPLE, SINGLE CHANNEL, SINGLE CHANNEL-MONITORED and CONTROL RELIABLE” are numbered.

Are you ready?

Compliance InSight Consulting will be offering a series of training events in 2012 on this topic. For more information, contact Doug Nix.

Inconsistencies in ISO 13849-1:2006

This entry is part 7 of 8 in the series Circuit Architectures Explored

I’ve written quite a bit recently on the topic of circuit architectures under ISO 13849-1, and one of my readers noticed an inconsistency between the text of the standard and Figure 5, the diagram that shows how the categories can span one or more Performance Levels.

ISO 13849-1 Figure 5
ISO 13849-1, Figure 5: Relationship between Categories, DC, MTTFd and PL

If you look at Category 2 in Figure 5, you will notice that there are TWO bands, one for DCavg LOW and one for DCavg MED. However, reading the text of the definition for Category 2 gives (§6.2.5):

The diagnostic coverage (DCavg) of the total SRP/CS including fault-detection shall be low.

This leaves some confusion, because it appears from the diagram that there are two options for this architecture. This is backed up by the data in Annex K that underlies the diagram.

The same confusion exists in the text describing Category 3, with Figure 5 showing two bands, one for DCavg LOW and one for DCavg MED.

I contacted the ISO TC199 Secretariat, the people responsible for the content of ISO 13849-1, and pointed out this apparent conflict. They responded that they would pass the comment on to the TC for resolution, and would contact me if they needed additional information. As of this writing, I have not heard more.

So what should you do if you are trying to design to this standard? My advice is to follow Figure 5. If you can achieve a DCavg MED in your design, it is completely reasonable to claim a higher PL. Refer to the data in Annex K to see where your design falls once you have completed the MTTFd calculations.
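To see which DCavg band a design falls into, the calculation can be sketched as follows. The weighted-average formula and the band limits (below 60% none, 60% to below 90% low, 90% to below 99% medium, 99% and above high) are recalled from Annex E of the standard and should be treated as assumptions to check against your copy. The function names and the part values are hypothetical.

```python
def dc_avg(channel_parts):
    """Average diagnostic coverage, weighting each part by 1/MTTFd.

    channel_parts is a list of (dc, mttfd_years) tuples, with dc given
    as a fraction between 0 and 1.
    """
    numerator = sum(dc / mttfd for dc, mttfd in channel_parts)
    denominator = sum(1 / mttfd for _, mttfd in channel_parts)
    return numerator / denominator


def dc_band(dc):
    """Map a diagnostic coverage fraction to its band (assumed limits)."""
    if dc < 0.60:
        return "none"
    if dc < 0.90:
        return "low"
    if dc < 0.99:
        return "medium"
    return "high"


# Hypothetical parts: (DC, MTTFd in years)
parts = [(0.99, 50.0), (0.90, 30.0)]
average = dc_avg(parts)
print(f"{average:.4f}", dc_band(average))
```

For these sample parts the average works out to about 0.934, which lands in the medium band. Note how the part with the shorter MTTFd pulls the average toward its own DC value.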

Thanks to Richard Harris and Douglas Florence, both members of the ISO 13849 and IEC 62061 Group on LinkedIn, for bringing this to my attention!

If you are interested in contacting the TC199 Secretariat, you can email the Secretary, Mr. Stephen Kennedy. More details on ISO TC199 can be found on the Technical Committee page on the ISO website.