## ISO 13849-1 Analysis — Part 6: CCF — Common Cause Failures

This entry is part 6 of 6 in the series How to do a 13849-1 analysis

# What is a Common Cause Failure?

There are two similar-sounding terms that people often get confused: Common Cause Failure (CCF) and Common Mode Failure. While these two types of failures sound similar, they are different. A Common Cause Failure is a failure in a system where two or more portions of the system fail at the same time from a single common cause. An example could be a lightning strike that causes a contactor to weld and simultaneously takes out the safety relay processor that controls the contactor. Common cause failures are therefore two different manners of failure in two different components, but with a single cause.

Common Mode Failure is where two components or portions of a system fail in the same way, at the same time. For example, two interposing relays both fail with welded contacts at the same time. The failures could be caused by the same cause or from different causes, but the way the components fail is the same.

Common-cause failure includes common mode failure, since a common cause can result in a common manner of failure in identical devices used in a system.

Here are the formal definitions of these terms:

3.1.6 common cause failure CCF

failures of different items, resulting from a single event, where these failures are not consequences of each other

Note 1 to entry: Common cause failures should not be confused with common mode failures (see ISO 12100:2010, 3.36). [SOURCE: IEC 60050-191-am1:1999, 04-23.] [1]

3.36 common mode failures

failures of items characterized by the same fault mode

NOTE Common mode failures should not be confused with common cause failures, as the common mode failures can result from different causes. [IEV 191-04-24] [3]

The “common mode” failure definition uses the phrase “fault mode”, so let’s look at that as well:

failure mode
DEPRECATED: fault mode
manner in which failure occurs

Note 1 to entry: A failure mode may be defined by the function lost or other state transition that occurred. [IEV 192-03-17] [17]

As you can see, “fault mode” is no longer used, in favour of the more common “failure mode”, so it is possible to re-write the common-mode failure definition to read, “failures of items characterised by the same manner of failure.”

# Random, Systematic and Common Cause Failures

Why do we need to care about this? There are three manners in which failures occur: random failures, systematic failures, and common cause failures. When developing safety related controls, we need to consider all three and mitigate them as much as possible.

Random failures do not follow any pattern, occurring randomly over time, and are often brought on by over-stressing the component, or from manufacturing flaws. Random failures can increase due to environmental or process-related stresses, like corrosion, EMI, normal wear-and-tear, or other over-stressing of the component or subsystem. Random failures are often mitigated through selection of high-reliability components [18].

Systematic failures include common-cause failures, and occur because a human error was not caught by procedural means. These failures are due to design, specification, operating, maintenance, and installation errors. When we look for systematic errors, we are looking at things like the training of the system designers, or the quality assurance procedures used to validate the way the system operates. Systematic failures are non-random and complex, making them difficult to analyse statistically. Systematic errors are a significant source of common-cause failures because they can affect redundant devices, and because they are often deterministic, occurring whenever a particular set of circumstances exists.

Systematic failures include many types of errors, such as:

• Manufacturing defects, e.g., software and hardware errors built into the device by the manufacturer.
• Specification mistakes, e.g., incorrect design basis and inaccurate software specification.
• Implementation errors, e.g., improper installation, incorrect programming, interface problems, and not following the safety manual for the devices used to realise the safety function.
• Operation and maintenance, e.g., poor inspection, incomplete testing and improper bypassing [18].

Diverse redundancy is commonly used to mitigate systematic failures, since differences in component or subsystem design tend to create non-overlapping systematic failures, reducing the likelihood of a common error creating a common-mode failure. Errors in specification, implementation, operation and maintenance are not affected by diversity.

Fig 1 below shows the results of a small study done by the UK’s Health and Safety Executive in 1994 [19] that supports the idea that systematic failures are a significant contributor to safety system failures. The study included only 34 systems (n=34), so the results cannot be considered conclusive. However, there were some startling results. As you can see, errors in the specification of the safety functions (Safety Requirement Specification) resulted in about 44% of the system failures in the study. Based on this small sample, systematic failures appear to be a significant source of failures.

# Handling CCF in ISO 13849-1

Now that we understand WHAT Common-Cause Failure is, and WHY it’s important, we can talk about HOW it is handled in ISO 13849-1. Since ISO 13849-1 is intended to be a simplified functional safety standard, CCF analysis is limited to a checklist in Annex F, Table F.1. Note that Annex F is informative, meaning that it is guidance material to help you apply the standard. Since this is the case, you could use any other means suitable for assessing CCF mitigation, like those in IEC 61508, or in other standards.

Table F.1 is set up with a series of mitigation measures which are grouped together in related categories. Each group is provided with a score that can be claimed if you have implemented the mitigations in that group. ALL OF THE MEASURES in each group must be fulfilled in order to claim the points for that category. Here’s an example:

In order to claim the 20 points available for the use of separation or segregation in the system design, there must be a separation between the signal paths. Several examples of this are given for clarity.

Table F.1 lists six groups of mitigation measures. In order to claim adequate CCF mitigation, a minimum score of 65 points must be achieved. The CCF requirement applies only to Category 2, 3 and 4 architectures, but for those categories you cannot claim the PL without meeting it, regardless of whether the design meets the other criteria.
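The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the all-or-nothing logic: the group names and point values below are placeholders I made up for the example, not the values from Table F.1, and the function names are hypothetical. Take the real groups and scores from the standard.

```python
CCF_MIN_SCORE = 65  # minimum total quoted in the text above

def ccf_score(groups):
    """Sum the points of every group whose measures are ALL fulfilled.

    groups maps a group name to (points, [True/False per measure]).
    A group with any unfulfilled measure contributes nothing.
    """
    total = 0
    for points, measures in groups.values():
        if measures and all(measures):
            total += points
    return total

def ccf_adequate(groups, category):
    """Only Category 2, 3 and 4 architectures carry the CCF requirement."""
    if category not in ("2", "3", "4"):
        return True
    return ccf_score(groups) >= CCF_MIN_SCORE

# Placeholder groups and points, for illustration only.
example = {
    "separation": (20, [True]),
    "diversity": (20, [True, False]),   # one measure missing: no points
    "design": (20, [True, True]),
    "assessment": (5, [True]),
    "training": (5, [True]),
    "environment": (35, [True, True]),
}

print(ccf_score(example), ccf_adequate(example, "3"))
```

Note how the single unfulfilled measure in the placeholder "diversity" group forfeits that group's entire score, which is the behaviour the checklist asks for.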

One final note on CCF: If you are trying to review an existing control system, say in an existing machine, or in a machine designed by a third party where you have no way to determine the experience and training of the designers or the capability of the company’s change management process, then you cannot adequately assess CCF [8]. This fact is recognised in CSA Z432-16 [20], chapter 8. [20] allows the reviewer to simply verify that the architectural requirements, exclusive of any probabilistic requirements, have been met. This is particularly useful for engineers reviewing machinery under Ontario’s Pre-Start Health and Safety requirements [21], who are frequently working with less-than-complete design documentation.

In case you missed the first part of the series, you can read it here. In the next article in this series, I’m going to review the process flow for system analysis as currently outlined in ISO 13849-1. Watch for it!

# Book List

Here are some books that I think you may find helpful on this journey:

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

## References

Note: This reference list starts in Part 1 of the series, so “missing” references may show in other parts of the series. The complete reference list is included in the last post of the series.

[17] “failure mode”, 192-03-17, International Electrotechnical Vocabulary. IEC International Electrotechnical Commission, Geneva, 2015.

[18] M. Gentile and A. E. Summers, “Common Cause Failure: How Do You Manage Them?,” Process Saf. Prog., vol. 25, no. 4, pp. 331–338, 2006.

[19] Out of Control—Why control systems go wrong and how to prevent failure, 2nd ed. Richmond, Surrey, UK: HSE Health and Safety Executive, 2003.

## ISO 13849-1 Analysis — Part 5: Diagnostic Coverage (DC)

This entry is part 5 of 6 in the series How to do a 13849-1 analysis

# What is Diagnostic Coverage?

Understanding Diagnostic Coverage (DC) as it is used in ISO 13849-1 [1] is critical to analysing the design of any safety function assessed using this standard. In case you missed a previous part of the series, you can read it here.

In the last instalment of this series discussing MTTFD, I brought up the fact that everything fails eventually, and so everything has a natural failure rate. The bathtub curve shown at the top of this post shows a typical failure rate curve for most products. Failure rates tell you how often, on average, components or systems fail over time. Failure rates are expressed in many ways, MTTFD and PFHd being the ways relevant to this discussion of ISO 13849 analysis. MTTFD is given in years, and PFHd is given in failures per hour (1/h). As a reminder, PFHd stands for “Probability of dangerous Failure per Hour”.

Three of the standard architectures, Categories 2, 3 and 4, include automatic diagnostic functions. As soon as we add diagnostics to the system, we need to know which faults the diagnostics can detect and what fraction of all the dangerous failures those detectable faults represent. Diagnostic Coverage (DC) is the ratio of the dangerous failures that can be detected to the total dangerous failures that could occur, expressed as a percentage. Failures that do not lead to a dangerous state are excluded from DC because we don’t need to worry about them: if they occur, the system will not fail into a dangerous state.

Here’s the formal definition from [1]:

3.1.26 diagnostic coverage (DC)

measure of the effectiveness of diagnostics, which may be determined as the ratio between the failure rate of detected dangerous failures and the failure rate of total dangerous failures

Note 1 to entry: Diagnostic coverage can exist for the whole or parts of a safety-related system. For example, diagnostic coverage could exist for sensors and/or logic system and/or final elements. [SOURCE: IEC 61508-4:1998, 3.8.6, modified.]

That brings up two other related definitions that need to be kept in mind [1]:

3.1.4 failure

termination of the ability of an item to perform a required function

Note 1 to entry: After a failure, the item has a fault.

Note 2 to entry: “Failure” is an event, as distinguished from “fault”, which is a state.

Note 3 to entry: The concept as defined does not apply to items consisting of software only.

Note 4 to entry: Failures which only affect the availability of the process under control are outside of the scope of this part of ISO 13849. [SOURCE: IEC 60050–191:1990, 04-01.]

and the most important one [1]:

3.1.5 dangerous failure

failure which has the potential to put the SRP/CS in a hazardous or fail-to-function state

Note 1 to entry: Whether or not the potential is realized can depend on the channel architecture of the system; in redundant systems a dangerous hardware failure is less likely to lead to the overall dangerous or fail-to-function state.

Note 2 to entry: [SOURCE: IEC 61508–4, 3.6.7, modified.]

Just as a reminder, SRP/CS stands for “safety-related parts of control systems”.

## Failure Math

### Failure Rate Data Sources

To do any calculations, we need data, and this is true for failure rates as well. ISO 13849-1 provides some tables in the annexes that list some common types of components and their associated failure rates, and there are more failure rate tables in ISO 13849-2. A word of caution here: Do not mix sources of failure rate data, as the conditions under which that data is true won’t match the data in ISO 13849. There are a few good sources of failure rate data out there, for example, MIL-HDBK-217, Reliability Prediction of Electronic Equipment [15], as well as the database maintained by Exida. In any case, use a single source for your failure rate data.

### Failure Rate Variables

IEC 61508 [7] defines a number of variables related to failure rates. The lowercase Greek letter lambda, $\lambda$, is used to denote failures.

The common variable designations used are:

$\lambda$ = failures
$\lambda_{(t)}$= failure rate
$\lambda_s$ = “safe” failures
$\lambda_d$ = “dangerous” failures
$\lambda_{dd}$ = detectable “dangerous” failures
$\lambda_{du}$ = undetectable “dangerous” failures

### Calculating DC

Of these variables, we only need to concern ourselves with $\lambda_d$, $\lambda_{dd}$ and $\lambda_{du}$. To understand how these variables are used, we can express their relationship as

$\lambda_d=\lambda_{dd}+\lambda_{du}$

Following on that idea, the Diagnostic Coverage can be expressed as a percentage like this:

$DC\%=\frac{\lambda_{dd}}{\lambda_d}\times 100$
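As a quick illustration of the two relationships above, here is a minimal Python sketch. The function name and the sample failure rates are hypothetical, chosen only to show the arithmetic.

```python
def diagnostic_coverage(lambda_dd, lambda_du):
    """Return DC in percent from detectable and undetectable
    dangerous failure rates (both in the same unit, e.g. 1/h)."""
    lambda_d = lambda_dd + lambda_du  # total dangerous failure rate
    if lambda_d <= 0:
        raise ValueError("total dangerous failure rate must be positive")
    return lambda_dd / lambda_d * 100.0

# Hypothetical rates: 9 detectable dangerous failures for every
# 1 undetectable dangerous failure gives a DC of about 90%.
print(round(diagnostic_coverage(9e-7, 1e-7), 1))
```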

## Determining DC%

If you want to actually calculate DC%, you have some work ahead of you. Rather than going into the details here, I am going to refer you hardcore types to IEC 61508-2, Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 2: Requirements for electrical/electronic/programmable electronic safety-related systems. This standard goes into some depth on how to determine failure rates and how to calculate the “Safe Failure Fraction,” a number which is related to DC but is not the same.

For everyone else, the good news is that you can use the table in Annex E to estimate the DC%. It’s worth noting here that Annex E is “Informative.” In standards-speak, this means that the information in the annex is not part of the “normative” text, which means that it is simply information to help you use the normative part of the standard. The design must conform to the requirements in the normative text if you want to claim conformity to the standard. The fact that [1, Annex E] is informative gives you the option to calculate the DC% value rather than selecting it from Table E.1. Using the calculated value would not violate the requirements in the normative text.

If you are using IFA SISTEMA [16] to do the calculations for you, you will find that the software limits you to selecting a single DC measure from Table E.1, and this principle applies if you are doing the calculations by hand too. Only one item from Table E.1 can be selected for a given safety function.

## Ranking DC

Once you have determined the DC for a safety function, you need to compare the DC value against [1, Table 5] to see if the DC is sufficient for the PLr you are trying to achieve. Table 5 bins the DC results into four ranges. Just as binning the PFHd values into five ranges helps to prevent precision bias in estimating the probability of failure of the complete system or safety function, the ranges in Table 5 help to prevent precision bias in the calculated or selected DC values.
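The binning can be expressed as a small lookup. The range limits in this sketch (60%, 90% and 99%) are the ones I understand [1, Table 5] to use; they are not quoted in this article, so verify them against your copy of the standard. The function name is hypothetical.

```python
def dc_range(dc_percent):
    """Map a DC percentage onto the four ranges of Table 5.

    Limits assumed here: none < 60 <= low < 90 <= medium < 99 <= high.
    """
    if not 0 <= dc_percent <= 100:
        raise ValueError("DC must be between 0 and 100 percent")
    if dc_percent < 60:
        return "none"
    if dc_percent < 90:
        return "low"
    if dc_percent < 99:
        return "medium"
    return "high"

for value in (45, 60, 92.5, 99):
    print(value, dc_range(value))
```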

If the DC value was high enough for the PLr, then you are done with this part of the work. If not, you will need to go back to your design and add additional diagnostic features so that you can either select a higher coverage from [1, Table E.1] or calculate a higher value using [14].

## Multiple safety functions

When you have multiple safety functions that make up a complete safety system, for example, an emergency stop function and a guard interlocking function, the DC values need to be averaged to determine the overall DC for the complete system. [1, Annex E] provides you with a method to do this in Equation E.1.

Plug in the values for MTTFD and DC for each safety function, and calculate the resulting DCavg value for the complete system.
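Equation E.1 is, as I understand it, a weighted average in which each DC value is weighted by the inverse of the corresponding MTTFD:

$DC_{avg}=\frac{\frac{DC_1}{MTTF_{D1}}+\frac{DC_2}{MTTF_{D2}}+\dots+\frac{DC_N}{MTTF_{DN}}}{\frac{1}{MTTF_{D1}}+\frac{1}{MTTF_{D2}}+\dots+\frac{1}{MTTF_{DN}}}$

A minimal Python sketch of that calculation follows. The function name and the sample values are hypothetical; check the form of the equation against [1, Annex E] before relying on it.

```python
def dc_avg(parts):
    """Average DC weighted by 1/MTTFD.

    parts is a list of (dc_percent, mttfd_years) pairs.
    """
    if not parts:
        raise ValueError("at least one (DC, MTTFD) pair is required")
    numerator = sum(dc / mttfd for dc, mttfd in parts)
    denominator = sum(1.0 / mttfd for _, mttfd in parts)
    return numerator / denominator

# Hypothetical values: the item with the shorter MTTFD dominates the result.
print(round(dc_avg([(90, 50), (60, 100)]), 1))
```

Because the weighting is by 1/MTTFD, the parts that are expected to fail soonest have the largest influence on the average.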

That’s it for this article. The next part will cover Common Cause Failures (CCF). Look for it on 20-Mar-17!

In case you missed the first part of the series, you can read it here.

## References

Note: This reference list starts in Part 1 of the series, so “missing” references may show in other parts of the series. Included in the last post of the series is the complete reference list.

[16] “IFA – Practical aids: Software-Assistent SISTEMA: Safety Integrity – Software Tool for the Evaluation of Machine Applications”, Dguv.de, 2017. [Online]. Available: http://www.dguv.de/ifa/praxishilfen/practical-solutions-machine-safety/software-sistema/index.jsp. [Accessed: 30-Jan-2017].

## ISO 13849-1 Analysis — Part 4: MTTFD – Mean Time to Dangerous Failure

This entry is part 4 of 6 in the series How to do a 13849-1 analysis

Functional safety is all about the likelihood of a safety system failing to operate when you need it. Understanding Mean Time to Dangerous Failure, or MTTFD, is critical. If you have been reading about this topic at all, you may notice that I am abbreviating Mean Time to Dangerous Failure with all capital letters. Using MTTFD is a recent change that occurred in the third edition of ISO 13849-1, published in 2015. In the first and second editions, the correct abbreviation was MTTFd. Onward!

If you missed the third instalment in this series, you can read it here.

## Defining MTTFD

Let’s start by having a look at some key definitions. Looking at [1, Cl. 3], you will find:

3.1.1 safety–related part of a control system (SRP/CS)—part of a control system that responds to safety-related input signals and generates safety-related
output signals

Note 1 to entry: The combined safety-related parts of a control system start at the point where the safety-related input signals are initiated (including, for example, the actuating cam and the roller of the position switch) and end at the output of the power control elements (including, for example, the main contacts of a contactor)

Note 2 to entry: If monitoring systems are used for diagnostics, they are also considered as SRP/CS.

3.1.5 dangerous failure—failure which has the potential to put the SRP/CS in a hazardous or fail-to-function state

Note 1 to entry: Whether or not the potential is realized can depend on the channel architecture of the system; in redundant systems a dangerous hardware failure is less likely to lead to the overall dangerous or fail-to-function state.

Note 2 to entry: [SOURCE: IEC 61508–4, 3.6.7, modified.]

3.1.25 mean time to dangerous failure (MTTFD)—expectation of the mean time to dangerous failure

Definition 3.1.5 is pretty helpful, but definition 3.1.25 is, well, not much of a definition. Let’s look at this another way.

## Failures and Faults

Since everything can and will eventually fail to perform the way we expect it to, we know that everything has a failure rate because everything takes some time to fail. Granted that this time may be very short, like the first time the unit is turned on, or it may be very long, sometimes hundreds of years. Remember that because this is a rate, it is something that occurs over time. It is also important to be clear that we are talking about failures and not faults. Reading from [1]:

3.1.3 fault—state of an item characterized by the inability to perform a required function, excluding the inability during preventive maintenance or other planned actions, or due to lack of external resources

Note 1 to entry: A fault is often the result of a failure of the item itself, but may exist without prior failure.

Note 2 to entry: In this part of ISO 13849, “fault” means random fault.
[SOURCE: IEC 60050-191:1990, 05-01.]

3.1.4 failure— termination of the ability of an item to perform a required function

Note 1 to entry: After a failure, the item has a fault.

Note 2 to entry: “Failure” is an event, as distinguished from “fault”, which is a state.

Note 3 to entry: The concept as defined does not apply to items consisting of software only.

Note 4 to entry: Failures which only affect the availability of the process under control are outside of the scope of this part of ISO 13849.
[SOURCE: IEC 60050–191:1990, 04-01.]

3.1.4 Note 2 is the important one at this point in the discussion.

Now, where we have multiples of something, like relays, valves, or safety systems, we now have a population of identical items, each of which will eventually fail at some point. We can count those failures as they occur and tally them up, and we can graph how many failures we get in the population over time. If this is starting to sound suspiciously like statistics to you, that is because it is.

OK, so let’s look at the kinds of failures that occur in that population. Some failures will result in a “safe” state, e.g., a relay failing with all poles open, and some will fail in a potentially “dangerous” state, like a normally closed valve developing a significant leak. If we tally up all the failures that occur, and then tally the number of “safe” failures and the number of “dangerous” failures in that population, we now have some very useful information.

The different kinds of failures are signified using the lowercase Greek letter $\lambda$ (lambda). We can add some subscripts to help identify what kinds of failures we are talking about. The common variable designations used are [14]:

$\lambda$ = failures
$\lambda_{(t)}$= failure rate
$\lambda_s$ = “safe” failures
$\lambda_d$ = “dangerous” failures
$\lambda_{dd}$ = detectable “dangerous” failures
$\lambda_{du}$ = undetectable “dangerous” failures

I will be discussing some of these variables in more detail in a later part of the series when I delve into Diagnostic Coverage, so don’t worry about them too much just yet.

## Getting to MTTFD

Since we can now start to deal with the failure rate data mathematically, we can start to do some calculations about expected lifetime of a component or a system. That expected, or probable, lifetime is what definition 3.1.25 was on about, and is what we call MTTFD.

MTTFD is the time in years over which the probability of failure is relatively constant. If you look at a typical failure rate curve, called a “bathtub curve” due to its resemblance to the profile of a nice soaker tub, the MTTFD is the flatter portion of the curve between the end of the infant mortality period and the wear-out period at the end of life. This part of the curve is the portion assumed to be included in the “mission time” for the product. ISO 13849-1 assumes the mission time for all machinery is 20 years [1, 4.5.4] and [1, Cl. 10].

ISO 13849-1 provides us with guidance on how MTTFD relates to the determination of the PL in [1, Cl. 4.5.2]. MTTFD is further grouped into three bands as shown in [1, Table 4].

The notes for this table are important as well. Since you can’t read the notes particularly well in the table above, I’ve reproduced them here:

NOTE 1 The choice of the MTTFD ranges of each channel is based on failure rates found in the field as state-of-the-art, forming a kind of logarithmic scale fitting to the logarithmic PL scale. An MTTFD value of each channel less than three years is not expected to be found for real SRP/CS since this would mean that after one year about 30 % of all systems on the market will fail and will need to be replaced. An MTTFD value of each channel greater than 100 years is not acceptable because SRP/CS for high risks should not depend on the reliability of components alone. To reinforce the SRP/CS against systematic and random failure, additional means such as redundancy and testing should be required. To be practicable, the number of ranges was restricted to three. The limitation of MTTFD of each channel values to a maximum of 100 years refers to the single channel of the SRP/CS which carries out the safety function. Higher MTTFD values can be used for single components (see Table D.1).

NOTE 2 The indicated borders of this table are assumed within an accuracy of 5%.
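For anyone scripting this step, the three bands can be sketched as a lookup. The band limits used here (3, 10, 30 and 100 years) are the ones I understand [1, Table 4] to use; only the 3-year and 100-year limits are quoted in the note above, so check the rest against the standard. The function name and the "below range" label are hypothetical.

```python
MAX_CHANNEL_MTTFD = 100  # years; per-channel cap described in NOTE 1

def mttfd_band(mttfd_years):
    """Classify a channel MTTFD (in years) as low, medium or high.

    Assumed limits: 3 <= low < 10 <= medium < 30 <= high <= 100.
    Values above the cap are limited to 100 years before classifying.
    """
    value = min(mttfd_years, MAX_CHANNEL_MTTFD)
    if value < 3:
        return "below range"
    if value < 10:
        return "low"
    if value < 30:
        return "medium"
    return "high"

for years in (2, 5, 10, 30, 250):
    print(years, mttfd_band(years))
```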

The standard then tells us to select the MTTFD using a simple hierarchy [1, 4.5.2]:

For the estimation of MTTFD of a component, the hierarchical procedure for finding data shall be, in the order given:

a) use manufacturer’s data;
b) use methods in Annex C and Annex D;
c) choose 10 years.

Why ten years? Ten years is half of the assumed mission lifetime of 20 years. More on mission lifetime in a later post.

Looking at [1, Annex C.2], you will find the “Good Engineering Practices” method for estimating MTTFD, presuming the manufacturer has not provided you with that information. ISO 13849-2 [2] has some reference tables that provide some general MTTFD values for some kinds of components, but not every part that exists can be listed. How can we deal with parts not listed? [1, Annex C.4] provides us with a calculation method for estimating MTTFD for pneumatic, mechanical and electromechanical components.

### Calculating MTTFD for pneumatic, mechanical and electromechanical components

I need to introduce you to a few more variables before we look at how to calculate MTTFD for a component.

Variables

| Variable | Description |
|---|---|
| B10 | Number of cycles until 10% of the components fail (for pneumatic and electromechanical components) |
| B10D | Number of cycles until 10% of the components fail dangerously (for pneumatic and electromechanical components) |
| T10D | Mean time until 10% of the components fail dangerously |
| hop | Mean operation time, in hours per day |
| dop | Mean operation time, in days per year |
| tcycle | Mean operation time between the beginning of two successive cycles of the component (e.g., switching of a valve), in seconds per cycle |
| s | seconds |
| h | hours |
| a | years |

Knowing a few details we can calculate the MTTFD using [1, Eqn C.1]. We need to know the following parameters for the application:

• B10D
• hop
• dop
• tcycle

In order to use [1, Eqn. C.1], we need to first calculate nop, using [1, Eqn. C.2]:

We may also need one more calculation, [1, Eqn. C.4]:
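Written out, the three relationships referenced above take the following form, as I understand [1, Annex C]. The subscripted symbols correspond to the plain-text variable names used in this article (for example, $n_{op}$ is nop, the mean number of annual operations), and the factor of 3600 converts hours to seconds. Check them against your copy of the standard.

Mean number of annual operations [1, Eqn. C.2]:

$n_{op}=\frac{d_{op}\times h_{op}\times 3600\,\frac{s}{h}}{t_{cycle}}$

Mean time to dangerous failure [1, Eqn. C.1]:

$MTTF_D=\frac{B_{10D}}{0.1\times n_{op}}$

Time until 10% of the components fail dangerously [1, Eqn. C.4]:

$T_{10D}=\frac{B_{10D}}{n_{op}}$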

## Example Calculation [1, C.4.3]

For a pneumatic valve, a manufacturer determines a mean value of 60 million cycles as B10D. The valve is used for two shifts each day on 220 operation days a year. The mean time between the beginning of two successive switching of the valve is estimated as 5 s. This yields the following values:

• dop of 220 days per year;
• hop of 16 h per day;
• tcycle of 5 s per cycle;
• B10D of 60 million cycles.

Doing the math, we get:
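The arithmetic for this example can be checked with a short Python sketch. It assumes the usual forms of the Annex C equations: nop is the annual number of operations, MTTFD is B10D divided by one tenth of nop, and T10D is B10D divided by nop. The function names are hypothetical.

```python
SECONDS_PER_HOUR = 3600

def n_op(d_op, h_op, t_cycle):
    """Mean number of operations per year (cycles/a)."""
    return d_op * h_op * SECONDS_PER_HOUR / t_cycle

def mttf_d(b10d, nop):
    """Mean time to dangerous failure, in years."""
    return b10d / (0.1 * nop)

def t10d(b10d, nop):
    """Time until 10% of the components fail dangerously, in years."""
    return b10d / nop

# Values from the pneumatic valve example above.
nop = n_op(d_op=220, h_op=16, t_cycle=5)
print(f"nop   = {nop:,.0f} cycles per year")
print(f"MTTFD = {mttf_d(60e6, nop):.1f} years")
print(f"T10D  = {t10d(60e6, nop):.1f} years")
```

With these inputs the valve cycles about 2.53 million times a year, giving an MTTFD of roughly 237 years and a T10D of roughly 23.7 years.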

So there you have it, at least for a fairly simple case. There are more examples in ISO 13849-1, and I would encourage you to work through them. You can also find a wealth of examples in a report produced by the BGIA in Germany, called the Functional safety of machine controls (BGIA Report 2/2008e) [16]. The download for the report is linked from the reference list at the end of this article. If you are a SISTEMA user, there are lots of examples in the SISTEMA Cookbooks, and there are example files available so that you can see how to assemble the systems in the software.

The next part of this series covers Diagnostic Coverage (DC), and the average DC for multiple safety functions in a system, DCavg.

In case you missed the first part of the series, you can read it here.

## References

Note: This reference list starts in Part 1 of the series, so “missing” references may show in other parts of the series. Included in the last post of the series is the complete reference list.

[15] “The bathtub curve and product failure behavior part 1 of 2”, Findchart.co, 2017. [Online]. Available: http://findchart.co/download.php?aHR0cDovL3d3dy53ZWlidWxsLmNvbS9ob3R3aXJlL2lzc3VlMjEvaHQyMV8xLmdpZg. [Accessed: 03-Jan-2017].

[16] “Functional safety of machine controls – Application of EN ISO 13849 (BGIA Report 2/2008e)”, dguv.de, 2017. [Online]. Available: http://www.dguv.de/ifa/publikationen/reports-download/bgia-reports-2007-bis-2008/bgia-report-2-2008/index-2.jsp. [Accessed: 04-Jan-2017].