ISO 13849-1 Analysis — Part 4: MTTFD – Mean Time to Dangerous Failure

Post updated 2019-07-24. Ed.

Functional safety is all about the likelihood of a safety system failing to operate when you need it. Understanding Mean Time to Dangerous Failure, or MTTFD, is critical. If you have been reading about this topic, you may notice that I am abbreviating Mean Time to Dangerous Failure with all capital letters. Using MTTFD is a recent change in the third edition of ISO 13849-1 [1], published in 2015. In the first and second editions, the correct abbreviation was MTTFd. Onward!

If you missed the third instalment in this series, read it here.

Defining MTTFD

Let’s start by having a look at some key definitions. Looking at [1, Cl. 3], you will find:

3.1.1
safety-related part of a control system (SRP/CS)

part of a control system that responds to safety-related input signals and generates safety-related output signals

Note 1 to entry: The combined safety-related parts of a control system start at the point where the safety-related input signals are initiated (including, for example, the actuating cam and the roller of the position switch) and end at the output of the power control elements (including, for example, the main contacts of a contactor)

Note 2 to entry: If monitoring systems are used for diagnostics, they are also considered as SRP/CS.

3.1.5
dangerous failure

failure which has the potential to put the SRP/CS in a hazardous or fail-to-function state

Note 1 to entry: Whether or not the potential is realized can depend on the channel architecture of the system; in redundant systems a dangerous hardware failure is less likely to lead to the overall dangerous or fail-to-function state.

Note 2 to entry: [SOURCE: IEC 61508 – 4, 3.6.7, modified.]

3.1.25
mean time to dangerous failure (MTTFD)

expectation of the mean time to dangerous failure

ISO 13849-1:2015

Definition 3.1.5 is helpful, but definition 3.1.25 is not much of a definition. Let’s look at this another way.

Failures and Faults

Since everything can and will eventually fail to perform the way we expect it to, we know that everything has a failure rate because everything takes some time to fail. Granted, this time may be very short, like the first time the unit is turned on, or it may be very long, sometimes hundreds of years. Remember that because this is a rate, it is something that occurs over time. It is also important to be clear that we are talking about failures, not faults. Reading from [1]:

3.1.3
fault
state of an item characterized by the inability to perform a required function, excluding the inability during preventive maintenance or other planned actions, or due to lack of external resources

Note 1 to entry: A fault is often the result of a failure of the item itself, but may exist without prior failure.

Note 2 to entry: In this part of ISO 13849, “fault” means random fault.
[SOURCE: IEC 60050-191:1990, 05-01.]

3.1.4
failure
termination of the ability of an item to perform a required function

Note 1 to entry: After a failure, the item has a fault.

Note 2 to entry: “Failure” is an event, as distinguished from “fault”, which is a state.

Note 3 to entry: The concept as defined does not apply to items consisting of software only.

Note 4 to entry: Failures which only affect the availability of the process under control are outside of the scope of this part of ISO 13849.
[SOURCE: IEC 60050-191:1990, 04-01.]

ISO 13849-1:2015

Definition 3.1.4, Note 2, is important at this point in the discussion.

Now, where we have multiples of something, like relays, valves, or safety systems, we have a population of identical items, each of which will eventually fail. We can count those failures as they occur, tally them up, and graph how many failures we get in the population over time. If this is starting to sound suspiciously like statistics to you, that is because it is.

OK, so let’s look at the kinds of failures in that population. Some failures will result in a “safe” state, e.g., a relay failing with all poles open, and some will fail in a potentially “dangerous” state, like a normally closed valve developing a significant leak. If we tally up all the failures and then tally the number of “safe” failures and the number of “dangerous” failures in that population, we now have some very useful information.

The different failures are signified using the lowercase Greek letter λ (lambda). We can add some subscripts to help identify what kinds of failures we are talking about. The common variable designations used are [14]:

λ = failures
λ(t) = failure rate
λs = “safe” failures
λd = “dangerous” failures
λdd = detectable “dangerous” failures
λdu = undetectable “dangerous” failures
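
The tallying described above can be sketched in a few lines of Python. This is only an illustration: the failure records are invented, and the category labels and the function name `tally_failures` are hypothetical. It assumes the usual nesting of the categories, where every failure is either "safe" or "dangerous", and every "dangerous" failure is either detectable or undetectable.

```python
from collections import Counter

# Hypothetical field records, one entry per observed failure (invented data).
failure_records = [
    "safe", "dangerous_detectable", "safe", "dangerous_undetectable",
    "dangerous_detectable", "safe", "safe", "dangerous_detectable",
]


def tally_failures(records):
    """Count failures per category and roll them up into the nested totals."""
    counts = Counter(records)
    safe = counts["safe"]
    dd = counts["dangerous_detectable"]
    du = counts["dangerous_undetectable"]
    dangerous = dd + du  # dangerous = detectable + undetectable
    return {
        "total": safe + dangerous,  # total = safe + dangerous
        "safe": safe,
        "dangerous": dangerous,
        "dangerous_detectable": dd,
        "dangerous_undetectable": du,
    }


tally = tally_failures(failure_records)
print(tally)
```

Dividing each count by the accumulated operating time of the population would turn these tallies into rates, which is the form in which λ is normally used.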

I will discuss some of these variables in more detail in a later part of the series when I delve into Diagnostic Coverage, so don’t worry about them too much just yet.

Getting to MTTFD

Since we can now deal with the failure rate data mathematically, we can start to do some calculations about the expected lifetime of a component or a system. That expected, or probable, lifetime is what definition 3.1.25 was about and is what we call MTTFD.

MTTFD is expressed in years and applies to the period over which the failure rate is relatively constant. If you look at a typical failure rate curve, called a “bathtub curve” due to its resemblance to the profile of a nice soaker tub, this period is the flatter portion of the curve between the end of the infant mortality period and the start of the wear-out period at the end of life. This part of the curve is the portion assumed to be included in the “mission time” for the product. The mission time is the time between the complete replacement or refurbishment of the safety-related parts of the control system. The refurbishment of the SRP/CS includes the complete replacement of all wear parts of components in the SRP/CS.

ISO 13849-1 assumes the mission time for all machinery is 20 years [1, 4.5.4] and [1, Cl. 10].

[Figure: diagram of a standardized bathtub-shaped failure rate curve]
Figure 1 – Typical Bathtub Curve [15]

[Table: bands of mean time to dangerous failure of each channel (MTTFD); see [1, Table 4]]

ISO 13849-1 provides guidance on how MTTFD relates to the determination of the PL in [1, Cl. 4.5.2]. MTTFD is further grouped into three bands, as shown in [1, Table 4].

The notes for this table are important as well. Since you can’t read the notes particularly well in the table above, I’ve reproduced them here:

NOTE 1 The choice of the MTTFD ranges of each channel is based on failure rates found in the field as state-of-the-art, forming a kind of logarithmic scale fitting to the logarithmic PL scale. An MTTFD value of each channel less than three years is not expected to be found for real SRP/CS since this would mean that after one year about 30 % of all systems on the market will fail and will need to be replaced. An MTTFD value of each channel greater than 100 years is not acceptable because SRP/CS for high risks should not depend on the reliability of components alone. To reinforce the SRP/CS against systematic and random failure, additional means such as redundancy and testing should be required. To be practicable, the number of ranges was restricted to three. The limitation of MTTFD of each channel values to a maximum of 100 years refers to the single channel of the SRP/CS which carries out the safety function. Higher MTTFD values can be used for single components (see Table D.1).

NOTE 2 The indicated borders of this table are assumed within an accuracy of 5%.
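
To make the bands and the "about 30 %" remark in NOTE 1 concrete, here is a minimal Python sketch. It rests on two assumptions that should be checked against the standard: the 10-year and 30-year band boundaries are taken from my reading of [1, Table 4] (only the 3-year and 100-year limits appear in the notes above), and the failure fraction assumes a constant failure rate, i.e. an exponential distribution. The names `mttfd_band` and `fraction_failed` are hypothetical, and the 5 % tolerance from NOTE 2 is ignored.

```python
import math

# Band limits in years. The 3- and 100-year limits appear in NOTE 1;
# the 10- and 30-year boundaries are assumed from [1, Table 4].
BANDS = (("Low", 3.0, 10.0), ("Medium", 10.0, 30.0), ("High", 30.0, 100.0))
CHANNEL_CAP_YEARS = 100.0  # per-channel limit described in NOTE 1


def mttfd_band(channel_mttfd_years):
    """Return the band name for a channel MTTFD, or None if below 3 years."""
    capped = min(channel_mttfd_years, CHANNEL_CAP_YEARS)
    for name, lower, upper in BANDS:
        if lower <= capped < upper:
            return name
    # A value at the cap (100 years or more before capping) is in the top band.
    return "High" if capped >= CHANNEL_CAP_YEARS else None


def fraction_failed(t_years, mttfd_years):
    """Fraction of a population failed by time t, assuming a constant failure rate."""
    return 1.0 - math.exp(-t_years / mttfd_years)


print(mttfd_band(25.0))                      # Medium
print(round(fraction_failed(1.0, 3.0), 2))   # 0.28
```

With an MTTFD of three years, roughly 28 % of the population would have failed after one year under this assumption, which matches the "about 30 %" figure in NOTE 1.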

The standard then tells us to select the MTTFD using a simple hierarchy:

For the estimation of MTTFD of a component, the hierarchical procedure for finding data shall be, in the order given:

a) use manufacturer’s data;
b) use methods in Annex C and Annex D;
c) choose 10 years.

[1, 4.5.2]
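
The hierarchy above amounts to a simple fallback chain. A minimal sketch, assuming the inputs have already been looked up elsewhere; the function and parameter names are hypothetical and not taken from the standard:

```python
from typing import Optional

DEFAULT_MTTFD_YEARS = 10.0  # step c) of the hierarchy


def estimate_component_mttfd(
    manufacturer_years: Optional[float] = None,
    annex_years: Optional[float] = None,
) -> float:
    """Pick a component MTTFD (in years) in the order a), b), c)."""
    if manufacturer_years is not None:
        return manufacturer_years      # a) manufacturer's data
    if annex_years is not None:
        return annex_years             # b) value from Annex C or Annex D methods
    return DEFAULT_MTTFD_YEARS         # c) choose 10 years


print(estimate_component_mttfd(manufacturer_years=150.0, annex_years=50.0))  # 150.0
print(estimate_component_mttfd(annex_years=50.0))                            # 50.0
print(estimate_component_mttfd())                                            # 10.0
```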

Why ten years? Ten years is half of the assumed mission lifetime of 20 years. More on mission lifetime in a later post.

Looking at [1, Annex C.2], you will find the “Good Engineering Practices” method for estimating MTTFD, presuming the manufacturer has not provided you with that information. ISO 13849-2 [2] has reference tables that provide general MTTFD values for some kinds of components, but not every part can be listed. How can we deal with parts not listed? [1, Annex C.4] provides a calculation method for estimating MTTFD for pneumatic, mechanical and electromechanical components.

Calculating MTTFD for pneumatic, mechanical and electromechanical components

I need to introduce you to a few more variables before we look at how to calculate MTTFD for a component.

B10 = number of cycles until 10% of the components fail (for pneumatic and electromechanical components)
B10D = number of cycles until 10% of the components fail dangerously (for pneumatic and electromechanical components)
T = lifetime of the component
T10D = mean time until 10% of the components fail dangerously
hop = mean operation time, in hours per day
dop = mean operation time, in days per year
tcycle = mean time between the beginning of two successive cycles of the component (e.g., switching of a valve), in seconds per cycle
s = seconds
h = hours
a = years

Knowing a few details, we can calculate the MTTFD using [1, Eqn. C.1]. We need to know the following parameters for the application:

  • B10D
  • hop
  • dop
  • tcycle

$$\mathrm{MTTF_D} = \frac{B_{10D}}{0.1 \times n_{op}}$$
[1, Eqn. C.1]

To use [1, Eqn. C.1], we need to first calculate nop, using [1, Eqn. C.2]:

$$n_{op} = \frac{d_{op} \times h_{op} \times 3600\ \mathrm{s/h}}{t_{cycle}}$$
[1, Eqn. C.2]

We may also need one more calculation, [1, Eqn. C.4]:

$$T_{10D} = \frac{B_{10D}}{n_{op}}$$
[1, Eqn. C.4]
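
It may help to see where the factor 0.1 in [1, Eqn. C.1] comes from. The short derivation below is a sketch that assumes a constant dangerous failure rate λd (an exponential distribution), with T10D defined as the time at which 10% of the components have failed dangerously, so that T10D = B10D / nop. The symbol F(t) for the fraction failed by time t is introduced here only for illustration.

```latex
\begin{align*}
F(t) &= 1 - e^{-\lambda_d t}, \qquad \mathrm{MTTF_D} = \frac{1}{\lambda_d} \\
F(T_{10D}) = 0.1 \;&\Rightarrow\; T_{10D} = -\ln(0.9)\,\mathrm{MTTF_D} \approx 0.105\,\mathrm{MTTF_D} \approx 0.1\,\mathrm{MTTF_D} \\
\mathrm{MTTF_D} &\approx \frac{T_{10D}}{0.1} = \frac{B_{10D}}{0.1 \times n_{op}}
\end{align*}
```

Rounding 0.105 down to 0.1 is a small approximation, which is why the relation between T10D and MTTFD is simply a factor of ten.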

Example Calculation

This example calculation can be found in [1, C.4.3].

A manufacturer determines that a pneumatic valve has a B10D of 60 million cycles. The valve is used for two shifts daily on 220 operation days a year. The mean time between the beginning of two successive valve cycles is estimated as 5 seconds. This yields the following values:

  • dop = 220 days per year;
  • hop = 16 h per day;
  • tcycle = 5 s per cycle;
  • B10D = 60 million cycles.

Doing the math using the equations above, we get:

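$$n_{op} = \frac{220\ \mathrm{d/a} \times 16\ \mathrm{h/d} \times 3600\ \mathrm{s/h}}{5\ \mathrm{s/cycle}} = 2.53 \times 10^{6}\ \mathrm{cycles/a}$$

$$T_{10D} = \frac{B_{10D}}{n_{op}} = \frac{60 \times 10^{6}\ \mathrm{cycles}}{2.53 \times 10^{6}\ \mathrm{cycles/a}} = 23.7\ \mathrm{a}$$

$$\mathrm{MTTF_D} = \frac{B_{10D}}{0.1 \times n_{op}} = \frac{60 \times 10^{6}\ \mathrm{cycles}}{0.1 \times 2.53 \times 10^{6}\ \mathrm{cycles/a}} = 237\ \mathrm{a}$$
Example C.4.3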
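
The arithmetic of this example can be reproduced with a short Python sketch. The formulas are my reading of [1, Eqn. C.1], [1, Eqn. C.2] and the T10D relation (MTTFD = B10D / (0.1 × nop), nop = dop × hop × 3600 / tcycle, T10D = B10D / nop). The function names are hypothetical, and the input values are the ones listed in the example above.

```python
SECONDS_PER_HOUR = 3600


def operations_per_year(d_op, h_op, t_cycle):
    """Mean number of operations per year (n_op)."""
    return d_op * h_op * SECONDS_PER_HOUR / t_cycle


def t10d_years(b10d, n_op):
    """Time in years until 10% of the components fail dangerously."""
    return b10d / n_op


def mttfd_years(b10d, n_op):
    """Mean time to dangerous failure in years."""
    return b10d / (0.1 * n_op)


# Values from the pneumatic valve example.
n_op = operations_per_year(d_op=220, h_op=16, t_cycle=5)
t10d = t10d_years(b10d=60e6, n_op=n_op)
mttfd = mttfd_years(b10d=60e6, n_op=n_op)

print(f"n_op  = {n_op:,.0f} cycles per year")   # 2,534,400
print(f"T10D  = {t10d:.1f} years")              # 23.7
print(f"MTTFD = {mttfd:.0f} years")             # 237
```

For this valve, T10D comes out at about 23.7 years, which is longer than the 20-year mission time mentioned earlier. The component MTTFD of about 237 years is above the 100-year limit that applies to a channel.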

So there you have it, at least for a fairly simple case. There are more examples in ISO 13849-1, and I encourage you to work through them. You can also find a wealth of examples in two reports produced by the IFA (formerly BGIA) in Germany: BGIA Report 2/2008e – Functional safety of machine controls [16] and its successor, IFA Report 2/2017e – Functional safety of machine controls [17]. The downloads for the reports are linked from the reference list at the end of this article. If you are a SISTEMA user, there are many examples in the SISTEMA Cookbooks, and there are example files available to see how to assemble the systems in the software.

The next part of this series covers Diagnostic Coverage (DC) and the average DC for multiple safety functions in a system, DCavg.

If you missed the first part of the series, read it here.


Book List

Here are some books that I think you may find helpful on this journey:

[0]     B. Main, Risk Assessment: Basics and Benchmarks, 1st ed. Ann Arbor, MI USA: DSE, 2004.

[0.1]  D. Smith and K. Simpson, Safety critical systems handbook. Amsterdam: Elsevier/Butterworth-Heinemann, 2011.

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

[0.3]  Overview of techniques and measures related to EMC for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2013.

[0.4]  Code of practice for electromagnetic resilience, 1st ed. Stevenage, UK: IET Standards TC4.3 EMC, 2017.

[0.5]  Code of Practice: Competence for Safety Related Systems Practitioners, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2016.


References

Note: This reference list starts in Part 1 of the series, so “missing” references may show in other parts of the series. Included in the last post of the series is the complete reference list.

[1]     Safety of machinery – Safety-related parts of control systems – Part 1: General principles for design. 3rd Edition. ISO Standard 13849-1. 2015.

[2]     Safety of machinery — Safety-related parts of control systems — Part 2: Validation. 2nd Edition. ISO Standard 13849-2. 2012.

[7]     Functional safety of electrical/electronic/programmable electronic safety-related systems. 7 parts. IEC Standard 61508. Second Edition. 2010.

[14]    Functional safety of electrical/electronic/programmable electronic safety-related systems — Part 4: Definitions and abbreviations. IEC Standard 61508-4. Second Edition. 2010.

[15]    “The bathtub curve and product failure behavior part 1 of 2”, Findchart.co, 2017. [Online]. Available: http://findchart.co/download.php?aHR0cDovL3d3dy53ZWlidWxsLmNvbS9ob3R3aXJlL2lzc3VlMjEvaHQyMV8xLmdpZg. [Accessed: 2017-01-03].

[16]   “Functional safety of machine controls – Application of EN ISO 13849 (BGIA Report 2/2008e)”, dguv.de, 2009. [Online]. Available: http://www.dguv.de/ifa/publikationen/reports-download/bgia-reports-2007-bis-2008/bgia-report-2-2008/index-2.jsp. [Accessed: 2017-01-04].

[17] “Functional safety of machine controls – Application of EN ISO 13849. IFA Report 2/2017e”, dguv.de, 2019. [Online]. Available: https://www.dguv.de/ifa/publikationen/reports-download/reports-2017/ifa-report-2-2017/index-2.jsp. [Accessed: 2021-05-07].

© 2017 – 2022, Compliance inSight Consulting Inc. Creative Commons Licence
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
