## ISO 13849 – 1 Analysis — Part 8: Fault Exclusion

This entry is part of 9 in the series How to do a 13849 – 1 ana­lys­is

# Fault Consideration & Fault Exclusion

ISO 13849 – 1, Chapter 7 [1, 7] dis­cusses the need for fault con­sid­er­a­tion and fault exclu­sion. Fault con­sid­er­a­tion is the pro­cess of examin­ing the com­pon­ents and sub-​systems used in the safety-​related part of the con­trol sys­tem (SRP/​CS) and mak­ing a list of all the faults that could occur in each one. This a def­in­itely non-​trivial exer­cise!

Thinking back to some of the earli­er art­icles in this series where I men­tioned the dif­fer­ent types of faults, you may recall that there are detect­able and undetect­able faults, and there are safe and dan­ger­ous faults, lead­ing us to four kinds of fault:

• Safe undetect­able faults
• Dangerous undetect­able faults
• Safe detect­able faults
• Dangerous undetect­able faults

For sys­tems where no dia­gnostics are used, Category B and 1, faults need to be elim­in­ated using inher­ently safe design tech­niques. Care needs to be taken when clas­si­fy­ing com­pon­ents as “well-​tried” versus using a fault exclu­sion, as com­pon­ents that might nor­mally be con­sidered “well-​tried” might not meet those require­ments in every applic­a­tion.

For sys­tems where dia­gnostics are part of the design, i.e., Category 2, 3, and 4, the fault lists are used to eval­u­ate the dia­gnost­ic cov­er­age (DC) of the test sys­tems. Depending on the archi­tec­ture, cer­tain levels of DC are required to meet the rel­ev­ant PL, see [1, Fig. 5]. The fault lists are start­ing point for the determ­in­a­tion of DC, and are an input into the hard­ware and soft­ware designs. All of the dan­ger­ous detect­able faults must be covered by the dia­gnostics, and the DC must be high enough to meet the PLr. for the safety func­tion.

The fault lists and fault exclu­sions are used in the Validation por­tion of this pro­cess as well. At the start of the Validation pro­cess flow chart [2, Fig. 1], you can see how the fault lists and the cri­ter­ia used for fault exclu­sion are used as inputs to the val­id­a­tion plan.

Faults that can be excluded do not need to val­id­ated, sav­ing time and effort dur­ing the sys­tem veri­fic­a­tion and val­id­a­tion (V & V). How is this done?

## Fault Consideration

The first step is to devel­op a list of poten­tial faults that could occur, based on the com­pon­ents and sub­sys­tems included in SRP/​CS. ISO 13849 – 2 [2] includes lists of typ­ic­al faults for vari­ous tech­no­lo­gies. For example, [2, Table A.4] is the fault list for mech­an­ic­al com­pon­ents.

[2] con­tains tables sim­il­ar to Table A.4 for:

• Pressure-​coil springs
• Directional con­trol valves
• Stop (shut-​off) valves/​non-​return (check) valves/​quick-​action vent­ing valves/​shuttle valves, etc.
• Flow valves
• Pressure valves
• Pipework
• Hose assem­blies
• Connectors
• Pressure trans­mit­ters and pres­sure medi­um trans­ducers
• Compressed air treat­ment — Filters
• Compressed-​air treat­ment — Oilers
• Compressed air treat­ment — Silencers
• Accumulators and pres­sure ves­sels
• Sensors
• Fluidic Information pro­cessing — Logical ele­ments
• etc.

As you can see, there are many dif­fer­ent types of faults that need to be con­sidered. Keep in mind that I did not give you all of the dif­fer­ent fault lists – this post would be a mile long if I did that! The point is that you need to devel­op a fault list for your sys­tem, and then con­sider the impact of each fault on the oper­a­tion of the sys­tem. If you have com­pon­ents or sub­sys­tems that are not lis­ted in the tables, then you need to devel­op your own fault lists for those items. Using Failure Modes and Effects Analysis (FMEA) tech­niques are usu­ally the best approach for these com­pon­ents [23], [24].

When con­sid­er­ing the faults to be included in the list there are a few things that should be con­sidered [1, 7.2]:

• if after the first fault occurs oth­er faults devel­op due to the first fault, then you can group those faults togeth­er as a single fault
• two or more single faults with a com­mon cause can be con­sidered as a single fault
• mul­tiple faults with dif­fer­ent causes but occur­ring sim­ul­tan­eously is con­sidered improb­able and does not need to be con­sidered

### Examples

A voltage reg­u­lat­or fails in a sys­tem power sup­ply so that the 24 Vdc out­put rises to an unreg­u­lated 36 Vdc (the intern­al power sup­ply bus voltage), and after some time has passed, two sensors fail, then all three fail­ures can be grouped and con­sidered as a single fault.

If a light­ning strike occurs on the power line and the res­ult­ing surge voltage on the 400 V mains causes an inter­pos­ing con­tact­or and the motor drive it con­trols to fail to danger, then these fail­ures may be grouped and con­sidered as one.

A pneu­mat­ic lub­ric­at­or runs out of lub­ric­ant and is not refilled, depriving down­stream pneu­mat­ic com­pon­ents of lub­ric­a­tion. The spool on the sys­tem dump valve sticks open because it is not cycled often enough. Neither of these fail­ures has the same cause, so there is no need to con­sider them as occur­ring sim­ul­tan­eously because the prob­ab­il­ity of both hap­pen­ing con­cur­rently is extremely small. One cau­tion: These two faults MAY have a com­mon cause – poor main­ten­ance. Even if this is true and you decide to con­sider them to be two faults with a com­mon cause, they could then be grouped as a single fault.

## Fault Exclusion

Once you have your well-​considered fault lists togeth­er, the next ques­tion is “Can any of the lis­ted faults be excluded?” This is a tricky ques­tion! There are a few points to con­sider:

• Does the sys­tem archi­tec­ture allow for fault exclu­sion?
• Is the fault tech­nic­ally improb­able, even if it is pos­sible?
• Does exper­i­ence show that the fault is unlikely to occur?*
• Are there tech­nic­al require­ments related to the applic­a­tion and the haz­ard that might sup­port fault exclu­sion?

BE CAREFUL with this one!

Whenever faults are excluded, a detailed jus­ti­fic­a­tion for the exclu­sion needs to be included in the sys­tem design doc­u­ment­a­tion. Simply decid­ing that the fault can be excluded is NOT ENOUGH! Consider the risk a per­son will be exposed to in the event the fault occurs. If the sever­ity is very high, i.e., severe per­man­ent injury or death, you may not want to exclude the fault even if you think you could. Careful con­sid­er­a­tion of the res­ult­ing injury scen­ario is needed.

Basing a fault exclu­sion on per­son­al exper­i­ence is sel­dom con­sidered adequate, which is why I added the aster­isk (*) above. Look for good stat­ist­ic­al data to sup­port any decision to use a fault exclu­sion.

There is much more inform­a­tion avail­able in IEC 61508 – 2 on the sub­ject of fault exclu­sion, and there is good inform­a­tion in some of the books men­tioned below [0.2], [0.3], and [0.4]. If you know of addi­tion­al resources you would like to share, please post the inform­a­tion in the com­ments!

# Definitions

3.1.3 fault
state of an item char­ac­ter­ized by the inab­il­ity to per­form a required func­tion, exclud­ing the inab­il­ity dur­ing pre­vent­ive main­ten­ance or oth­er planned actions, or due to lack of extern­al resources
Note 1 to entry: A fault is often the res­ult of a fail­ure of the item itself, but may exist without pri­or fail­ure.
Note 2 to entry: In this part of ISO 13849, “fault” means ran­dom fault. [SOURCE: IEC 60050?191:1990, 05 – 01.]

# Book List

Here are some books that I think you may find help­ful on this jour­ney:

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

## References

Note: This ref­er­ence list starts in Part 1 of the series, so “miss­ing” ref­er­ences may show in oth­er parts of the series. Included in the last post of the series is the com­plete ref­er­ence list.

[16]     “IFA – Practical aids: Software-​Assistent SISTEMA: Safety Integrity – Software Tool for the Evaluation of Machine Applications”, Dguv​.de, 2017. [Online]. Available: http://​www​.dguv​.de/​i​f​a​/​p​r​a​x​i​s​h​i​l​f​e​n​/​p​r​a​c​t​i​c​a​l​-​s​o​l​u​t​i​o​n​s​-​m​a​c​h​i​n​e​-​s​a​f​e​t​y​/​s​o​f​t​w​a​r​e​-​s​i​s​t​e​m​a​/​i​n​d​e​x​.​jsp. [Accessed: 30- Jan- 2017].

[17]      “fail­ure mode”, 192−03−17, International Electrotechnical Vocabulary. IEC International Electrotechnical Commission, Geneva, 2015.

[18]      M. Gentile and A. E. Summers, “Common Cause Failure: How Do You Manage Them?,” Process Saf. Prog., vol. 25, no. 4, pp. 331 – 338, 2006.

[19]     Out of Control — Why con­trol sys­tems go wrong and how to pre­vent fail­ure, 2nd ed. Richmond, Surrey, UK: HSE Health and Safety Executive, 2003.

[22]     “Field-​programmable gate array”, En​.wiki​pe​dia​.org, 2017. [Online]. Available: https://​en​.wiki​pe​dia​.org/​w​i​k​i​/​F​i​e​l​d​-​p​r​o​g​r​a​m​m​a​b​l​e​_​g​a​t​e​_​a​r​ray. [Accessed: 16-​Jun-​2017].

[23]     Analysis tech­niques for sys­tem reli­ab­il­ity – Procedure for fail­ure mode and effects ana­lys­is (FMEA). 2nd Ed. IEC Standard 60812. 2006.

[24]     “Failure mode and effects ana­lys­is”, En​.wiki​pe​dia​.org, 2017. [Online]. Available: https://​en​.wiki​pe​dia​.org/​w​i​k​i​/​F​a​i​l​u​r​e​_​m​o​d​e​_​a​n​d​_​e​f​f​e​c​t​s​_​a​n​a​l​y​sis. [Accessed: 16-​Jun-​2017].

## ISO 13849 – 1 Analysis — Part 5: Diagnostic Coverage (DC)

This entry is part 5 of 9 in the series How to do a 13849 – 1 ana­lys­is

# What is Diagnostic Coverage?

Understanding Diagnostic Coverage (DC) as it is used in ISO 13849 – 1 [1] is crit­ic­al to ana­lys­ing the design of any safety func­tion assessed using this stand­ard. In case you missed a pre­vi­ous part of the series, you can read it here.

In the last instal­ment of this series dis­cuss­ing MTTFD, I brought up the fact that everything fails even­tu­ally, and so everything has a nat­ur­al fail­ure rate. The bathtub curve shown at the top of this post shows a typ­ic­al fail­ure rate curve for most products. Failure rates tell you the aver­age time (or some­times the mean time) it takes for com­pon­ents or sys­tems to fail. Failure rates are expressed in many ways, MTTFD and PFHd being the ways rel­ev­ant to this dis­cus­sion of ISO 13849 ana­lys­is. MTTFis giv­en in years, and PFHd is giv­en in frac­tion­al hours (1/​h). As a remind­er, PFHd stands for “Probability of dan­ger­ous Failure per Hour”.

Three of the stand­ard archi­tec­tures include auto­mat­ic dia­gnost­ic func­tions, Categories 2, 3 and 4. As soon as we add dia­gnostics to the sys­tem, we need to know what faults the dia­gnostics can detect and how many of the dan­ger­ous fail­ures rel­at­ive to the total num­ber of fail­ures that rep­res­ents. Diagnostic Coverage (DC) rep­res­ents the ratio of dan­ger­ous fail­ures that can be detec­ted to the total dan­ger­ous fail­ures that could occur, expressed as a per­cent­age. There will be some fail­ures that do not res­ult in a dan­ger­ous fail­ure, and those fail­ures are excluded from DC because we don’t need to worry about them – if they occur, the sys­tem will not fail into a dan­ger­ous state.

Here’s the form­al defin­i­tion from [1]:

3.1.26 dia­gnost­ic cov­er­age (DC)

meas­ure of the effect­ive­ness of dia­gnostics, which may be determ­ined as the ratio between the fail­ure rate of detec­ted dan­ger­ous fail­ures and the fail­ure rate of total dan­ger­ous fail­ures

Note 1 to entry: Diagnostic cov­er­age can exist for the whole or parts of a safety-​related sys­tem. For example, dia­gnost­ic cov­er­age could exist for sensors and/​or logic sys­tem and/​or final ele­ments. [SOURCE: IEC 61508 – 4:1998, 3.8.6, mod­i­fied.]

That brings up two oth­er related defin­i­tions that need to be kept in mind [1]:

3.1.4 fail­ure

ter­min­a­tion of the abil­ity of an item to per­form a required func­tion

Note 1 to entry: After a fail­ure, the item has a fault.

Note 2 to entry: “Failure” is an event, as dis­tin­guished from “fault”, which is a state.

Note 3 to entry: The concept as defined does not apply to items con­sist­ing of soft­ware only.

Note 4 to entry: Failures which only affect the avail­ab­il­ity of the pro­cess under con­trol are out­side of the scope of this part of ISO 13849. [SOURCE: IEC 60050 – 191:1990, 04 – 01.]

and the most import­ant one [1]:

3.1.5 dan­ger­ous fail­ure

fail­ure which has the poten­tial to put the SRP/​CS in a haz­ard­ous or fail-​to-​function state

Note 1 to entry: Whether or not the poten­tial is real­ized can depend on the chan­nel archi­tec­ture of the sys­tem; in redund­ant sys­tems a dan­ger­ous hard­ware fail­ure is less likely to lead to the over­all dan­ger­ous or fail-​to- func­tion state.

Note 2 to entry: [SOURCE: IEC 61508 – 4, 3.6.7, mod­i­fied.]

Just as a remind­er, SRP/​CS stands for “safety-​related parts of con­trol sys­tems”.

## Failure Math

### Failure Rate Data Sources

To do any cal­cu­la­tions, we need data, and this is true for fail­ure rates as well. ISO 13849 – 1 provides some tables in the annexes that list some com­mon types of com­pon­ents and their asso­ci­ated fail­ure rates, and there are more fail­ure rate tables in ISO 13849 – 2. A word of cau­tion here: Do not mix sources of fail­ure rate data, as the con­di­tions under which that data is true won’t match the data in ISO 13849. There are a few good sources of fail­ure rate data out there, for example, MIL-​HDBK-​217, Reliability Prediction of Electronic Equipment [15], as well as the data­base main­tained by Exida. In any case, use a single source for your fail­ure rate data.

### Failure Rate Variables

IEC 61508 [7] defines a num­ber of vari­ables related to fail­ure rates. The lower­case Greek let­ter lambda, $\lambda$, is used to denote fail­ures.

The com­mon vari­able des­ig­na­tions used are:

$\lambda$ = fail­ures
$\lambda_{(t)}$= fail­ure rate
$\lambda_s$ = “safe” fail­ures
$\lambda_d$ = “dan­ger­ous” fail­ures
$\lambda_{dd}$ = detect­able “dan­ger­ous” fail­ures
$\lambda_{du}$ = undetect­able “dan­ger­ous” fail­ures

### Calculating DC

Of these vari­ables, we only need to con­cern ourselves with $\lambda_d$, $\lambda_{dd}$ and $\lambda_{du}$. To under­stand how these vari­ables are used, we can express their rela­tion­ship as

$\lambda_d=\lambda_{dd}+\lambda_{du}$

Following on that idea, the Diagnostic Coverage can be expressed as a per­cent­age like this:

$DC\%=\frac{\lambda_{dd}}{\lambda_d}\times 100$

## Determining DC%

If you want to actu­ally cal­cu­late DC%, you have some work ahead of you. Rather than going into the details here, I am going to refer you hard­core types to IEC 61508 – 2, Functional safety of electrical/​electronic/​programmable elec­tron­ic safety-​related sys­tems – Part 2: Requirements for electrical/​electronic/​programmable elec­tron­ic safety-​related sys­tems. This stand­ard goes into some depth on how to determ­ine fail­ure rates and how to cal­cu­late the “Safe Failure Fraction,” a num­ber which is related to DC but is not the same.

For every­one else, the good news is that you can use the table in Annex E to estim­ate the DC%. It’s worth not­ing here that Annex E is “Informative.” In standards-​speak, this means that the inform­a­tion in the annex is not part of the “norm­at­ive” text, which means that it is simply inform­a­tion to help you use the norm­at­ive part of the stand­ard. The design must con­form to the require­ments in the norm­at­ive text if you want to claim con­form­ity to the stand­ard. The fact that [1, Annex E] is inform­at­ive gives you the option to cal­cu­late the DC% value rather than select­ing it from Table E.1. Using the cal­cu­lated value would not viol­ate the require­ments in the norm­at­ive text.

If you are using IFA SISTEMA [16] to do the cal­cu­la­tions for you, you will find that the soft­ware lim­its you to select­ing a single DC meas­ure from Table E.1, and this prin­ciple applies if you are doing the cal­cu­la­tions by hand too. Only one item from Table E.1 can be selec­ted for a giv­en safety func­tion.

## Ranking DC

Once you have determ­ined the DC for a safety func­tion, you need to com­pare the DC value against [1, Table 5] to see if the DC is suf­fi­cient for the PLr you are try­ing to achieve. Table 5 bins the DC res­ults into four ranges. Just like bin­ning the PFHd val­ues into five ranges helps to pre­vent pre­ci­sion bias in estim­at­ing the prob­ab­il­ity of fail­ure of the com­plete sys­tem or safety func­tion, the ranges in Table 5 helps to pre­vent pre­ci­sion bias in the cal­cu­lated or selec­ted DC val­ues.

If the DC value was high enough for the PLr, then you are done with this part of the work. If not, you will need to go back to your design and add addi­tion­al dia­gnost­ic fea­tures so that you can either select a high­er cov­er­age from [1, Table E.1] or cal­cu­late a high­er value using [14].

## Multiple safety functions

When you have mul­tiple safety func­tions that make up a com­plete safety sys­tem, for example, an emer­gency stop func­tion and a guard inter­lock­ing func­tion, the DC val­ues need to be aver­aged to determ­ine the over­all DC for the com­plete sys­tem. [1, Annex E] provides you with a meth­od to do this in Equation E.1.

Plug in the val­ues for MTTFD and DC for each safety func­tion, and cal­cu­late the res­ult­ing DCavg value for the com­plete sys­tem.

That’s it for this art­icle. The next part will cov­er Common Cause Failures (CCF). Look for it on 20-​Mar-​17!

In case you missed the first part of the series, you can read it here.

## Book List

Here are some books that I think you may find help­ful on this jour­ney:

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

## References

Note: This ref­er­ence list starts in Part 1 of the series, so “miss­ing” ref­er­ences may show in oth­er parts of the series. Included in the last post of the series is the com­plete ref­er­ence list.

[16]     “IFA – Practical aids: Software-​Assistent SISTEMA: Safety Integrity – Software Tool for the Evaluation of Machine Applications”, Dguv​.de, 2017. [Online]. Available: http://​www​.dguv​.de/​i​f​a​/​p​r​a​x​i​s​h​i​l​f​e​n​/​p​r​a​c​t​i​c​a​l​-​s​o​l​u​t​i​o​n​s​-​m​a​c​h​i​n​e​-​s​a​f​e​t​y​/​s​o​f​t​w​a​r​e​-​s​i​s​t​e​m​a​/​i​n​d​e​x​.​jsp. [Accessed: 30- Jan- 2017].

## ISO 13849 – 1 Analysis — Part 4: MTTFD – Mean Time to Dangerous Failure

This entry is part 4 of 9 in the series How to do a 13849 – 1 ana­lys­is

Functional safety is all about the like­li­hood of a safety sys­tem fail­ing to oper­ate when you need it. Understanding Mean Time to Dangerous Failure, or MTTFD, is crit­ic­al. If you have been read­ing about this top­ic at all, you may notice that I am abbre­vi­at­ing Mean Time to Dangerous Failure with all cap­it­al let­ters. Using MTTFD is a recent change that occurred in the third edi­tion of ISO 13849 – 1, pub­lished in 2015. In the first and second edi­tions, the cor­rect abbre­vi­ation was MTTFd. Onward!

If you missed the third instal­ment in this series, you can read it here.

## Defining MTTFD

Let’s start by hav­ing a look at some key defin­i­tions. Looking at [1, Cl. 3], you will find:

3.1.1 safety – related part of a con­trol sys­tem (SRP/​CS)—part of a con­trol sys­tem that responds to safety-​related input sig­nals and gen­er­ates safety-​related
out­put sig­nals

Note 1 to entry: The com­bined safety-​related parts of a con­trol sys­tem start at the point where the safety-​related input sig­nals are ini­ti­ated (includ­ing, for example, the actu­at­ing cam and the roller of the pos­i­tion switch) and end at the out­put of the power con­trol ele­ments (includ­ing, for example, the main con­tacts of a con­tact­or)

Note 2 to entry: If mon­it­or­ing sys­tems are used for dia­gnostics, they are also con­sidered as SRP/​CS.

3.1.5 dan­ger­ous fail­ure—fail­ure which has the poten­tial to put the SRP/​CS in a haz­ard­ous or fail-​to-​function state

Note 1 to entry: Whether or not the poten­tial is real­ized can depend on the chan­nel archi­tec­ture of the sys­tem;
in redund­ant sys­tems a dan­ger­ous hard­ware fail­ure is less likely to lead to the over­all dan­ger­ous or fail-​tofunction
state.

Note 2 to entry: [SOURCE: IEC 61508 – 4, 3.6.7, mod­i­fied.]

3.1.25 mean time to dan­ger­ous fail­ure (MTTFD)—expect­a­tion of the mean time to dan­ger­ous fail­ure

Definition 3.1.5 is pretty help­ful, but defin­i­tion 3.1.25 is, well, not much of a defin­i­tion. Let’s look at this anoth­er way.

## Failures and Faults

Since everything can and will even­tu­ally fail to per­form the way we expect it to, we know that everything has a fail­ure rate because everything takes some time to fail. Granted that this time may be very short, like the first time the unit is turned on, or it may be very long, some­times hun­dreds of years. Remember that because this is a rate, it is some­thing that occurs over time. It is also import­ant to be clear that we are talk­ing about fail­ures and not faults. Reading from [1]:

3.1.3 fault—state of an item char­ac­ter­ized by the inab­il­ity to per­form a required func­tion, exclud­ing the inab­il­ity dur­ing pre­vent­ive main­ten­ance or oth­er planned actions, or due to lack of extern­al resources

Note 1 to entry: A fault is often the res­ult of a fail­ure of the item itself, but may exist without pri­or fail­ure.

Note 2 to entry: In this part of ISO 13849, “fault” means ran­dom fault.
[SOURCE: IEC 60050?191:1990, 05 – 01.]

3.1.4 fail­ure— ter­min­a­tion of the abil­ity of an item to per­form a required func­tion

Note 1 to entry: After a fail­ure, the item has a fault.

Note 2 to entry: “Failure” is an event, as dis­tin­guished from “fault”, which is a state.

Note 3 to entry: The concept as defined does not apply to items con­sist­ing of soft­ware only.

Note 4 to entry: Failures which only affect the avail­ab­il­ity of the pro­cess under con­trol are out­side of the scope of this part of ISO 13849.
[SOURCE: IEC 60050 – 191:1990, 04 – 01.]

3.1.4 Note 2 is the import­ant one at this point in the dis­cus­sion.

Now, where we have mul­tiples of some­thing, like relays, valves, or safety sys­tems, we now have a pop­u­la­tion of identic­al items, each of which will even­tu­ally fail at some point. We can count those fail­ures as they occur and tally them up, and we can graph how many fail­ures we get in the pop­u­la­tion over time. If this is start­ing to sound sus­pi­ciously like stat­ist­ics to you, that is because it is.

OK, so let’s look at the kinds of fail­ures that occur in that pop­u­la­tion. Some fail­ures will res­ult in a “safe” state, e.g., a relay fail­ing with all poles open, and some will fail in a poten­tially “dan­ger­ous” state, like a nor­mally closed valve devel­op­ing a sig­ni­fic­ant leak. If we tally up all the fail­ures that occur, and then tally the num­ber of “safe” fail­ures and the num­ber of “dan­ger­ous” fail­ures in that pop­u­la­tion, we now have some very use­ful inform­a­tion.

The dif­fer­ent kinds of fail­ures are sig­ni­fied using the lower­case Greek let­ter $\lambda$ (lambda). We can add some sub­scripts to help identi­fy what kinds of fail­ures we are talk­ing about. The com­mon vari­able des­ig­na­tions used are [14]:

$\lambda$ = fail­ures
$\lambda_{(t)}$= fail­ure rate
$\lambda_s$ = “safe” fail­ures
$\lambda_d$ = “dan­ger­ous” fail­ures
$\lambda_{dd}$ = detect­able “dan­ger­ous” fail­ures
$\lambda_{du}$ = undetect­able “dan­ger­ous” fail­ures

I will be dis­cuss­ing some of these vari­ables in more detail in a later part of the series when I delve into Diagnostic Coverage, so don’t worry about them too much just yet.

## Getting to MTTFD

Since we can now start to deal with the fail­ure rate data math­em­at­ic­ally, we can start to do some cal­cu­la­tions about expec­ted life­time of a com­pon­ent or a sys­tem. That expec­ted, or prob­able, life­time is what defin­i­tion 3.1.25 was on about, and is what we call MTTFD.

MTTFD is the time in years over which the prob­ab­il­ity of fail­ure is rel­at­ively con­stant. If you look at a typ­ic­al fail­ure rate curve, called a “bathtub curve” due to its resemb­lance to the pro­file of a nice soak­er tub, the MTTFD is the flat­ter por­tion of the curve between the end of the infant mor­tal­ity peri­od and the wear-​out peri­od at the end of life. This part of the curve is the por­tion assumed to be included in the “mis­sion time” for the product. ISO 13849 – 1 assumes the mis­sion time for all machinery is 20 years [1, 4.5.4] and [1, Cl. 10].

ISO 13849 – 1 provides us with guid­ance on how MTTFD relates to the determ­in­a­tion of the PL in [1, Cl. 4.5.2]. MTTFD is fur­ther grouped into three bands as shown in [1, Table 4].

The notes for this table are import­ant as well. Since you can’t read the notes par­tic­u­larly well in the table above, I’ve repro­duced them here:

NOTE 1 The choice of the MTTFD ranges of each chan­nel is based on fail­ure rates found in the field as state-​of-​the-​art, form­ing a kind of log­ar­ithmic scale fit­ting to the log­ar­ithmic PL scale. An MTTFD value of each chan­nel less than three years is not expec­ted to be found for real SRP/​CS since this would mean that after one year about 30 % of all sys­tems on the mar­ket will fail and will need to be replaced. An MTTFD value of each chan­nel great­er than 100 years is not accept­able because SRP/​CS for high risks should not depend on the reli­ab­il­ity of com­pon­ents alone. To rein­force the SRP/​CS against sys­tem­at­ic and ran­dom fail­ure, addi­tion­al means such as redund­ancy and test­ing should be required. To be prac­tic­able, the num­ber of ranges was restric­ted to three. The lim­it­a­tion of MTTFD of each chan­nel val­ues to a max­im­um of 100 years refers to the single chan­nel of the SRP/​CS which car­ries out the safety func­tion. Higher MTTFD val­ues can be used for single com­pon­ents (see Table D.1).

NOTE 2 The indic­ated bor­ders of this table are assumed with­in an accur­acy of 5%.

The stand­ard then tells us to select the MTTFD using a simple hier­archy [1, 4.5.2]:

For the estim­a­tion ofMTTFD of a com­pon­ent, the hier­arch­ic­al pro­ced­ure for find­ing data shall be, in the order giv­en:

a) use manufacturer’s data;
b) use meth­ods in Annex C and Annex D;
c) choose 10 years.

Why ten years? Ten years is half of the assumed mis­sion life­time of 20 years. More on mis­sion life­time in a later post.

Looking at [1, Annex C.2], you will find the “Good Engineering Practices” meth­od for estim­at­ing MTTFD, pre­sum­ing the man­u­fac­turer has not provided you with that inform­a­tion. ISO 13849 – 2 [2] has some ref­er­ence tables that provide some gen­er­al MTTFD val­ues for some kinds of com­pon­ents, but not every part that exists can be lis­ted. How can we deal with parts not lis­ted? [1, Annex C.4] provides us with a cal­cu­la­tion meth­od for estim­at­ing MTTFD for pneu­mat­ic, mech­an­ic­al and elec­tromech­an­ic­al com­pon­ents.

### Calculating MTTFD for pneumatic, mechanical and electromechanical components

I need to intro­duce you to a few more vari­ables before we look at how to cal­cu­late MTTFD for a com­pon­ent.

Variables
Variable Description
B10 Number of cycles until 10% of the com­pon­ents fail (for pneu­mat­ic and elec­tromech­an­ic­al com­pon­ents)
B10D Number of cycles until 10% of the com­pon­ents fail dan­ger­ously (for pneu­mat­ic and elec­tromech­an­ic­al com­pon­ents)
T life­time of the com­pon­ent
T10D the mean time until 10% of the com­pon­ents fail dan­ger­ously
hop is the mean oper­a­tion time, in hours per day;
dop is the mean oper­a­tion time, in days per year;
tcycle is the mean oper­a­tion time between the begin­ning of two suc­cess­ive cycles of the com­pon­ent. (e.g., switch­ing of a valve) in seconds per cycle.
s seconds
h hours
a years

Knowing a few details we can cal­cu­late the MTTFD using [1, Eqn C.1]. We need to know the fol­low­ing para­met­ers for the applic­a­tion:

• B10D
• hop
• dop
• tcycle

In order to use [1, Eqn. C.1], we need to first cal­cu­late nop, using [1, Eqn. C.2]:

We may also need one more cal­cu­la­tion, [1, Eqn. C.4]:

## Example Calculation [1, C.4.3]

For a pneu­mat­ic valve, a man­u­fac­turer determ­ines a mean value of 60 mil­lion cycles as B10D. The valve is used for two shifts each day on 220 oper­a­tion days a year. The mean time between the begin­ning of two suc­cess­ive switch­ing of the valve is estim­ated as 5 s. This yields the fol­low­ing val­ues:

• dop of 220 days per year;
• hop of 16 h per day;
• tcycle of 5 s per cycle;
• B10D of 60 mil­lion cycles.

Doing the math, we get:

So there you have it, at least for a fairly simple case. There are more examples in ISO 13849 – 1, and I would encour­age you to work through them. You can also find a wealth of examples in a report pro­duced by the BGIA in Germany, called the Functional safety of machine con­trols (BGIA Report 2/​2008e) [16]. The down­load for the report is linked from the ref­er­ence list at the end of this art­icle. If you are a SISTEMA user, there are lots of examples in the SISTEMA Cookbooks, and there are example files avail­able so that you can see how to assemble the sys­tems in the soft­ware.

The next part of this series cov­ers Diagnostic Coverage (DC), and the aver­age DC for mul­tiple safety func­tions in a sys­tem, DCavg.

In case you missed the first part of the series, you can read it here.

## Book List

Here are some books that I think you may find help­ful on this jour­ney:

[0.2]  Electromagnetic Compatibility for Functional Safety, 1st ed. Stevenage, UK: The Institution of Engineering and Technology, 2008.

## References

Note: This ref­er­ence list starts in Part 1 of the series, so “miss­ing” ref­er­ences may show in oth­er parts of the series. Included in the last post of the series is the com­plete ref­er­ence list.

[15]    “The bathtub curve and product fail­ure beha­vi­or part 1 of 2”, Findchart​.co, 2017. [Online]. Available: http://​find​chart​.co/​d​o​w​n​l​o​a​d​.​p​h​p​?​a​H​R​0​c​D​o​v​L​3​d​3​d​y​5​3​Z​W​l​i​d​W​x​s​L​m​N​v​b​S​9​o​b​3​R​3​a​X​J​l​L​2​l​z​c​3​V​l​M​j​E​v​a​H​Q​y​M​V​8​x​L​m​d​pZg. [Accessed: 03- Jan- 2017].

[16]   “Functional safety of machine con­trols – Application of EN ISO 13849 (BGIA Report 2/​2008e)”, dguv​.de, 2017. [Online]. Available: http://​www​.dguv​.de/​i​f​a​/​p​u​b​l​i​k​a​t​i​o​n​e​n​/​r​e​p​o​r​t​s​-​d​o​w​n​l​o​a​d​/​b​g​i​a​-​r​e​p​o​r​t​s​-​2​0​0​7​-​b​i​s​-​2​0​0​8​/​b​g​i​a​-​r​e​p​o​r​t-2 – 2008/index-2.jsp. [Accessed: 2017-​01-​04].