Ed. note: I’ve made a few updates to this article since it was first published in 2011, with the most recent on 2020-05-11. – DN
The most reliable of the five system architectures, Category 4 is still considered single-fault tolerant and uses enhanced diagnostics (DC ≥ 99%) to help ensure that component failure does not result in an unacceptable exposure to risk. This installment in the series on system architectures looks at Category 4 in depth. The definitions and requirements discussed in this article come from ISO 13849-1, Edition 3 (2015) [1] and ISO 13849-2, Edition 2 (2012) [2].
As in the preceding articles in this series, I’ll be building on concepts discussed earlier. If you need more information, you should look at the previous articles to see if I’ve answered your questions.
The Definition
The Category 4 definition builds on both Category B and Category 3. As you read, recall that “SRP/CS” stands for “Safety-Related Parts of the Control System.” Here is the complete definition:
6.2.7 Category 4
For category 4, the same requirements as those according to 6.2.3 for category B shall apply. “Well-tried safety principles” according to 6.2.4 shall also be followed. In addition, the following applies.
SRP/CS of category 4 shall be designed such that
- a single fault in any of these safety-related parts does not lead to a loss of the safety function, and
- the single fault is detected at or before the next demand upon the safety functions, e.g. immediately, at switch on, or at end of a machine operating cycle,
but if this detection is not possible, then an accumulation of undetected faults shall not lead to the loss of the safety function.
The diagnostic coverage (DCavg) of the total SRP/CS shall be high, including the accumulation of faults. The MTTFD of each of the redundant channels shall be high. Measures against CCF shall be applied (see Annex F).
NOTE 1
Category 4 system behaviour is characterized by
- continued performance of the safety function in the presence of a single fault,
- detection of faults in time to prevent the loss of the safety function,
- the accumulation of undetected faults is taken into account.
NOTE 2
The difference between category 3 and category 4 is a higher DCavg in category 4 and a required MTTFD of each channel of “high” only. In practice, the consideration of a fault combination of two faults may be sufficient. [1, 6.2.7]
Breaking it down
For category 4, the same requirements as those according to 6.2.3 for category B shall apply. “Well-tried safety principles” according to 6.2.4 shall also be followed.
The first two sentences give the basic requirement for all the categories from 2 through 4. Sound component selection based on the application requirements for voltage, current, switching capability and lifetime must be considered. In addition, using well-tried safety principles, such as switching the +V rail side of the coil circuit for control components, is required. If you aren’t sure about what constitutes a “well-tried safety principle,” see the article on Category 2 where this is discussed. Don’t confuse “well-tried safety principles” with “well-tried components”. There is no requirement in Category 4 for the use of well-tried components, although you can use them for additional reliability if the design requirements warrant it.
In addition, the following applies.
SRP/CS of category 4 shall be designed such that
- a single fault in any of these safety-related parts does not lead to a loss of the safety function, and
- the single fault is detected at or before the next demand upon the safety functions, e.g. immediately, at switch on, or at end of a machine operating cycle,
but if this detection is not possible, then an accumulation of undetected faults shall not lead to the loss of the safety function.
This is the big one. This paragraph and the two bullets that follow it define the fundamental performance requirements for this category. In Category 4, no single fault can lead to the loss of the safety function, testing is required to detect failures, and an accumulation of undetected faults cannot eventually lead to the loss of the safety function. The requirement regarding undetected faults means that faults that would fall into what IEC 61508 calls “λdu” (i.e., dangerous undetected failures) must be eliminated by design if the diagnostics cannot detect them, or the diagnostics need to be improved so that they become dangerous detected, or “λdd”, failures. This increase in the diagnostic capability of the system is the fundamental difference between Category 3 and Category 4. Note that the next paragraph supports this.
The diagnostic coverage (DCavg) of the total SRP/CS shall be high, including the accumulation of faults. The MTTFD of each of the redundant channels shall be high. Measures against CCF shall be applied (see Annex F).
In Category 3, DCavg is required to be “at least low” [1, 6.2.6], meaning 60% ≤ DCavg < 90% [1, Table 5]. So the minimum DCavg goes from 60% in Category 3 to 99% in Category 4.
These three sentences give the designer the criteria for diagnostic coverage, channel failure rates and common cause failure protection. As you can see, the ability to diagnose failures automatically is a critical part of the design, as is the use of highly reliable components, leading to highly reliable channels. The strongest CCF protection you can include in the design is also needed, although the “passing score” of 65 remains unchanged (see Annex F in ISO 13849-1 for more details on scoring your design).
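To make the DCavg requirement more concrete, here is a minimal sketch of the averaging formula from ISO 13849-1, Annex E, in which each part's DC is weighted by the inverse of its MTTFD. The function names and the example part values are my own assumptions for illustration; the range boundaries are the usual none/low/medium/high bands.

```python
def dc_avg(parts):
    """Average diagnostic coverage, weighted by 1/MTTFD.

    parts: iterable of (dc, mttfd_years) pairs, with dc as a fraction (0..1).
    """
    parts = list(parts)
    numerator = sum(dc / mttfd for dc, mttfd in parts)
    denominator = sum(1.0 / mttfd for _, mttfd in parts)
    return numerator / denominator


def dc_range(dc):
    """Map a DC value (fraction) to its qualitative range."""
    if dc < 0.60:
        return "none"
    if dc < 0.90:
        return "low"
    if dc < 0.99:
        return "medium"
    return "high"


# Hypothetical parts: two well-monitored parts and one weaker, less reliable one.
example_parts = [(0.99, 100.0), (0.99, 50.0), (0.90, 30.0)]
result = dc_avg(example_parts)
print(f"DCavg = {result:.1%} ({dc_range(result)})")  # DCavg = 94.3% (medium)
```

Notice how the one part with 90% coverage and the shortest MTTFD pulls the average well below the 99% needed for Category 4, because the least reliable parts carry the most weight.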
NOTE 1
Category 4 system behaviour is characterized by
- continued performance of the safety function in the presence of a single fault,
- detection of faults in time to prevent the loss of the safety function,
- the accumulation of undetected faults is taken into account.
NOTE 2
The difference between category 3 and category 4 is a higher DCavg in category 4 and a required MTTFD of each channel of “high” only. In practice, the consideration of a fault combination of two faults may be sufficient.
Note 1 expands on the first paragraph in the definition, further clarifying the performance requirements with explicit statements. Notice that nowhere is there a requirement that single faults or the accumulation of single faults be prevented, only that they be detected by the diagnostic system. Prevention of single faults is nearly impossible since components do fail. It is important to understand 1) which components are critical to the safety function and 2) what kinds of faults each component is likely to have. This is fundamental to designing a diagnostic system that can detect faults.
The category relies on redundancy to ensure that the complete loss of one channel will not cause the loss of the safety function. However, this is only useful if the common cause failures have been properly dealt with. Otherwise, a single event could wipe out both channels simultaneously, causing the loss of the safety function and possibly resulting in an injury or fatality.
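As a rough illustration of how the CCF scoring works, here is a sketch that tallies the points for the measures applied and compares the total with the passing score of 65. The measure names are my own shorthand, and the point values are as I recall them from Annex F of ISO 13849-1; check them against the standard before relying on them.

```python
# Points per CCF measure (shorthand names; values to be verified against Annex F).
CCF_MEASURES = {
    "separation": 15,               # physical separation of signal paths
    "diversity": 20,                # different technologies or principles
    "overvoltage_protection": 15,   # over-voltage, over-pressure, etc.
    "well_tried_components": 5,
    "fmea": 5,                      # failure analysis results considered
    "training": 5,                  # designers trained on CCF
    "emc_contamination": 25,
    "other_environmental": 10,      # temperature, shock, vibration, humidity
}
CCF_PASS_SCORE = 65


def ccf_score(measures_applied):
    """Sum the points for each distinct measure that is fully applied."""
    applied = set(measures_applied)
    unknown = applied - CCF_MEASURES.keys()
    if unknown:
        raise ValueError(f"Unknown CCF measures: {sorted(unknown)}")
    return sum(CCF_MEASURES[name] for name in applied)


# Hypothetical design: separation, over-voltage protection and both
# environmental measures are in place.
design = ["separation", "overvoltage_protection",
          "emc_contamination", "other_environmental"]
score = ccf_score(design)
print(score, "pass" if score >= CCF_PASS_SCORE else "fail")  # 65 pass
```

Points are only earned for a measure that is fully implemented, so a partially applied measure should be left out of the list.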
Also, notice that multiple single faults are permitted, as long as the accumulation does not result in the loss of the safety function. ISO 13849 allows for “fault exclusion,” a concept not used in North American standards.
The final sentence from Note 2 suggests that consideration of two concurrent faults may be enough, but be careful. You need to look closely at the fault lists to see if any groups of high-probability faults are likely to occur concurrently. If there are, you need to assess these combinations of faults, whether there are 5 or 50 to be evaluated.
Fault Exclusion
Fault exclusion involves assessing the types of faults that can occur in each component in the system’s critical path. The decision to exclude certain kinds of faults is always a technical compromise between the theoretical improbability of the fault, the expertise of the designer(s) and engineers involved and the specific technical requirements of the application. Whenever the decision is made to exclude a particular type of fault, the decision and the process used to make it must be documented in the Reliability Report included in the design file. Section 7.3 of ISO 13849-1 provides guidance on fault exclusion.
In the section discussing Category 1, the standard has this to say about fault exclusion and the difference between “well-tried components” and “fault exclusion”:
It is important that a clear distinction between “well-tried component” and “fault exclusion” (see Clause 7) be made. The qualification of a component as being well-tried depends on its application. For example, a position switch with positive opening contacts could be considered as being well-tried for a machine tool, while at the same time as being inappropriate for application in a food industry — in the milk industry, for instance, this switch would be destroyed by the milk acid after a few months. A fault exclusion can lead to a very high PL, but the appropriate measures to allow this fault exclusion should be applied during the whole lifetime of the device. In order to ensure this, additional measures outside the control system may be necessary. In the case of a position switch, some examples of these kinds of measures are
- means to secure the fixing of the switch after its adjustment,
- means to secure the fixing of the cam,
- means to ensure the transverse stability of the cam,
- means to avoid over-travel of the position switch, e.g. adequate mounting strength of the shock absorber and any alignment devices, and
- means to protect it against damage from outside.
To assist the designer, ISO 13849-2 provides lists of typical faults and the allowable exclusions in Annex D.5. For example, let’s consider the typical situation where a robust guard interlocking device has been selected. The decision has been made to use redundant electrical circuits for the switching components in the interlock, so electrical faults can be detected. But what about mechanical failures? A fault list is needed:
| # | Fault Description | Result | Likelihood |
|---|---|---|---|
| 1 | Key breaks off | Control system cannot determine guard position. Complete failure of system through a single fault. | Unlikely |
| 2 | Screws mounting key to guard fail | Control system cannot determine guard position. Complete failure of system through a single fault. | Unlikely |
| 3 | Screws mounting interlock device to guard fail | Control system cannot determine guard position. Complete failure of system through a single fault. | Unlikely |
| 4 | Key and interlock device misaligned | Guard cannot close, preventing machine from operating. | Very likely |
| 5 | Key and interlock device misaligned | Key and/or interlock device damaged. Guard may not close, or the key may jam in the interlock device once closed. Machine is inoperable if the interlock cannot be completed, or the guard cannot be opened if the key jams in the device. | Likely |
| 6 | Screws mounting key to guard removed by user | Interlock can now be bypassed by fixing the key into the interlocking device. Control system can no longer sense the position of the guard. | Likely |
| 7 | Screws mounting interlock device to guard removed by user | Probably combined with the preceding condition. Control system can no longer sense the position of the guard. | Unlikely, but could happen |
There may be more failure modes, but let’s limit them to this list for this discussion.
Looking at Fault 1, several things could result in a broken key. They include misalignment of the key and the interlock device, lack of maintenance on the guard and the interlocking hardware, or intentional damage by a user. Unless the hardware is exceptionally robust, including the design of the guard and any alignment features incorporated in the guarding, developing a sound rationale for excluding this fault will be very difficult.
Fault 2 considers the mechanical failure of the mounting screws for the interlock key. Screws are considered well-tried components (see Annex A.5), so you can consider them for fault exclusion. You can improve their reliability by using thread-locking adhesives when installing the screws to prevent them from vibrating loose, and by using “tamper-proof” style screw heads to deter unauthorized removal. Including these measures will support any decision to exclude these faults. The same measures help to address Faults 3, 6 and 7 as well.
Faults 4 & 5 occur frequently and are often caused by poor device selection (e.g. an interlock device intended for straight-line sliding-gate applications is chosen for a hinged gate) or by poor guard design (e.g. the guard is poorly guided by the retention mechanism and can be closed in a misaligned condition). The rationale for excluding these faults must discuss the design features that will prevent these conditions.
Excluding any other kind of fault follows the same process: Develop the fault list, assess each fault against the relevant Annex from ISO 13849-2, and determine if there are preventative measures that can be designed into the product and whether these provide sufficient risk reduction to allow the exclusion of the fault from consideration.
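The process above can be sketched as a simple record for each fault. This is only an illustration: the `Fault` class, its fields and the rule that an exclusion needs both a design measure and a written rationale are my own simplifications, not something taken from ISO 13849-2. The engineering judgement about whether a measure is actually sufficient still has to be made by the designer.

```python
from dataclasses import dataclass, field


@dataclass
class Fault:
    number: int
    description: str
    likelihood: str
    measures: list = field(default_factory=list)  # design measures supporting exclusion
    rationale: str = ""                           # justification for the design file

    def exclusion_supported(self) -> bool:
        # Simplified rule: at least one measure AND a documented rationale.
        return bool(self.measures) and bool(self.rationale.strip())


def faults_still_in_scope(faults):
    """Return the numbers of faults that cannot yet be excluded."""
    return [f.number for f in faults if not f.exclusion_supported()]


# Two entries from the fault list above, with hypothetical supporting data.
fault_list = [
    Fault(1, "Key breaks off", "Unlikely"),
    Fault(2, "Screws mounting key to guard fail", "Unlikely",
          measures=["thread-locking adhesive", "tamper-resistant screw heads"],
          rationale="Screws secured against loosening and unauthorized removal."),
]
print(faults_still_in_scope(fault_list))  # [1]
```

Any fault that stays in scope has to be handled by the diagnostics or designed out, and every exclusion, with its rationale, belongs in the design file.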
DCavg and MTTFD requirements
NOTE 2 The difference between category 3 and category 4 is a higher DCavg in category 4 and a required MTTFD of each channel of “high” only.
The first sentence in Note 2 clarifies the two main differences from a design standpoint, aside from the additional fault tolerance requirements: better diagnostics and much higher requirements for the MTTFD of individual components and, therefore, of each channel.
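The numeric side of these requirements can be summarized in a short check. This is a sketch under stated assumptions: the function name and return format are hypothetical, “high” MTTFD is taken as at least 30 years per channel, “high” DCavg as at least 99%, and the CCF passing score as 65. It says nothing about the qualitative requirements, such as single-fault tolerance or well-tried safety principles, which still have to be assessed separately.

```python
MTTFD_HIGH_MIN_YEARS = 30.0   # lower bound of the "high" MTTFD range
DC_HIGH_MIN = 0.99            # lower bound of the "high" DCavg range
CCF_PASS_SCORE = 65           # minimum Annex F score


def category4_shortfalls(mttfd_channels, dc_avg, ccf_score):
    """List the numeric Category 4 criteria that a design does not meet.

    mttfd_channels: MTTFD of each redundant channel, in years.
    dc_avg: average diagnostic coverage as a fraction (0..1).
    ccf_score: total CCF score (0..100).
    """
    issues = []
    for index, mttfd in enumerate(mttfd_channels, start=1):
        if mttfd < MTTFD_HIGH_MIN_YEARS:
            issues.append(f"channel {index}: MTTFD {mttfd} a is below 'high'")
    if dc_avg < DC_HIGH_MIN:
        issues.append(f"DCavg {dc_avg:.1%} is below 'high'")
    if ccf_score < CCF_PASS_SCORE:
        issues.append(f"CCF score {ccf_score} is below {CCF_PASS_SCORE}")
    return issues


# Hypothetical design that would be acceptable as Category 3 on DCavg
# but falls short of Category 4 on all three counts.
for issue in category4_shortfalls([45.0, 20.0], dc_avg=0.95, ccf_score=60):
    print(issue)
```

An empty list only means the numbers are in range; it is a starting point for the validation work, not a substitute for it.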
The Block Diagram
The block diagram for Category 4 is almost identical to that for Category 3 and was updated by Corrigendum 1 to the diagram below. The text from the corrigendum that accompanies the diagram has this to say about the change:
Replace the drawing showing the designated architecture for category 4 with the following drawing. This corrects the arrowed lines labeled “m” between L1 and O1, and L2 and O2, by changing them from dashed to solid lines, representing higher diagnostic coverage.
I’ve highlighted this area using red ovals in Figure 12 to make it easier to see.
Here is Figure 11 for comparison. Notice that the “m” lines are solid in Figure 12 and dashed in Figure 11? Subtle, but significant! There are no other differences between the diagrams.
I went looking for a circuit diagram to support the block diagram but couldn’t find one from a commercial source that I could share with you. Considering that the primary differences are in the chosen components’ reliability and how the testing is done, this isn’t surprising. The basic physical construction of the two categories can be virtually identical.
Applications
The following is not from the standards – this is my opinion based on more than 30 years of practice.
In the past, many manufacturers decided to apply Category 4 architecture without really understanding the design implications because they believed that it was “the best.” With the change in the harmonization of EN 954-1 [3] and ISO 13849-1 under the EU Machinery Directive that came into force on 29-Dec-2011, and considering the great difficulty that many manufacturers have had in properly implementing EN 954-1, I can easily imagine manufacturers reasoning that, because they already have Category 4 SRP/CS on their systems, they can now claim PLe SRP/CS performance. This is a bad decision for a lot of reasons:
- ISO 13849-1 PLe, Category 4 systems should be reserved for dangerous machinery where the risk assessment warrants the technical effort and expense involved. Attempting to apply this level of design to machinery where a PLb performance level is more suitable based on a risk assessment is a waste of design time and effort and a needless expense. The product family standards for these types of machines, such as EN 201 [4] for plastic injection moulding machines, EN 692 [5], [6] for mechanical power presses or EN 693 [7], [8] for hydraulic power presses, will explicitly specify the PL required for these machines.
- Manufacturers have frequently claimed EN 954-1 Category 4 performance based on the rating of the safety relay alone, without understanding that the rest of the SRP/CS must be considered, and clearly, this is wrong. The SRP/CS must be evaluated as a complete system.
This lack of understanding endangers the users, the maintenance personnel, the owners and the manufacturers. Suppose they continue this approach and an injury occurs. In that case, I believe the courts will have more than enough evidence in the defendant’s published documents to cause serious legal grief.
As designers involved with the safety of our company’s products or our co-workers’ safety, I believe that we owe it to everyone who uses our products to be educated and to apply these concepts correctly. The fact that you have read all of the posts leading up to this one is evidence that you are working on getting educated.
Always conduct a risk assessment and use the outcome from that work to guide your selection of safeguarding measures, complementary protective measures and the performance of the SRP/CS that ties those systems together. Choose performance levels that make sense based on the required risk reduction and ensure that the design criteria are met by validating the system once built.
As always, I welcome your comments and questions! Please feel free to comment below. I will respond to all your comments.
References
[1] Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design, ISO 13849-1. International Organization for Standardization (ISO). 2015.
[2] Safety of machinery — Safety-related parts of control systems — Part 2: Validation, ISO 13849-2. International Organization for Standardization (ISO). 2012.
[3] Safety of machinery — Safety related parts of control systems — Part 1: General principles for design, EN 954-1. European Committee for Standardization (CEN). 1996.
[4] Plastics and rubber machines — Injection moulding machines — Safety requirements, EN 201. European Committee for Standardization (CEN). 2009.
[5] Machine tools — Mechanical presses — Safety, EN 692. European Committee for Standardization (CEN). 2005+A1:2009. (withdrawn)
[6] Machine tools safety — Presses — General safety requirements, EN ISO 16092-1. European Committee for Standardization (CEN). 2018.
[7] Machine tools — Safety — Hydraulic presses, EN 693. European Committee for Standardization (CEN). 2001+A2:2011. (withdrawn)
[8] Machine tools safety — Presses — Safety requirements for hydraulic presses, EN ISO 16092-3. European Committee for Standardization (CEN). 2018.
© 2011 – 2022, Compliance inSight Consulting Inc.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Very informative read. One other thing I see frequently in my career is the design of equipment that ALMOST meets a given category but has a small number (often only one!) of critical design errors that negate the rest of the design and potentially lead to a dangerous situation. For example, a robot cell with an entrance for an automatic guided vehicle that is protected by light curtains cannot reasonably be considered to be Category 3 (or even 2!) if the light curtain muting is accomplished by means of a traditional PLC driving relays… Even a pair of force-guided relays does no good, as the PLC output failing “on” will create a (potentially) undetected dangerous situation where the light curtain is “muted” (more accurately, it is bypassed) unintentionally and without any indication that this has occurred.
The design is ONLY as good as the weakest link.
Thanks, Duncan!
I’ve seen exactly the same kinds of issues that you mention, and many others. One issue that ISO/TC 199 is working on right now is developing better data for fluidic components. If you look at the estimated MTTFD for a hydraulic valve with nop ≥ 1,000,000 cycles/a in ISO 13849-1, Table C.1, you’ll see an MTTFD = 150 a, with three additional entries for 500k-1 MM ops/a, 250k-500k ops/a and finally < 250k ops/a, but if you look at “pneumatic component” in the same table, you see only B10D = 20 MM cycles. The TC is working with fluidic component manufacturers to begin providing better reliability data for these kinds of components, as there is currently little data available to most safety engineers unless your employer gives you access to exida’s exSILentia databases. If you go to another free source, like MIL-HDBK-217, you won’t find any fluidic components listed at all. Hopefully, we will soon be able to include better failure rate information on these ubiquitous components for use with ISO 13849.