Interlock Architectures – Pt. 1: What do those categories really mean?

This post has been updated since it was first written in 2010.

If you are new to functional safety, new to the design of control systems for machinery, or both, this post and the subsequent posts covering the five architectural categories provided in ISO 13849-1 are for you. These categories are similar to those in EN 954-1:1996 but have been expanded to include some additional criteria. This post explores the categories to give you an introduction to the concepts used in ISO 13849-1.

Note that when this post was first written, ISO 13849-1:2006 was current. Since then, a new edition was published in 2015, and yet another is expected to be published by May 2021. The definitions discussed in this post are still valid.

What do those categories really mean?

The architectures used as the basis of interlock design and analysis have a long history. Two basic systems existed in the early days: the ANSI categories, with a CSA variant, and the CEN categories.

The ANSI/CSA architectures were called SIMPLE, SINGLE CHANNEL, SINGLE CHANNEL-MONITORED, and CONTROL RELIABLE. The basic system arose in the ANSI/RIA R15.06 1992 standard and was used until 2014. The CSA variant used the same names as the ANSI version but made a small differentiation in the CONTROL RELIABLE category. This differentiation was very subtle and was often completely misunderstood by readers. This system was introduced in Canada in CSA Z434-1994 and was discontinued in 2016. This system of safety-related control system architecture categories is no longer used in any jurisdiction.

And then there was EN 954-1

In 1996 CEN published an important standard for machine builders – EN 954-1, “Safety of Machinery – Safety Related Parts of Control Systems – Part 1: General Principles for Design” [1]. This standard set the stage for defining control reliability in machinery safeguarding systems, introducing the Reliability categories that have become ubiquitous. So what do these categories mean, and how are they applied under the latest machinery functional safety standard, ISO 13849-1 [2]?

The Categories

The categories are used to describe system architectures for safety-related control systems. Each architecture carries with it a range of reliable performance that can be related to the degree of risk reduction you are expecting to achieve with the system. These architectures can be applied equally to electrical, electronic, pneumatic, hydraulic or mechanical control systems.

Historical Circuits

Early electrical ‘master-control-relay’ circuits used a simple architecture with a single contactor, or sometimes two, and a single-channel style of architecture to maintain the contactor coil circuit once the START or POWER ON button (PB2 in Fig. 1) had been pressed. Power to the output elements of the machine controls was supplied via contacts on the contactor, which is why it was called the Master Control Relay or ‘MCR’. The POWER OFF button (PB1 in Fig. 1) could be labelled that way, or you could make the same circuit into an Emergency Stop by simply replacing the pushbutton operator with a red mushroom-head operator. These devices were usually spring-return, so to restore power, all that was needed was to push the POWER ON button again (Fig. 1).

Typically, the components used in these circuits were specified to meet the circuit conditions, but not more. Control manufacturers brought out over-dimensioned versions, such as Allen-Bradley’s Bulletin 700-PK contactor which had 20 A rated contacts instead of the standard Bulletin 700’s 10 A contacts.

When interlocked guards began to show up, they were integrated into the original MCR circuit by adding a basic control relay (CR1 in Fig. 2) whose coil was controlled by the interlock switch(es) (LS1 in Fig. 2), and whose output contacts were in series with the coil circuit of the MCR contactor. Opening the guard interlock would open the MCR coil circuit and drop power to the machine controls. Very simple.
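To make the seal-in logic concrete, here is a minimal Python sketch that treats the Fig. 2 circuit as a Boolean model. It assumes conventional wiring rather than transcribing the figure: PB1 is normally closed and in series with the CR1 contact and the MCR coil, and PB2 (START) is wired in parallel with a normally-open seal-in contact on the MCR. The class name `McrCircuit` and its `scan()` method are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class McrCircuit:
    """Hypothetical Boolean model of the single-channel MCR circuit with guard relay CR1."""

    mcr_energized: bool = False  # current state of the MCR contactor coil

    def scan(self, stop_pressed: bool, start_pressed: bool, guard_closed: bool) -> bool:
        # CR1 coil follows the guard interlock switch LS1: energized only while the guard is closed.
        cr1_energized = guard_closed
        # Assumed wiring: normally-closed PB1 and the CR1 contact in series with the MCR coil,
        # with START (PB2) in parallel with the MCR's own normally-open seal-in contact.
        self.mcr_energized = (
            (not stop_pressed)
            and cr1_energized
            and (start_pressed or self.mcr_energized)
        )
        return self.mcr_energized


mcr = McrCircuit()
print(mcr.scan(stop_pressed=False, start_pressed=True, guard_closed=True))    # True: START pressed
print(mcr.scan(stop_pressed=False, start_pressed=False, guard_closed=True))   # True: sealed in
print(mcr.scan(stop_pressed=False, start_pressed=False, guard_closed=False))  # False: guard opened
print(mcr.scan(stop_pressed=False, start_pressed=False, guard_closed=True))   # False: guard closed again
```

Closing the guard again does not restart anything in this model: the seal-in dropped out when CR1 released, so START has to be pressed again.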

‘Ice-cube’ style plug-in relays were often chosen for CR1. These devices did not have ‘force-guided’ contacts in them, so it was possible to have one contact in the relay fail while the other continued to operate properly.

LS1 could be any kind of switch. Frequently a ‘micro-switch’ style of limit switch was chosen. These snap-action switches could fail shorted internally, or weld closed and the actuator would continue to work normally even though the switch itself had failed. These switches are also ridiculously easy to bypass. All that is required is a piece of tape or an elastic band and the switch is no longer doing its job.

The problem with these circuits is that they can fail in a number of ways that aren’t obvious to the user, with the result being that the interlock might not work as expected, or the Emergency Stop might fail just when you need it most.
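The single-channel weakness can be shown with a small fault-injection sketch. The hypothetical function `controls_powered()` below assumes the machine is already running with the MCR sealed in, and models only the fail-to-danger modes described above: LS1 failing closed (or being taped over), a CR1 contact welding, and the MCR contactor welding. Each fault is injected on its own, then the guard is opened.

```python
def controls_powered(guard_closed: bool, stop_pressed: bool,
                     ls1_failed_closed: bool = False,
                     cr1_contact_welded: bool = False,
                     mcr_welded: bool = False) -> bool:
    """Return True if power is still applied to the machine controls.

    Hypothetical model of the single-channel MCR circuit with guard relay,
    assuming the MCR is already sealed in (machine running).
    """
    ls1_closed = guard_closed or ls1_failed_closed          # a taped-over switch behaves the same way
    cr1_contact_closed = ls1_closed or cr1_contact_welded   # CR1 follows LS1 unless a contact welds
    mcr_coil = cr1_contact_closed and not stop_pressed      # PB1 assumed normally closed, in series
    return mcr_coil or mcr_welded                           # welded MCR contacts ignore the coil


SINGLE_FAULTS = ["ls1_failed_closed", "cr1_contact_welded", "mcr_welded"]

print("no fault, guard opened -> power still on:",
      controls_powered(guard_closed=False, stop_pressed=False))
for fault in SINGLE_FAULTS:
    still_on = controls_powered(guard_closed=False, stop_pressed=False, **{fault: True})
    print(f"{fault}, guard opened -> power still on: {still_on}")
```

Every fault in the list leaves power on with the guard open, and a welded MCR also defeats the stop button. That is the behaviour the Category B note describes: a single fault can lead to the loss of the safety function.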

Modern Architectures

Category B

These original circuits are the basis for what became known as ‘Category B’ (‘B’ for ‘Basic’) circuits. Here’s the definition from the standard. Note that I am taking this excerpt from ISO 13849-1:2006 (Edition 2). “SRP/CS” stands for “Safety-Related Parts of Control Systems”:

6.2.3 Category B
The SRP/CS shall, as a minimum, be designed, constructed, selected, assembled and combined in accordance with the relevant standards and using basic safety principles for the specific application to withstand

• the expected operating stresses, e.g. the reliability with respect to breaking capacity and frequency,
• the influence of the processed material, e.g. detergents in a washing machine, and
• other relevant external influences, e.g. mechanical vibration, electromagnetic interference, power supply interruptions or disturbances.

There is no diagnostic coverage (DCavg = none) within category B systems and the MTTFd of each channel can be low to medium. In such structures (normally single-channel systems), the consideration of CCF is not relevant.

The maximum PL achievable with category B is PL = b.

NOTE When a fault occurs it can lead to the loss of the safety function.

Specific requirements for electromagnetic compatibility are found in the relevant product standards, e.g. IEC 61800-3 for power drive systems. For functional safety of SRP/CS in particular, the immunity requirements are relevant. If no product standard exists, at least the immunity requirements of IEC 61000-6-2 should be followed. [1]

The standard [1] also provides us with a nice logic block diagram of what a single-channel system might look like:

If you look at this block diagram and the Start/Stop Circuit with Guard Relay above, you can see how this basic circuit translates into a single channel architecture, since from the control inputs to the controlled load you have a single channel. Even the guard loop is a single channel. A failure in any component in the channel can result in loss of control of the load.

Let’s look at each part of this requirement in more detail, since each of the subsequent Categories builds upon these BASIC requirements.

The SRP/CS shall, as a minimum, be designed, constructed, selected, assembled and combined in accordance with the relevant standards and using basic safety principles for the specific application…

[1]

Basic Safety Principles

We have to go to ISO 13849-2 [3] to get a definition of what Basic Safety Principles might include. Looking at Annex A.2 of the standard we find:

As you can see, the basic safety principles are pretty basic – select components appropriately for the application, consider the operating conditions for the components, follow manufacturer’s data, and use de-energization to create the stop function. That way, a loss of power results in the system failing into a safe state, as does an open relay coil or set of burnt contacts.

“…the expected operating stresses, e.g. the reliability with respect to breaking capacity and frequency,”

Specify your components correctly with regard to voltage, current, breaking capacity, temperature, humidity, dust,…

“…other relevant external influences, e.g. mechanical vibration, electromagnetic interference, power supply interruptions or disturbances.”

“Specific requirements for electromagnetic compatibility are found in the relevant product standards, e.g. IEC 61800-3 for power drive systems. For functional safety of SRP/CS in particular, the immunity requirements are relevant. If no product standard exists, at least the immunity requirements of IEC 61000-6-2 should be followed.”

Probably the biggest ‘gotcha’ in this point is “electromagnetic interference”. This is important enough that the standard devotes a paragraph to it specifically. Note that the paragraph ties the immunity requirements to the functional safety of SRP/CS; other posts on this blog cover that topic. If your product is destined for the European Union (EU), then you will almost certainly be doing some EMC testing, unless your product is a ‘fixed installation’. If it’s going to almost any other market, you are probably not undertaking this testing. So how do you know if your design meets these criteria? Unless you test, you don’t. You can make some educated guesses based on sound engineering practices, but after that you can only hope.

Diagnostic Coverage

“…There is no diagnostic coverage (DCavg = none) within category B systems…”

Category B systems are fundamentally single-channel. A single fault in the system will lead to the loss of the safety function. This sentence refers to the concept of “diagnostic coverage” that was introduced in ISO 13849-1:2006, but what this means in practice is that there is no monitoring or feedback from any critical elements. Remember our basic MCR circuit? If the MCR contactor welded closed, the only diagnostic was the failure of the machine to stop when the emergency stop button was pressed.

Component Failure Rates

“…the MTTFd of each channel can be low to medium.”

This part of the statement is referring to another new concept from ISO 13849-1:2006, “MTTFd”. Standing for “Mean Time to Dangerous Failure”, this concept looks at the expected time, expressed in years, before a component or channel fails in a dangerous way. Calculating MTTFd is a significant part of implementing the new standard. From the perspective of understanding Category B, what this means is that you do not need to use high-reliability components in these systems.
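As a rough illustration, the sketch below combines component values into a channel MTTFd using the parts-count approach described in ISO 13849-1 (1/MTTFd of the channel is the sum of 1/MTTFd of each component in series), then sorts the result into the standard’s bands: low is 3 to under 10 years, medium is 10 to under 30 years, and high is 30 to 100 years. The component values are hypothetical, and so are the function names. For electromechanical parts such as contactors, the standard derives MTTFd from B10d and the number of operations per year, which this sketch skips.

```python
from typing import Sequence


def channel_mttfd(component_mttfd_years: Sequence[float]) -> float:
    """Parts-count combination for a single channel: 1/MTTFd = sum(1/MTTFd_i)."""
    return 1.0 / sum(1.0 / m for m in component_mttfd_years)


def mttfd_band(mttfd_years: float) -> str:
    """Sort a channel MTTFd (in years) into the low/medium/high bands."""
    if mttfd_years < 3:
        return "below low band"
    if mttfd_years < 10:
        return "low"
    if mttfd_years < 30:
        return "medium"
    return "high"


# Hypothetical component MTTFd values, in years, for LS1, CR1 and the MCR contactor.
components = {"LS1": 50.0, "CR1": 80.0, "MCR": 40.0}
mttfd = channel_mttfd(list(components.values()))
print(f"channel MTTFd = {mttfd:.1f} years ({mttfd_band(mttfd)})")
```

Note how the channel value ends up lower than any single component: every part in a single channel adds to the dangerous failure rate, which is why a Category B channel with ordinary components tends to land in the low-to-medium range.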

Common Cause Failures

“In such structures (normally single-channel systems), the consideration of CCF is not relevant.”

CCF is another new concept from ISO 13849-1:2006, and stands for “Common Cause Failure”. I’m not going to get into this in any detail here, but suffice it to say that design techniques, channel separation (impossible in a single-channel architecture) and other measures are used to reduce the likelihood of CCF in higher-reliability systems.

Performance Levels – PL

“The maximum PL achievable with category B is PL = b.”

PL stands for “Performance Level”. Five Performance Levels have been defined, from ‘a’ to ‘e’. Each Performance Level represents a band of values of the average probability of dangerous failure per hour, from PLa (the highest probability, i.e. the least reliable) to PLe (the lowest).
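The five bands can be expressed as a simple lookup on the average probability of dangerous failure per hour. The limits in the table below are the band limits given in ISO 13849-1; the function name `performance_level()` is hypothetical. In practice the standard also ties the achievable PL to the category, DCavg, MTTFd and CCF measures, so this lookup only illustrates where the band boundaries sit.

```python
from typing import Optional

# (PL, lower limit inclusive, upper limit exclusive), in dangerous failures per hour
PL_BANDS = [
    ("a", 1e-5, 1e-4),
    ("b", 3e-6, 1e-5),
    ("c", 1e-6, 3e-6),
    ("d", 1e-7, 1e-6),
    ("e", 1e-8, 1e-7),
]


def performance_level(pfhd: float) -> Optional[str]:
    """Return the PL band that an average dangerous failure rate per hour falls into."""
    for pl, lower, upper in PL_BANDS:
        if lower <= pfhd < upper:
            return pl
    return None  # outside the range covered by PLa to PLe


for value in (5e-5, 1e-5, 4e-6, 2e-7):
    print(f"{value:.0e} per hour -> PL{performance_level(value)}")
```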

For example, PLa, the band with the highest probability of failure per hour, covers an average probability of dangerous failure per hour of ≥ 10⁻⁵ to < 10⁻⁴. This average probability is referred to as the Probability of Dangerous Failure per Hour (PFHd). To make PFHd a bit easier to understand, you can convert it to years-to-failure using the following calculations. I’m going to assume that the control system is operating 24/7/365 (8760 hours per year), but you can adjust the result for other operating periods by changing the number of hours in the year.

\tag{1}
\frac{1\times10^{-4}\:\text{failures}}{\text{hour}}\times\frac{8760\:\text{hours}}{\text{year}}=0.876\:\text{failures per year}

Now that we know how many failures per year we’re dealing with, we need to convert to the number of years to failure.

\tag{2}
\frac{1\:\text{failure}}{0.876\:\text{failures per year}}=1.142\:\text{years-to-failure}

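What this means is that, at the worst-case end of the PLa band, the mean time to a dangerous failure is as little as 1.142 years of continuous operation. That is an average, not a deadline: assuming a constant failure rate, the probability of having experienced a failure by that point is about 63%, not 100%. We can convert years-to-failure to hours-to-failure by multiplying the years by 8760.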

\tag{3}
\frac{1.142\:\text{years}}{\text{failure}}\times\frac{8760\:\text{hours}}{\text{year}}=10,004\:\text{hours-to-failure}

Let’s calculate the other limit for the PLa band.

\tag{4}
\frac{1\times10^{-5}\:\text{failures}}{\text{hour}}\times\frac{8760\:\text{hours}}{\text{year}}=0.0876\:\text{failures per year}

Since we moved one order of magnitude smaller (10⁻⁴ to 10⁻⁵), it makes sense that the failure rate got smaller by the same factor. Calculating the years-to-failure we get:

\tag{5}
\frac{1\:\text{failure}}{0.0876\:\text{failures per year}}=11.42\:\text{years-to-failure}

PLb covers ≥ 3 × 10⁻⁶ to < 10⁻⁵ failures per hour. Calculating the lower limit we get:

\tag{6}
\frac{3\times10^{-6}\:\text{failures}}{\text{hour}}\times\frac{8760\:\text{hours}}{\text{year}}=0.02628\:\text{failures per year}
\tag{7}
\frac{1\:\text{failure}}{0.02628\:\text{failures per year}}=38.05\:\text{years-to-failure}
\tag{8}
\frac{38.05\:\text{years}}{\text{failure}}\times \frac{8760\:\text{hours}}{\text{year}}\approx333{,}333\:\text{hours-to-failure}

The upper limit of the PLb band is the same as the lower limit of the PLa band, so I won’t calculate that again.
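The conversions in equations (1) through (8) reduce to a couple of lines of arithmetic. Here is a small Python sketch under the same 24/7/365 assumption (8760 hours per year); the helper names are illustrative only. Working at full precision, hours-to-failure is simply 1/PFHd, which is why the PLa case comes out at 10,000 hours here rather than the 10,004 obtained from the rounded 1.142 years.

```python
HOURS_PER_YEAR = 8760  # continuous 24/7/365 operation


def failures_per_year(pfhd: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Equations (1), (4), (6): PFHd x operating hours per year."""
    return pfhd * hours_per_year


def years_to_failure(pfhd: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Equations (2), (5), (7): reciprocal of the annual failure rate."""
    return 1.0 / failures_per_year(pfhd, hours_per_year)


def hours_to_failure(pfhd: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Equations (3), (8): years-to-failure x hours per year, which equals 1/PFHd."""
    return years_to_failure(pfhd, hours_per_year) * hours_per_year


for label, pfhd in [("PLa, PFHd = 1e-4", 1e-4),
                    ("PLa/PLb boundary, PFHd = 1e-5", 1e-5),
                    ("PLb, PFHd = 3e-6", 3e-6)]:
    print(f"{label}: {failures_per_year(pfhd):.5f} failures/year, "
          f"{years_to_failure(pfhd):.3f} years, {hours_to_failure(pfhd):,.0f} hours")
```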

While 38 years to failure sounds like a lot, it’s important to bear in mind that it is simply the mean time to failure, not the point at which failure becomes certain. You can have a failure occur the first time you use the safety function, or not have it fail until 38 years from the first time the function is used. Some machines may run considerably longer than that before a failure occurs. To get an idea about why that can happen, have a look at the bathtub curve and what it means for product life. When dealing with the probability of a safety function failing, these numbers represent some pretty high failure rates.
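If we treat PFHd as a constant dangerous failure rate λ (an exponential model, which is a simplification because it ignores the wear-in and wear-out ends of the bathtub curve), the probability that at least one dangerous failure has occurred by time t is 1 − e^(−λt). The hypothetical helper below applies that to the lower PFHd limit of the PLb band and shows that at the mean time to failure, t = 1/λ, the probability is about 63%.

```python
import math


def probability_of_dangerous_failure(pfhd: float, operating_hours: float) -> float:
    """P(at least one dangerous failure within operating_hours), assuming a constant rate."""
    return 1.0 - math.exp(-pfhd * operating_hours)


pfhd_plb = 3e-6                # lower PFHd limit of the PLb band
mean_hours = 1.0 / pfhd_plb    # about 333,333 hours, or about 38 years at 8760 hours/year

for years in (1, 10, 100):
    p = probability_of_dangerous_failure(pfhd_plb, years * 8760)
    print(f"after {years} years of 24/7 operation: {p:.1%}")
print(f"at the mean time to failure ({mean_hours:,.0f} h): "
      f"{probability_of_dangerous_failure(pfhd_plb, mean_hours):.1%}")
```

The same model shows why a failure in the very first year is possible even for a PLb function: the probability is small (a few percent with these numbers) but never zero.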

If you consider an operation running a single shift in Canada where the normal working year is 50 weeks and the normal workday is 7.5 hours, a working year is

\tag{9}
\frac{7.5\:\text{hours}}{\text{day}}\times\frac{5\:\text{days}}{\text{week}}\times\frac{50\:\text{weeks}}{\text{year}}=1875\:\text{hours/year}

Applying the failure rates per hour above yields:

PLa = one failure in 5.3 years of operation to one failure in 53.3 years of operation

PLb = one failure in 53.3 years of operation to one failure in 177.8 years of operation.

If we go to an operation running three shifts in Canada, a working year is:

\tag{10}
\frac{7.5\:\text{hours}}{\text{shift}}\times\frac{3\:\text{shifts}}{\text{day}}\times\frac{5\:\text{days}}{\text{week}}\times\frac{50\:\text{weeks}}{\text{year}}=5625\:\text{hours per year}

Taking the failure rates per hour above and recalculating, this yields:

PLa = one failure in 1.8 years of operation to one failure in 17.8 years of operation

PLb = one failure in 17.8 years of operation to one failure in 59.3 years of operation
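The shift-pattern figures above can be reproduced the same way: compute the operating hours per year as in equations (9) and (10), then take years per failure as 1 / (PFHd × hours per year). The defaults below (7.5-hour shifts, 5 days a week, 50 weeks a year) are the Canadian assumptions used above; the function names are illustrative.

```python
def operating_hours_per_year(hours_per_shift: float = 7.5, shifts_per_day: int = 1,
                             days_per_week: int = 5, weeks_per_year: int = 50) -> float:
    """Equations (9) and (10): operating hours in a working year."""
    return hours_per_shift * shifts_per_day * days_per_week * weeks_per_year


def years_per_failure(pfhd: float, hours_per_year: float) -> float:
    """Mean years of operation per dangerous failure: 1 / (PFHd x hours per year)."""
    return 1.0 / (pfhd * hours_per_year)


for shifts in (1, 3):
    hours = operating_hours_per_year(shifts_per_day=shifts)
    pla_worst = years_per_failure(1e-4, hours)   # upper PFHd limit of PLa
    boundary = years_per_failure(1e-5, hours)    # PLa/PLb boundary
    plb_best = years_per_failure(3e-6, hours)    # lower PFHd limit of PLb
    print(f"{shifts} shift(s), {hours:.0f} h/year: "
          f"PLa {pla_worst:.1f} to {boundary:.1f} years, PLb {boundary:.1f} to {plb_best:.1f} years")
```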

Except for the least hazardous machines, I can’t imagine too many employers that would be happy with a safety function on a machine that failed within two years from new!

Now you should be starting to get an idea about where this is going. It’s important to remember that probabilities are just that – the failure could happen in the first hour of operation or at any time after that, or never. These figures give you some way to gauge the relative reliability of the design and ARE NOT any sort of guarantee.

Watch for the next post in this series where I will look at Category 1 requirements!

References

[1] Safety of Machinery – Safety Related Parts of Control Systems – Part 1: General Principles for Design. CEN Standard EN 954-1. 1996.

[2] Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design. ISO Standard 13849-1. 2006.

[3] Safety of machinery — Safety-related parts of control systems — Part 2: Validation. ISO Standard 13849-2. 2003.

[4] Safety of machinery — Safety-related parts of control systems — Part 100: Guidelines for the use and application of ISO 13849-1. ISO Technical Report ISO/TR 13849-100. 2000.

[5] Safety of machinery — Safety-related parts of control systems — Part 1: General principles for design. CEN Standard EN ISO 13849-1. 2008.

Some Rights Reserved

13 thoughts on “Interlock Architectures – Pt. 1: What do those categories really mean?”

1. controlsgirl says:

Great explanation and translation into how these standards are applied in the real world. One thing that I think is often confusing is the definition of failure. I often have found myself wondering if the standard means failure of a device to work as expected or failure in the sense that someone had to press an e-stop, etc. Could you help clarify that in the different places that the word is mentioned? I know that sometimes it is more obvious than others.

1. Hey controlsgirl! Thanks for posting this question – it’s a good one.

When we’re talking about safety related controls there are a number of different types of failures we could be talking about. From the perspective of ISO 13849-1, what we care about are dangerous failures, meaning that the safety-related control function has failed in a way that immediately increases the risk to the operator. If a control function doesn’t work as expected, but no increase in risk occurs, it’s not a dangerous failure. If a dangerous failure occurs in a guard interlock, the result could be a situation where the operator opens the guard and the machine fails to stop. That is a dangerous failure.

To sum up, failures as discussed in ISO 13849-1 are always faults in the safety-related parts of the control system that result in an increase in risk to the operator. They may be dangerous-detected failures, or dangerous-undetected failures. The standard doesn’t pay any attention to safe failures, detectable or not.

Emergency stop is there to deal with ‘emergent’ conditions, i.e., failures that weren’t foreseen by the designer, and so aren’t dealt with by the automatic safety functions designed into the machine. For example, suppose a ‘silent’ failure occurs in the guard interlock we were talking about. ‘Silent’ means the control system diagnostics don’t detect it for whatever reason. The operator opens the guard and is immediately and unexpectedly exposed to the machine hazard, resulting in an injury. A co-worker presses the emergency stop to try to limit any additional harm that might occur, and then dials 911 (or 112, or whatever your local emergency phone number is). E-stops are considered ‘complementary protective measures’ because they complement the primary safeguards, like the guard interlocks.

I think that covers it. Let me know if you have any more questions!

1. Frank Bardoul says:

Hi Doug, I’m a certified worker Health and safety rep for USW 6571. I work for Gerdau Whitby Steel Mill. I am by no means technically savvy, shall we say. I just have a great sense of duty to make sure my brothers are safe when they are running their respective equipment. Currently, the company has a Category 3 Safety system installed in our Bar Mill finishing end. It consists of a Safety PLC Control Box with a Stop button and a Kirk key. You hit the stop button, wait, turn the Kirk key and place key in lock box, and place personal lock on the box. This Control box and lock box is located in the operator’s control pulpit. Now, the operator can open the gates (which will only open if the control box in the pulpit has the stop button depressed, key turned and removed from control box). The worker reps have maintained this system is NOT the equivalent or better than a hard lockout (physical lock on power source) and as such we use this safety system to quickly access the equipment to fix a minor problem, say, where we need to nudge or adjust a guide plate or cut a bar with a torch, so long as the operator doesn’t need to get up close and personal with the equipment. We demand a full hard lockout if millwrights or electricians or operators are required to get any part of their body too close or wrapped around any of the equipment. The company is now thinking of using a similar system in the rolling mill to take away power from the mill stands, or maybe using a category 4 system and using it as a standalone lockout in place of a hard physical lockout. I’m not sure yet. The company is doing a risk assessment tomorrow, with the engineering firm and has asked myself and my fellow JHSC members to attend and ask lots of questions. I have nothing but questions. My questions for you are, 1. Is a category 3 safety device the equivalent or better than a typical lockout/tagout procedure? 2. Is there a safety PLC system that is the equivalent or better than a typical lockout/tagout procedure? LOL… I’m quite sure the answer won’t be so cut and dried. But here’s hoping! I’m also interested in going to the seminar in Cambridge on May 9, 2018. Hopefully that will shed some more light. Thanks Doug. I know it’s a little long-winded.

1. Frank,

There is way more to unpack in your comment than I think I can do in this space. I would be happy to discuss this with you by phone tomorrow if you would like to do that. Also, these comments are public, and there may be discussions which are better kept private. I would be more than happy to discuss this with you by phone. I will contact you via email with my contact details.

Doug

1. Frank Bardoul says:

Doug, that would be greatly appreciated. I’m in our safety office at 8:00am to discuss with the other guys what our objectives will be when we listen to this risk assessment at 9:00am. So, I’ll shoot for early afternoon for a phone call if that’s good with you.
