Archive for the ‘Control Reliability’ Category

Interlock Architectures – Pt. 3: Category 2

Tuesday, August 24th, 2010

In the first two posts in this series, we looked at Category B, the Basic category of system architecture, and then moved on to look at Category 1. These two categories of system architecture underpin Categories 2, 3 and 4. In this post we’ll look more deeply into Category 2.

Let’s start by looking at the definition for Category 2, taken from ISO 13849-1:2007. Remember that in these excerpts, SRP/CS stands for Safety Related Parts of Control Systems.

Definition

6.2.5 Category 2

For category 2, the same requirements as those according to 6.2.3 for category B shall apply. “Well–tried safety principles” according to 6.2.4 shall also be followed. In addition, the following applies.

SRP/CS of category 2 shall be designed so that their function(s) are checked at suitable intervals by the machine control system. The check of the safety function(s) shall be performed

  • at the machine start-up, and
  • prior to the initiation of any hazardous situation, e.g. start of a new cycle, start of other movements, and/or
  • periodically during operation if the risk assessment and the kind of operation shows that it is necessary.

The initiation of this check may be automatic. Any check of the safety function(s) shall either

  • allow operation if no faults have been detected, or
  • generate an output which initiates appropriate control action, if a fault is detected.

Whenever possible this output shall initiate a safe state. This safe state shall be maintained until the fault is cleared. When it is not possible to initiate a safe state (e.g. welding of the contact in the final switching device) the output shall provide a warning of the hazard.

For the designated architecture of category 2, as shown in Figure 10, the calculation of MTTFd and DCavg should take into account only the blocks of the functional channel (i.e. I, L and O in Figure 10) and not the blocks of the testing channel (i.e. TE and OTE in Figure 10).

The diagnostic coverage (DCavg) of the total SRP/CS including fault-detection shall be low. The MTTFd of each channel shall be low-to-high, depending on the required performance level (PLr). Measures against CCF shall be applied (see Annex F).

The check itself shall not lead to a hazardous situation (e.g. due to an increase in response time). The checking equipment may be integral with, or separate from, the safety-related part(s) providing the safety function.

The maximum PL achievable with category 2 is PL = d.

NOTE 1 In some cases category 2 is not applicable because the checking of the safety function cannot be applied to all components.

NOTE 2 Category 2 system behaviour allows that

  • the occurrence of a fault can lead to the loss of the safety function between checks,
  • the loss of safety function is detected by the check.

NOTE 3 The principle that supports the validity of a category 2 function is that the adopted technical provisions, and, for example, the choice of checking frequency can decrease the probability of occurrence of a dangerous situation.

ISO 13849-1 Figure 10 - Category 2 Block diagram

Breaking it down

Let’s start by taking apart the definition a piece at a time and looking at what each part means. I’ll also show a simple circuit that can meet the requirements.

Category B & Well-tried Components

The first paragraph speaks to the building block approach taken in the standard:

For category 2, the same requirements as those according to 6.2.3 for category B shall apply. “Well–tried safety principles” according to 6.2.4 shall also be followed. In addition, the following applies.

Systems meeting Category 2 are required to meet all of the same requirements as Categories B & 1 as far as the components are concerned. Other requirements for the circuits are different, and we will look at those in a bit.

Self-Testing required

Category 2 brings in the idea of diagnostics. If correctly specified components have been selected (Category B), and those components can be considered ‘well-tried’ and are applied following ‘well-tried safety principles’ (Category 1), then adding a diagnostic component to the system should allow the system to detect some faults and therefore achieve a certain degree of ‘fault-tolerance’ or the ability to function correctly even when some aspect of the system has failed.

Let’s look at the text:

SRP/CS of Category 2 shall be designed so that their function(s) are checked at suitable intervals by the machine control system. The check of the safety function(s) shall be performed

  • at the machine start-up, and
  • prior to the initiation of any hazardous situation, e.g. start of a new cycle, start of other movements, and/or
  • periodically during operation if the risk assessment and the kind of operation shows that it is necessary.

The initiation of this check may be automatic. Any check of the safety function(s) shall either

  • allow operation if no faults have been detected, or
  • generate an output which initiates appropriate control action, if a fault is detected.

Whenever possible this output shall initiate a safe state. This safe state shall be maintained until the fault is cleared. When it is not possible to initiate a safe state (e.g. welding of the contact in the final switching device) the output shall provide a warning of the hazard.

Periodic checking is required. The checks must happen at least each time there is a demand placed on the system, i.e. a guard door is opened and closed, or an emergency stop button is pressed and reset. In addition the integrity of the SRP/CS must be tested at the start of a cycle or hazardous period, and potentially periodically during operation if the risk assessment indicates that this is necessary.

The testing does not have to be automatic, although in practice it usually is. As long as the system integrity is good, then the output is allowed to remain on, and the machinery or process can run.
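To make the sequence easier to follow, here is a minimal behavioural sketch of that check logic in Python. It is a model for reasoning about the requirements only, not something to use as the safety function itself. The names `Category2Monitor`, `State` and the `check` callable are my own illustrative inventions and do not come from the standard.

```python
from enum import Enum, auto


class State(Enum):
    SAFE = auto()     # hazardous outputs off, ready for a start request
    RUNNING = auto()  # last check passed, operation allowed
    FAULT = auto()    # fault detected, safe state latched


class Category2Monitor:
    """Toy model of the Category 2 check sequence (illustrative only)."""

    def __init__(self, check):
        # check: hypothetical callable returning True when no fault is detected
        self._check = check
        self.state = State.SAFE

    def request_start(self):
        """Check at start-up and before each hazardous cycle."""
        if self.state is State.FAULT:
            return False  # safe state is maintained until the fault is cleared
        if self._check():
            self.state = State.RUNNING
            return True
        self.state = State.FAULT
        return False

    def periodic_check(self):
        """Optional check during operation, if the risk assessment calls for it."""
        if self.state is State.RUNNING and not self._check():
            self.state = State.FAULT

    def stop(self):
        """Normal stop: leave RUNNING without clearing any latched fault."""
        if self.state is State.RUNNING:
            self.state = State.SAFE

    def clear_fault(self):
        """Leave FAULT only once the check passes again."""
        if self.state is State.FAULT and self._check():
            self.state = State.SAFE
```

In this model a failed check latches the `FAULT` state, and further start requests are refused until `clear_fault()` is called with a passing check. A fault that occurs between two checks goes unnoticed until the next one, which is exactly the Category 2 limitation.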

Watch Out!

Notice that the words ‘whenever possible’ are used in the last paragraph in this part of the definition where the standard speaks about initiation of a safe state. This wording alludes to the fact that these systems are still prone to faults that can lead to the loss of the safety function, and so cannot be called truly ‘fault-tolerant’. Loss of the safety function must be detected by the monitoring system and a safe state initiated. This requires careful thought, since the safety system components may have to interact with the process control system to initiate and maintain the safe state in the event that the safety system itself has failed.

All of this leads to an interesting question: If the system is hardwired through the operating channel, and all the components used in that channel meet Category B & 1 requirements, can the diagnostic component be provided by monitoring the system with a standard PLC?

Unfortunately, the answer to this is NO. This is true because ALL of the components must meet the well-tried requirement, and since programmable electronics are specifically excluded from being considered well-tried, this approach cannot be used. Some North American standards are written so that this approach could be applied, but under the International and EU requirements it is not acceptable.

Finally, for the faults that can be detected by the monitoring system, detection of a fault must initiate a safe state. This means that on the next demand on the system, i.e. the next time the guard is opened or the emergency stop is pressed, the machine must go into a safe condition. Generally, detection of a fault should prevent the subsequent reset of the system until the fault is cleared or repaired.

Testing is not permitted to introduce any new hazards or to slow the system down. The tests must occur ‘on-the-fly’ and without introducing any delay in the system compared to how it would have operated without the testing incorporated. Test equipment can be integrated into the safety system or be external to it.

Watch Out!

Note 1 in the definition highlights a significant pitfall for many designers: if all of the components in the functional channel of the system cannot be checked, you cannot claim conformity to Category 2. A system that otherwise would meet the architectural requirements for Category 2 must be downgraded to Category 1 in cases where all the components in the functional channel cannot be tested. This is a major point and one which many designers miss when developing their systems.

Calculation of MTTFd

The next paragraph deals with the calculation of the failure rate of the system, or MTTFd.

For the designated architecture of category 2, as shown in Figure 10, the calculation of MTTFd and DCavg should take into account only the blocks of the functional channel (i.e. I, L and O in Figure 10) and not the blocks of the testing channel (i.e. TE and OTE in Figure 10).

Calculation of the failure rate focuses on the functional channel, not on the monitoring system, meaning that the failure rate of the monitoring system is ignored when analyzing systems using this architecture. The MTTFd of each component in the functional channel is calculated and then the MTTFd of the total channel is calculated.
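One common way to combine component values into a channel value is the parts count method, where the reciprocals of the component MTTFd values are summed. A rough Python sketch follows. The function names are mine, the component figures in the note after the code are invented for illustration, and the band limits (3, 10 and 30 years, with a 100 year cap per channel) are as I understand them from the standard, so check them against your edition.

```python
def channel_mttfd(component_mttfd_years, cap_years=100.0):
    """Parts count method: 1/MTTFd(channel) = sum of 1/MTTFd(component).

    component_mttfd_years: MTTFd of each functional-channel component, in years.
    The testing channel (TE, OTE) is deliberately left out.
    """
    if not component_mttfd_years:
        raise ValueError("at least one component is required")
    total_rate = sum(1.0 / years for years in component_mttfd_years)
    return min(1.0 / total_rate, cap_years)


def mttfd_band(years):
    """Classify a channel MTTFd into the bands used by the standard."""
    if years < 3:
        return "not acceptable"
    if years < 10:
        return "low"
    if years < 30:
        return "medium"
    return "high"
```

For a hypothetical channel with three components rated at 150, 150 and 75 years, `channel_mttfd([150, 150, 75])` works out to about 37.5 years, which `mttfd_band` classifies as high. Notice how the weakest component dominates the result.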

The Diagnostic Coverage (DCavg) is also calculated based exclusively on the components in the functional channel, so when determining what percentage of the faults can be detected by the monitoring equipment, only faults in the functional channel are considered.
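As I understand it, the average is weighted by failure rate, so a component with a short MTTFd counts for more than a very reliable one. The Python sketch below shows the arithmetic under that assumption. The function names and the example figures are hypothetical, and the band limits (60%, 90% and 99%) should be verified against the standard.

```python
def dc_avg(components):
    """Failure-rate-weighted average diagnostic coverage.

    components: list of (mttfd_years, dc) pairs for the functional channel,
    with dc given as a fraction between 0.0 and 1.0.
    """
    if not components:
        raise ValueError("at least one component is required")
    weighted = sum(dc / mttfd for mttfd, dc in components)
    total_rate = sum(1.0 / mttfd for mttfd, _ in components)
    return weighted / total_rate


def dc_band(dc):
    """Classify a DCavg fraction into none / low / medium / high."""
    if dc < 0.60:
        return "none"
    if dc < 0.90:
        return "low"
    if dc < 0.99:
        return "medium"
    return "high"
```

With made-up components of (100 years, 90%), (100 years, 90%) and (50 years, 60%), `dc_avg` gives 0.75, which falls in the low band even though two of the three components are individually monitored at 90%.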

This highlights the fact that a failure of the monitoring system cannot be detected, so a single failure in the monitoring system that results in the system failing to detect a subsequent normally detectable failure in the functional channel will result in the loss of the safety function.

Summing Up

The next paragraph sums up the limits of this particular architecture:

The diagnostic coverage (DCavg) of the total SRP/CS including fault-detection shall be low. The MTTFd of each channel shall be low-to-high, depending on the required performance level (PLr). Measures against CCF shall be applied (see Annex F).

The first sentence reflects back to the previous paragraph on diagnostic coverage, telling you, as the designer, that you cannot make a claim to anything more than LOW DC coverage when using this architecture.

This raises an interesting question, since Figure 5 in the standard shows columns for both DCavg = LOW and DCavg = MED. My best advice to you as a user of the standard is to abide by the text, meaning that you cannot claim higher than LOW for DCavg in this architecture.

Another problem raised by this sentence is the inclusion of the phrase “the total SRP/CS including fault-detection”, since the previous paragraph explicitly tells you that the assessment of DCavg ‘should’ only include the functional channel, while this sentence appears to include the testing channel as well. In standards writing, sentences including the word ‘shall’ are clearly mandatory, while those including the word ‘should’ indicate a condition which is advised but not required. Hopefully this confusion will be clarified in the next edition of the standard.

Failure rates in the functional channel can be anywhere in the range from LOW to HIGH depending on the components selected and the way they are applied in the design. The requirement will be driven by the desired PL of the system, so a PLd system will require HIGH MTTFd components in the functional channel, while the same architecture used for a PLa system would require only LOW MTTFd components.

Finally, applicable measures against Common Cause Failures (CCF) must be used. Some of the measures given in Table F.1 in Annex F of the standard cannot be applied, such as Channel Separation, since you cannot separate a single channel. Other CCF measures can and must be applied, so you must score at least the minimum of 65 on the CCF table in Annex F to claim compliance with Category 2 requirements.
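The scoring itself is simple addition: each measure you can fully claim contributes its points, and the total is compared with the minimum. Here is a small Python sketch of that bookkeeping. The measure names and point values in `example_claim` are placeholders I chose for illustration, so take the real ones from Table F.1 of your copy of the standard.

```python
CCF_MINIMUM = 65  # minimum score stated in the text above


def ccf_score(claimed_points):
    """Sum the points for every measure that is fully implemented."""
    return sum(claimed_points.values())


def ccf_passes(claimed_points, minimum=CCF_MINIMUM):
    """True when the claimed measures reach the minimum score."""
    return ccf_score(claimed_points) >= minimum


# Hypothetical claim for a single-channel design: placeholder points only.
example_claim = {
    "separation_between_channels": 0,   # not claimable with one channel
    "protection_against_overvoltage": 15,
    "well_tried_components": 5,
    "failure_mode_analysis": 5,
    "designer_training": 5,
    "emc_immunity": 25,
    "other_environmental_influences": 10,
}
```

With these placeholder values the claim only just reaches 65, which illustrates the point: once channel separation is off the table, there is very little room to drop any of the remaining measures.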

Example Circuit

Here’s an example of what a simple Category 2 circuit constructed from discrete components might look like. Note that PB1 and PB2 could just as easily be interlock switches on guard doors as push buttons on a control panel. For the sake of simplicity, I did not illustrate surge suppression on the relays, but you should include MOVs or RC suppressors across all relay coils. All relays are considered to be of ‘force-guided’ construction and to meet the requirements for well-tried components.

Example Category 2 circuit from discrete components

Here is how the circuit works:

  1. The machine is stopped with power off. CR1, CR2, CR3 and M are off.
  2. The reset push button, PB3, is pressed. If CR1, CR2 and M are all off, their normally closed contacts will be closed, so pressing PB3 will result in CR3 turning on.
  3. CR3 closes its contacts, energizing CR1 and CR2. These seal in through their own contacts and, by opening their normally closed contacts in the CR3 rung, de-energize CR3. The time delays inherent in relays permit this to work.
  4. With CR1 and CR2 closed and CR3 held off because its coil circuit opened when CR1 and CR2 turned on, M energizes and motion can start.

In this circuit the monitoring function is provided by CR3. If any of CR1, CR2 or M were to weld closed, CR3 could not energize, so a single fault is detected and the machine is prevented from restarting. Pressing either PB1 or PB2 stops the machine, since CR1 and CR2 are redundant. If CR3 fails to drop out, the M rung is held open because CR3 has not de-energized, preventing the machine from starting with a failed monitoring system. If CR1 or CR2 fails with an open coil, M cannot energize because of the redundant contacts on the M rung.
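If you want to experiment with the fault cases, the reset sequence can be simulated in a few lines of Python. This is only a sketch: the rung wiring is my reading of the description above (PB1 in the CR1 rung, PB2 in the CR2 rung, normally closed contacts of CR1, CR2 and M in the CR3 rung), and the function name `attempt_reset` and its parameters are invented for the example.

```python
RELAYS = ("CR1", "CR2", "CR3", "M")


def attempt_reset(welded=frozenset(), pb1_closed=True, pb2_closed=True):
    """Simulate one press of reset button PB3, starting from power-off.

    welded: names of relays whose contacts are stuck in the energized position.
    Returns a dict of final relay states (True = energized position).
    """
    # Force-guided contacts: a welded relay reads as "on" even with its coil off.
    state = {name: (name in welded) for name in RELAYS}

    def settle(name, coil_powered):
        state[name] = coil_powered or (name in welded)

    # PB3 pressed: CR3 rung is PB3 in series with the NC contacts of CR1, CR2, M.
    settle("CR3", not state["CR1"] and not state["CR2"] and not state["M"])

    # CR3 contact (or the relay's own seal-in contact) powers CR1 and CR2.
    cr3_contact = state["CR3"]
    settle("CR1", pb1_closed and (cr3_contact or state["CR1"]))
    settle("CR2", pb2_closed and (cr3_contact or state["CR2"]))

    # CR1/CR2 NC contacts now open the CR3 rung, so CR3 should drop out.
    settle("CR3", not state["CR1"] and not state["CR2"] and not state["M"])

    # M rung: CR1 NO and CR2 NO in series with CR3 NC.
    settle("M", state["CR1"] and state["CR2"] and not state["CR3"])
    return state
```

Calling `attempt_reset()` with no faults ends with CR1, CR2 and M energized and CR3 off. Passing `welded={"CR1"}`, `{"CR2"}` or `{"CR3"}` leaves M off, while `welded={"M"}` shows the awkward case: the reset is refused, but M itself is still stuck on, so the circuit cannot reach a safe state on its own.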

This circuit cannot detect a failure in PB1, PB2, or PB3. Testing is conducted each time the circuit is reset.

If M is a motor starter rather than the motor itself, it will need to be duplicated for redundancy and a monitoring contact added to the CR3 rung unless a reasonable case for fault exclusion can be made.

In calculating MTTFd, PB1, PB2, CR1, CR2, CR3 and M must be included. CR3 is included because it has a functional contact in the M rung and is therefore part of the functional channel of the circuit as well as being part of the testing channel (the TE and OTE blocks).


Watch for the next installment in this series where we’ll explore Category 3, the first of the ‘fault tolerant’ architectures!

Five things most machine builders do incorrectly

Friday, August 6th, 2010

The Top Five errors I see machine builders make on a depressingly regular basis:

1) Poor or Absent Risk Assessment

Risk assessments are fundamental to safe machine design and liability limitation, and are required by law in the EU. They are included in all of the modern North American machinery safety standards as well.

Machine builders frequently have trouble with the risk assessment process, usually because they fail to understand the process or because they fail to devote enough resources to getting it done.

If risk assessment is built into your design process, it becomes the norm for how you do business. Time and resources will automatically be devoted to the process, and since it’s part of how you do things it will become relatively painless. Where people go wrong is in making it a ‘big deal’ one-time event. Doing it early in the design process, and iterating as the design progresses, also means that you have time to react to the findings, and you can complete any necessary changes at more cost-effective points in the design and build process. The worst time to do risk assessment is at the point where the machine is on the shop floor ready to start production. Costs for modification are then far higher than during design and construction.

Poorly done, risk assessments become a liability defense lawyer’s worst nightmare and a plaintiff’s lawyer’s dream. Shortchanging the risk assessment process ensures that you will lose, either now or later.

Fight this problem by: learning how to conduct a risk assessment, using quality risk assessment software tools, and building risk assessment into your standard design process/practice in your organization.

2) Failure to be Aware of Regulations & Use Design Standards

This one is a mystery to me.

Every market has product safety legislation, supported by regulations. Granted, the scope and quality of these regulations varies widely, but if you want to sell a product in a market, it doesn’t take a lot of effort to find out what regulations may apply.

Design standards have been in existence for a long time. Most purchase orders, at least for custom machinery, contain lists of standards that the equipment is required to meet at Factory Acceptance Testing (FAT).

Why machine builders fail to grasp that using these standards can actually give them a competitive edge, as well as helping them to meet regulatory requirements, I don’t know. If you do, please either comment on this story or send me an email. I’d love to hear your thoughts on this!

Fight this problem by: Doing some research. Understand the market environment in which you sell your products. If you aren’t sure how to do this, use a consultant to assist you. Buy the standards, especially if your client calls them out in their specifications. Read and apply them to your designs.

One great resource for information on regulatory environments and standards applications is the IEEE Product Safety Engineering Society and the EMC-PSTC Listserv that they maintain.

3) Fixed Guard Design

Fixed guarding design is driven by at least two factors, a) preventing people from accessing hazards, and b) allowing raw materials and products into and out of the machinery.

Designers frequently go wrong by selecting a fixed guard where a movable guard is necessary to permit frequent access (say more than once per shift). This is sometimes done in an effort to avoid having to add interlocks to the control systems. Frequently the guard will be removed and replaced a couple of times, and then the screws will be left off, and eventually the guard itself will be left off, leaving the user with an unguarded hazard.

The other common fault with fixed guards relates to the second factor I mentioned: getting raw materials and products in and out of the machine. There are limits on the size of openings that can be left in guards, dependent on the distance from the opening to the hazards behind the guard and the size of the opening itself. Often the only factor considered is the size of the item that needs to enter or exit the machinery.

Both of these faults often occur because the guarding is not designed, but is allowed to happen during machine build. The size and shape of the guards is then often driven by convenience in fabrication rather than by thoughtful design and application of the minimum code requirements.

Fight this problem by: Designing the guards on your product rather than allowing them to happen, based on the outcome of the risk assessment and the limits defined in the standards. Tables for guard openings and safety distances are available in North American, EU and International standards.

4) Movable Guard Interlocking

Movable guards themselves are usually reasonably well done. Note that I am not talking about self-adjusting guards like those found on a table saw, for instance. I am talking about guard doors, gates, and covers.

The problem usually comes with the design of the interlock that is required to go with the movable guard. The first part of the problem goes back to my #1 mistake: Risk Assessment. No risk assessment means that you cannot reasonably hope to get the reliability requirements right for the interlocking system. Next, there are small but significant differences in how the Canadian, US, EU and International standards handle control reliability, and the biggest differences occur in the higher reliability classifications.

In the USA, the standards speak of control reliable circuits (see ANSI RIA R15.06-1999, 4.5.5). This requirement is written in such a way that a single interlocking device, installed with dual channel electrical circuits and suitably selected components will meet the requirements. No single ELECTRICAL component failure will lead to the loss of the safety function, but a single mechanical fault could.

In Canada, the machinery and robotics standards speak of control reliable systems (see CSA Z432, 8.2.5), not circuits as in the US standards. This requirement is written in such a way that TWO electromechanical interlocking devices are required, one in each electrical channel of the interlocking system. This permits the system to detect mechanical failures such as broken or missing keys, and if different types of interlocking devices are chosen, may also permit detection of efforts to bypass the interlock. Most single mechanical faults and electrical faults will be detected.

In the EU and Internationally, control reliability is much more highly developed. Here, the application of ISO 13849, IEC 62061 or IEC 61508 has taken control reliability to higher levels than anything seen to date in North America. Under these standards, the required Performance Level (PLr) or Safety Integrity Level (SIL) must be known. This is based on the outcome of, you guessed it, the Risk Assessment. No risk assessment, or a poor risk assessment, dooms the designer to likely failure. Significant skill is required to handle the analysis and design of safety related parts of control systems under these standards.

Fight this problem by: Getting the training you need to properly apply these standards and then using them in your designs.

5) Safety Distances

Safety distances crop up anywhere you don’t have a physical barrier keeping the user away from the hazard. Whether it’s an opening in a fixed guard, a movable guard like a guard door or gate, or a presence-sensing safeguarding device like a light curtain, safety distances have to be considered in the machine design. The easier it is for the user to come in contact with the hazard, the more safety distance matters.

Stopping performance of the machinery must be tested to validate the safety distances used. Failure to get the safety distance right means that your guards will give your users a false sense of security, and will expose them to injury. This will also expose your company to significant liability when someone gets hurt, because they will. It’s only a matter of time.
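As an illustration of why the measured stopping time matters, here is a rough Python sketch of the calculation in the general form used by ISO 13855, S = K × T + C, simplified to the single case of a vertical light curtain with a detection capability of 40 mm or less. The function name and the example numbers are mine, and the constants are as I understand them from that standard, so confirm them against the edition and the jurisdiction that apply to your machine.

```python
def min_safety_distance_mm(stop_time_s, resolution_mm=14.0):
    """Minimum distance from a vertical light curtain to the hazard, in mm.

    stop_time_s: overall system stopping time T, taken from stop-time testing.
    resolution_mm: detection capability d of the light curtain (d <= 40 mm).
    """
    if stop_time_s <= 0:
        raise ValueError("stopping time must be positive")
    if not 0 < resolution_mm <= 40:
        raise ValueError("this sketch only covers d <= 40 mm")

    # Intrusion allowance C = 8 * (d - 14), never less than zero.
    c = max(0.0, 8.0 * (resolution_mm - 14.0))

    # First pass with the hand-speed constant K = 2000 mm/s, minimum 100 mm.
    s = max(2000.0 * stop_time_s + c, 100.0)

    # If that exceeds 500 mm, recalculate with K = 1600 mm/s, minimum 500 mm.
    if s > 500.0:
        s = max(1600.0 * stop_time_s + c, 500.0)
    return s
```

With a measured stopping time of 0.2 s and a 14 mm light curtain this gives 400 mm. If the real machine takes 0.5 s to stop and the curtain resolution is 30 mm, the result grows to 928 mm, which is why a distance chosen without stop-time testing is so often wrong.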

Fight this problem by: Testing safeguarding devices.

6) Validation

OK, so this list should really be SIX things. Just consider this to be a bonus for reading this far!

Designs, and particularly safety critical designs, must be tested. Let me say it again:

Safety Critical Designs MUST Be Tested.

Whatever theory you are working under, whether it’s North American, European, International or something else, you cannot afford to miss the validation step. Without validation you have no evidence that your system works at all, let alone that it works correctly.

Fight this problem by: TESTING YOUR DESIGNS.

A wise man once said: “If you think safety is expensive, try having an accident.” The gentleman was involved in investigating the crash of a Sikorsky S-92 helicopter off the coast of Newfoundland. Seventeen people died as a result of the failure of two titanium studs that held an oil filter onto the main gearbox, and the fact that the helicopter failed the ‘1/2-hour gearbox run-dry test’ that is required for all new helicopter designs. This was a clear case of failure in the risk assessment process complicated by failure in the test process.

Watch the CBC documentary “Cougar 491”. This is definitely worth the time. If you are located outside Canada, you will have a problem with this link. Unfortunately, CBC does not stream its video outside Canada. Sorry.

