Analog Angle Article

When failure IS an option

(The original version of this column appeared in the August 21, 2006 issue of EE Times)

Prudent and responsible design means taking into account the potentially serious failure modes of your product and providing protection against them. It sounds straightforward, even unarguable.

But as with so many other aspects of design, it's much easier said than done. Designers must wrestle with all sorts of conflicting challenges as they try to protect the product against failures whose consequences range from mere embarrassment to the life-threatening, as the recent laptop battery recall demonstrates. After all, the average consumer does not associate laptop batteries with starting fires.

Most serious problems are not the result of just one thing going wrong, nor are they due to truly independent double failures, which are extremely rare. Instead, they occur when there is an unanticipated sequence of events: problem A causes problem B, or problem A goes undetected by circuit B.

The other challenge is that many failure modes are so rare that no practical amount of testing on a sample of units will reveal them. That's why cars and laptops get recalled years after they ship: the law of large numbers kicks in, and the installed base becomes such a large population of units under life test (admittedly, in users' hands) that the failures lurking in the far tail of the probability distribution finally show up. As they say, if something can go wrong, it eventually will, except that “eventually” can be a long time, and we don't know in advance what that “it” is.
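To put rough numbers on why sample testing misses rare failures, here is a short Python sketch. It assumes each unit fails independently with the same small probability p, so the chance that n units show at least one failure is 1 − (1 − p)^n. That independence is a simplification (the column itself notes that real failures often come from sequences of events), and the failure rate, sample size, and shipment volume below are hypothetical figures chosen for illustration, not data from any real product.

```python
import math


def prob_at_least_one(p_fail: float, units: int) -> float:
    """Probability that at least one of `units` independent units fails,
    given an assumed per-unit failure probability `p_fail`."""
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate when p is tiny
    return -math.expm1(units * math.log1p(-p_fail))


# Hypothetical numbers, for illustration only
P_FAIL = 1e-6            # assumed one-in-a-million per-unit failure probability
LAB_SAMPLE = 1_000       # assumed size of a qualification test sample
FIELD_UNITS = 5_000_000  # assumed number of units shipped

lab = prob_at_least_one(P_FAIL, LAB_SAMPLE)
field = prob_at_least_one(P_FAIL, FIELD_UNITS)
expected_field_failures = P_FAIL * FIELD_UNITS

print(f"Chance the lab sample shows even one failure: {lab:.4%}")
print(f"Chance the field shows at least one failure:  {field:.4%}")
print(f"Expected number of field failures:            {expected_field_failures:.1f}")
```

With these assumed figures, the lab sample has roughly a one-in-a-thousand chance of exposing the failure, while a few million shipped units make several field failures all but certain.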

To what extent do you take into account substandard, defective, or even counterfeit components? What if you are monitoring a counterfeit battery pack's temperature, the battery charges improperly, and its internal temperature sensor is also substandard and giving false readings? You can't ignore software-sourced failures, either: how do you guard against subtle, rarely seen problems caused by incorrect or unstated assumptions in the design phase, as well as outright coding mistakes?

Finally, there's the question that plagues all protection architectures: who watches the watchers? What if the protection circuitry itself is defective or fails, and gives no indication of this? Is redundant circuitry the answer? Should it be of the same type, or of a different design for further assurance? Or do you use three circuits monitoring in parallel, with some sort of voting mechanism, and what if that mechanism fails? Pretty soon, you're not doing engineering design; you're debating philosophy and issues of faith.
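The three-channel voting arrangement can be sketched as a 2-out-of-3 majority voter. In this hypothetical Python model, each channel compares its own temperature reading against an assumed 60 °C trip point, and the voter trips if at least two channels agree. It also flags any disagreement, since a channel that disagrees with the others may be a failed or counterfeit sensor. The names `TRIP_LIMIT_C`, `channel_trips`, `two_out_of_three`, and `channels_disagree` are illustrative only. Note that the voter function is itself still a single point of failure, which is exactly the "who watches the watchers" question; the sketch does not answer it.

```python
from typing import Sequence

TRIP_LIMIT_C = 60.0  # hypothetical over-temperature trip point, in deg C


def channel_trips(reading_c: float, limit_c: float = TRIP_LIMIT_C) -> bool:
    """One monitoring channel: True means 'this channel wants to shut down'."""
    return reading_c >= limit_c


def two_out_of_three(votes: Sequence[bool]) -> bool:
    """Majority voter: trip if at least two of the three channels agree."""
    if len(votes) != 3:
        raise ValueError("expected exactly three channel votes")
    return sum(votes) >= 2


def channels_disagree(votes: Sequence[bool]) -> bool:
    """Any disagreement is worth reporting: one of the watchers may have failed."""
    return len(set(votes)) > 1


# One sensor reads low (say, a substandard part); the other two read genuinely hot
readings = [72.0, 25.0, 71.5]
votes = [channel_trips(r) for r in readings]
print("trip:", two_out_of_three(votes), "| disagreement:", channels_disagree(votes))
```

Here the faulty low reading is outvoted, so the system still trips, and the disagreement flag gives the otherwise silent sensor failure a way to be noticed.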

The first line of defense is usually to use simple components, such as comparators, temperature sensors, and limit switches, to protect against failures in complex circuitry and software, but this may not be enough. Yet if you can't decide where to say “stop,” you'll end up with a mil/aerospace design saddled with a consumer pricing model.
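As a behavioral illustration of a simple protection element overriding complex control, the Python model below treats an over-temperature comparator as driving a latch that gates charging. The latch-until-reset behavior, the 60 °C threshold, and the names `OverTempLatch` and `charge_enabled` are assumptions for this sketch. In a real product this function would be built from a comparator and logic in hardware, precisely so that it does not depend on the software it is guarding; the code only models that behavior.

```python
from dataclasses import dataclass


@dataclass
class OverTempLatch:
    """Models a comparator driving a latch: once tripped, it stays tripped
    until an explicit reset, no matter what the controlling software requests."""
    limit_c: float = 60.0  # hypothetical comparator threshold, deg C
    tripped: bool = False

    def sample(self, temp_c: float) -> None:
        # The comparator can only set the latch; it never clears it
        if temp_c >= self.limit_c:
            self.tripped = True

    def reset(self, temp_c: float) -> bool:
        # Allow a reset only once the temperature is back below the limit
        if temp_c < self.limit_c:
            self.tripped = False
        return not self.tripped


def charge_enabled(software_request: bool, guard: OverTempLatch) -> bool:
    # AND-gate behavior: software may request charging, but the guard has the final say
    return software_request and not guard.tripped


guard = OverTempLatch()
for t in [35.0, 48.0, 63.0, 50.0]:  # temperature climbs past the limit, then falls
    guard.sample(t)
    print(f"{t:5.1f} C  charging: {charge_enabled(True, guard)}")
```

In the sample run, charging stops at 63 °C and stays off at 50 °C, because the latch holds until `reset()` is called, and this sketch permits a reset only once the temperature is back below the limit.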

Bill Schweber, Site Editor, Planet Analog
