When Things Get Very Complex, Is Six-Sigma Enough?

To follow up on one of my previous blogs (Component Count & Reliability), let's dig into the numbers a bit more. We'll use a simple example: consider a device that has just 1,000 things that can go wrong with it — not even as complex as a 4-bit MCU. Let's say that each thing has a 0.1 percent probability of going wrong per hour. What does this tell you?

Well, it tells you that, statistically, something will go wrong during every hour you use it! Obviously not a very happy experience for you if it were a WiFi router, cellphone, or any other modern convenience.
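The arithmetic behind that claim is worth spelling out. With 1,000 independent failure opportunities at 0.1 percent per hour each, the expected number of failures per hour is exactly one, and the probability of at least one failure in any given hour is about 63 percent. A minimal sketch, assuming the failures are independent:

```python
n_items = 1000          # independent things that can go wrong
p_fail = 0.001          # 0.1% probability of failure per item, per hour

# Expected number of failures in one hour of use
expected_failures = n_items * p_fail            # = 1.0

# Probability that at least one thing goes wrong in that hour
p_any = 1.0 - (1.0 - p_fail) ** n_items         # ~= 0.632

print(f"expected failures per hour: {expected_failures}")
print(f"P(at least one failure per hour): {p_any:.3f}")
```

Note that the two numbers differ: the expectation is 1.0, but some hours see zero failures and some see several, which is why the per-hour probability of trouble comes out near 1 - 1/e rather than 100 percent.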

We still often see job ads for six-sigma quality people on sites like Monster or LinkedIn; these mostly have to do with manufacturing products. In a previous job, I was in charge of support engineering for wireless phone production, on a line that was turning out 200k phones a month (a slow mover). The design was manufactured to seven-sigma. First-pass yield at final test was nothing like seven-sigma, and fall-out plus early returns usually ran in the double-digit percentages. Why is this?
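Part of the answer is multiplication. Using the textbook six-sigma figure of roughly 3.4 defects per million opportunities (a convention that builds in a 1.5-sigma process shift), the first-pass yield of a product with n independent defect opportunities is (1 - p)^n. A minimal sketch with illustrative opportunity counts:

```python
p_defect = 3.4e-6   # conventional "six sigma" defect rate per opportunity

for n_opportunities in (1_000, 100_000, 10_000_000):
    # First-pass yield: probability that every opportunity comes out defect-free
    fpy = (1.0 - p_defect) ** n_opportunities
    print(f"{n_opportunities:>12,} opportunities -> first-pass yield {fpy:.6%}")
```

Even at a genuine six-sigma defect rate per opportunity, a product with millions of opportunities has essentially zero first-pass yield, which is consistent with double-digit fall-out on a complex phone.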

The design combined a digital ASIC (two fast, low-power DSPs and a fast ARM processor, plus a large amount of cache and RAM, and almost as much flash as the address space of a high-end PC of the late 1980s) with complex mixed-signal ASICs for RF and baseband. It was a very complex product boiled down to about half a dozen ICs and RF semiconductors.

We got many chipsets that would pass all tests, and even V&V (verification and validation) qualification, and then fail one function or another in a full-up phone. These were escapes from all the silicon vendors' testing: we would re-ball the chips, put them back in the problem phone, and that one function still would not work. The failing function was always exercised in some way that was different from how the ATE did it and how the qualification test did it. Thousands of prototypes had been built, tested, and alpha- and beta-tested all over the country, yet they failed to uncover this small flood of “ankle biter” problem phones that would come across my desk.

Modern FPGAs and SoCs exceed this complexity. Very complex mixed-signal ASICs and SoCs are on the market in high-volume products, and modern PC processors exceed one billion transistors! ICs aren't built on perfect processes in perfect clean rooms; the processes are good enough to do the job well, so yield is not too bad at wafer probe and at final test, and they are an economical solution in a high-volume market.

In making an IC, the process can drift or move toward the marginal side, driving yield down and escapes up. Impurities are also always present in the clean room. An IC is really like a 20-layer circuit board sitting on top of a host of components formed in the board material. Any little speck of stuff in the clean-room air can narrow a line, widen a line, or weaken a via and drive up its resistance. Impurities can also alter the N and P doping in the features, causing a device to switch poorly in a digital gate, or altering the parameters of an analog feature, among other “bad” things.

Given a billion transistors, and about three billion interconnects, in a PC processor, is nine-sigma even enough, when each of these four billion items has many parameters that can fail in many different ways? What is your experience? Some companies will even sell you, for less money, chips that don't pass all the tests, as long as they work with your present version of code. What is your take on all this six-sigma stuff?
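To put rough numbers on the nine-sigma question: the sketch below maps a sigma level to a per-item defect rate using the conventional 1.5-sigma shift and a one-sided normal tail, then applies that rate across four billion items. This is a deliberate simplification (independent items, one failure mode each, unlike the many-parameter reality described above):

```python
import math

def defect_rate(sigma_level, shift=1.5):
    """Per-opportunity defect rate for a given sigma level: the one-sided
    normal tail beyond (sigma_level - shift) standard deviations."""
    z = sigma_level - shift
    return 0.5 * math.erfc(z / math.sqrt(2.0))

n = 4_000_000_000  # ~1e9 transistors plus ~3e9 interconnects
for sigma in (6, 7, 8, 9):
    p = defect_rate(sigma)
    yield_est = math.exp(-n * p)   # Poisson approximation of (1 - p)**n
    print(f"{sigma}-sigma: p = {p:.2e} per item, chip-level yield = {yield_est:.4f}")
```

Under these assumptions, six- and seven-sigma give essentially zero yield, eight-sigma still loses roughly 15 percent of chips, and nine-sigma only just suffices. With many failable parameters per item, the effective opportunity count is far larger still.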

8 comments on “When Things Get Very Complex, Is Six-Sigma Enough?”

  1. BillWM
    July 16, 2013

    Standards like DO-254, DO-178B, FAA 8110.105, etc. require that one meet a 10^-9 per-hour probability for loss of the aircraft for each item in a system fault tree. The issue is that if the system is so complex that there are hundreds of thousands of different “ankle biter” things that can each cause this at low probability, the aircraft can be lost more often than 10^-9 per hour due to the sheer complexity of the design.
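The point in this comment comes down to back-of-the-envelope arithmetic: for rare, independent failure modes, the system-level loss rate is approximately the sum of the individual rates, so modes far below 10^-9 each can still add up past the budget. The counts below are made-up illustrative numbers, not figures from any standard:

```python
n_faults = 200_000   # hypothetical count of distinct low-probability failure modes
p_each = 1e-13       # hypothetical per-hour probability of each individual mode

# For rare independent events, the system-level rate is approximately the sum
p_system = n_faults * p_each
print(f"system loss rate = {p_system:.1e} per hour")
```

With these illustrative values, the combined rate is 2 x 10^-8 per hour, twenty times worse than a 10^-9 budget, even though every individual mode is four orders of magnitude below it.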

  2. BillWM
    July 17, 2013

    Often, full-up SPICE simulations of SoC (system-on-chip) ICs cannot be done due to the complexity. A very real issue is that at today's clock rates and gate counts the power rails do not behave perfectly, and a “digital” simulation is based on SPICE analysis of smaller sections of the design. Rare current “slug” events on power rails that are marginal in some way are difficult to find even in board-level V&V, as a reference design may not perfectly match a customer's application for a complex device.

    The other item to consider is that the IC test equipment has a special socket and a very robust board layout for production test, whereas the board in the field may be shaved down to 4 layers from the 10-20 in the test station. All of this can cause returns of SoCs and other complex devices that are “analog” failures more than “digital” failures, and DO-254/DO-178B systems do not run the ATE code, nor do they run the full qualification test on every unit shipped.

    Manufacturers try to set margin in manufacturing test, but often there is a good, a maybe, a bad, and an ugly bin in the production test flow. The good ones are saved for the good customers; the maybe ones may be shipped somewhere where they drive wooden trucks and don't mind so much; the bad are fixed and retested. The ugly are beyond repair.

  3. Brad_Albing
    July 18, 2013

    Jeeze – you're making me want to not fly.

  4. BillWM
    July 18, 2013

    Only fly on the “Good” ones

  5. Brad_Albing
    July 18, 2013

    Of course – I should have thought of that.

  6. Brad_Albing
    July 18, 2013

    I think that's a selection on Expedia: “Safe Plane – check Y or N”

  7. Brad_Albing
    July 22, 2013

    @WM – shipped somewhere where they drive wooden trucks – I'm stealing that line and using it. Thanks.

  8. nanonical
    October 16, 2014

    Barring component failures, often it's noise. What I've learned as a better way in the design phase is:

    Extract steady-state current noise for (more manageable) individual IC blocks on a rail, and convolve the blocks together. Do this for more than 10 vectors of sufficient length. Next, take the convolution and get an estimated PSD; standard methods can then estimate the +3-sigma peak. Scale the convolution result up to this. This avoids lengthy (5+ day) sims for the IC or the total PDN. Run a post-layout IC sim using the result to get the failure rate, and iterate the IC if necessary. This also gets the bump and ball Vnoise spec.

    Next, do the same “standard methods” with the PCB, including dcaps, no power filter, and the VRM at ground, using the peak Inoise from above. This is the PDN result. Tune the PCB layout and dcaps to meet the budget. Last, run a separate PCB time-domain sim using the IC's true current transients at the balls, including the VRM filter, the VRM model, and other VRM-side noise sources; this gets the true transient cases. Look at the resulting ball Vnoise versus the IC transient budget. Separately, package and PCB layout density can force DC analysis, especially in the PCB case with multiple ICs in a tight space and on a rail with one VRM sense.

    The above isn't done, in my experience. Worse, whatever's done is close to tapeout. They never learn. Too busy trying to sell jelly beans instead of chips. Belief in the magic transformation of chicken poop into chicken salad. To make better yield and a reliable spec, a comprehensive statistical approach is needed in each domain.
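One reading of the first step in the comment above (convolving per-block current distributions on a shared rail, then taking a +3-sigma peak) can be sketched with NumPy. The block means and sigmas here are invented placeholder numbers; a real flow would use histograms from block-level simulation or measurement:

```python
import numpy as np

dI = 1e-3  # current grid step, amps

def block_pdf(mean_a, std_a, n_bins=400):
    """Discrete PDF of one block's supply current on a uniform grid,
    modeled here as a sampled Gaussian for illustration."""
    grid = np.arange(n_bins) * dI
    pdf = np.exp(-0.5 * ((grid - mean_a) / std_a) ** 2)
    return pdf / pdf.sum()

# Three hypothetical blocks sharing one rail (placeholder numbers)
blocks = [block_pdf(0.10, 0.010),
          block_pdf(0.05, 0.005),
          block_pdf(0.20, 0.020)]

# PDF of the total rail current = convolution of the independent block PDFs
total = blocks[0]
for p in blocks[1:]:
    total = np.convolve(total, p)

grid = np.arange(total.size) * dI
mean = float((grid * total).sum())
std = float(np.sqrt(((grid - mean) ** 2 * total).sum()))
peak_3sigma = mean + 3.0 * std
print(f"rail current: mean {mean*1e3:.1f} mA, +3-sigma peak {peak_3sigma*1e3:.1f} mA")
```

The convolution gives the distribution of the summed block currents directly, so the +3-sigma figure reflects how independent block variations combine (in root-sum-square fashion) rather than a pessimistic sum of per-block worst cases.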
