Archive for root cause failure analysis

FMEA or Failure Analysis?

Posted in Uncategorized on December 22, 2016 by manufacturingtraining
A recent FMEA training program presented to defense industry engineers in Singapore

A question we frequently hear from clients and potential clients is this:   Should we train our engineers in FMEA or failure analysis?   It’s a good question.  Let’s consider each.

FMEA is an abbreviation for Failure Modes and Effects Analysis.   It’s also referred to as FMECA (an abbreviation for Failure Modes, Effects, and Criticality Analysis).   The two (FMEA and FMECA) are the same thing.   An FMEA (or FMECA) is usually prepared during the product or process development and implementation phase.  It is designed to identify all possible failure modes of each part of the product or process, the effects of the failure, the criticality of the failure, and actions that can be taken to prevent occurrence.   These analyses tend to be more general in nature than a root cause failure analysis (which I’ll explain in a moment).   The idea is to identify what could go wrong (before failures occur), and incorporate appropriate actions to prevent failures from occurring.   FMEA is primarily a risk management tool (and it’s a good one).  We’ll have another blog up soon focused specifically on FMEA preparation, so keep an eye on this site.

Our approach for FMEA/FMECA preparation provides a comprehensive and quantitative risk management analysis. The approach identifies key risks and suggested risk mitigation measures, along with a mean time between failures prediction.

Root cause failure analysis is a more focused discipline applied once a failure has occurred. Its purpose is to identify the causes of a specific failure (again, a failure that has already occurred or is recurring in a product or manufacturing process).  The intent is to define the failure, identify all possible causes, objectively and systematically evaluate each potential cause, converge on the most likely causes, and then implement appropriate corrective and preventive actions.   Many folks think engineers automatically (by virtue of their technical background and training) know how to analyze failures; anyone who runs a manufacturing or development effort knows this is not the case.  Failure analysis is not always intuitive.  It has to be done in a systematic, objective, and rapid manner to identify all potential failure causes for a specific failure, and then rapidly bore in on the actual cause.   Failure to take this comprehensive approach is the primary reason many failures recur.

Systems Failure Analysis, the best book of its kind for guidance in organizing and managing root cause failure analyses in complex systems and processes.

Engineers receive no training during their engineering undergraduate or graduate education in this critical area, and most engineers don’t know how to identify and correct root causes.  We’ve taught our comprehensive root cause failure analysis training program to many companies and military organizations in Israel, Turkey, Canada, Mexico, China, Thailand, Singapore, Barbados, and the United States, and our Systems Failure Analysis text is recognized as the source document for analyzing complex system failures.

Both technologies (FMEA and root cause failure analysis) are critically important. We offer focused onsite training in both areas. We tailor our training to your specific needs; we’ll never ask you to make your needs fit our solution. Feel free to contact us for more information by calling 909 204 9984.

Meaningful Manufacturing Metrics

Posted in Manufacturing Improvement on June 29, 2014 by manufacturingtraining

A question I frequently hear from clients is this:

     What metrics should I use for managing manufacturing?

The answer depends on the nature of your business.   Whatever you do, though, your metrics should meet these paramount requirements:

  • Your metrics should convey an honest sense of how the business is doing at a level that can be influenced by those who see the metric.
  • Your metrics should be posted where the performance is being measured.
  • The meaning of your metrics should be clear (simple is better).
  • Your metrics should be prepared by the folks doing the work.

What I usually see in client facilities are artfully-crafted Excel or PowerPoint plots that purport to show company performance. It’s always at the company level, and the chartsmanship is always impressive. Not the contents or the information contained in the charts, mind you, but the charts are beautiful. There must be an army of folks out there earning good livings churning out charts using everything MSOffice has to offer. No kidding…the charts are awesome. There are usually lots of charts in a central location (it seems like there are always more than a dozen, sometimes many more than that). Like I said, they’re beautiful…a true testament to the capabilities of Excel and PowerPoint.

Usually, I’m the only one examining the MSOffice artistry…I never see anyone else examining them.   If you’re smiling while visualizing this image and my comments, consider this:   I often stop the next person who walks by (it doesn’t matter if it’s the CEO or a machine operator) and I ask this question:   What do the charts mean?   If it’s a productivity chart, I ask how it’s calculated.   If it’s on time delivery performance, I’ll ask how they measure it.  I can pick any chart on the wall, and after an embarrassed silence, the response is always the same:   I’m not sure.

I’m going to suggest just three metrics that I know will make a difference in your organization’s profitability and on time delivery performance:

  • Shipments Against Plan
  • Percent of Work Orders Completed On Time
  • MRB Aging

Let’s consider each of these.

Shipments Against Plan

The first one, shipments against plan, is a monthly x-y plot that shows a cumulative shipping plan (in dollars) for the month, with another line showing actual shipments (again, in dollars).   Here’s what it looks like:


The beauty of the above metric lies in several areas:

  • If the product area manager prepares it (and I’ve always made that the case in any manufacturing organization I’ve ever managed), they know every day where the shipments are with respect to the contract due date. They don’t have to wait until the end of the month to find out where they are.
  • If you base the dollars on the product values and their contract due dates, you get a true plan of what the shipments (both planned and actual) should look like. I often hear manufacturers claim they have to plan by revenue rather than the product due dates, but that’s a mistake from several perspectives (and it will be the subject of a future blog entry). A bit of a prelude on that one: If you start pulling in anything you can to make the monthly sales figure (i.e., ship product earlier because it’s closer to being ready to ship than what is actually due), you’re doing serious damage to next month’s shipping schedule. Like I said, more on this topic later.
  • If you put this metric in the factory in the final assembly area (and especially if the product manufacturing manager is located in this area), everyone sees exactly where the company is.
  • It avoids the typical “hockey stick” shipping profile, where little goes out of the factory during the first three weeks of the month, and there’s a mad dash to ship everything during the last week of the month.
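
The chart described above is just two cumulative dollar totals plotted by day. Here is a minimal Python sketch of the arithmetic behind it, assuming each planned order is placed on its contract due date and each shipment on the day it left the dock. The function names and the sample dollar figures are hypothetical, made up for illustration only.

```python
from itertools import accumulate


def cumulative_by_day(entries, days_in_month):
    """Return cumulative dollars through each day of the month.

    entries: iterable of (day_of_month, dollars) pairs.
    """
    daily = [0.0] * days_in_month
    for day, dollars in entries:
        daily[day - 1] += dollars
    return list(accumulate(daily))


def shipments_against_plan(plan, actual, days_in_month, today):
    """Return (day, plan_to_date, shipped_to_date, variance) for days 1..today."""
    plan_cum = cumulative_by_day(plan, days_in_month)
    actual_cum = cumulative_by_day(actual, days_in_month)
    return [
        (day, plan_cum[day - 1], actual_cum[day - 1],
         actual_cum[day - 1] - plan_cum[day - 1])
        for day in range(1, today + 1)
    ]


# Hypothetical month: plan dollars sit on each order's contract due date.
plan = [(5, 40_000), (12, 25_000), (20, 60_000), (28, 30_000)]
# Hypothetical shipments so far: (ship day, dollars).
actual = [(6, 40_000), (12, 25_000)]

for day, planned, shipped, variance in shipments_against_plan(plan, actual, 30, 14):
    print(f"day {day:2d}  plan ${planned:>9,.0f}  "
          f"actual ${shipped:>9,.0f}  variance ${variance:>+10,.0f}")
```

A negative variance on any day means shipments are behind the contract due dates on that day, which is the early warning the product area manager is looking for.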

Percent Of Work Orders Completed On Time

This is another simple chart, and it’s one that should be prepared for and prominently displayed in every work center in your factory.  It looks like this:


The premise here is that something or someone assigns work orders to each work center, and that the assignment includes a required completion date.   In companies with an MRP or ERP system, it’s usually called the “dispatch report” or the “to do” list.  It almost goes without saying, but I’ll say it anyway:  If the company is to deliver its products on time, each work center must strive to complete its dispatch-report-assigned work orders on time.

The metric here is simple:  It just shows the percent of work orders the work center completes on schedule each day.    The work center supervisor should prepare it at the end of the day, and post it in a prominent location so the folks assigned to each work center know how they’re doing.   It used to take me no more than 5 minutes to do this.  It was the essence of what I was being paid to do (manage the work center).

The beauty of this metric is that it is simple (everyone in the work center will understand it), it only takes a few minutes each day to prepare, and it naturally encourages the work center to improve performance.   Hitting that 100% on time in the work center is manageable and achievable.
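
For anyone who wants to pull this number from a dispatch report export rather than tally it by hand, here is a small Python sketch. It assumes the metric is computed over the work orders due on the report day, and that "on time" means completed on or before the due date; both are my assumptions about the bookkeeping, and the record layout (`due`, `completed`) is hypothetical.

```python
from datetime import date


def percent_on_time(work_orders, report_day):
    """Percent of work orders due on report_day that were completed on time.

    work_orders: list of dicts with a 'due' date and a 'completed' date
    (None if the work order is still open). Returns None if nothing was due.
    """
    due = [wo for wo in work_orders if wo["due"] == report_day]
    if not due:
        return None
    on_time = sum(
        1 for wo in due
        if wo["completed"] is not None and wo["completed"] <= wo["due"]
    )
    return 100.0 * on_time / len(due)


# Hypothetical dispatch report for one work center.
today = date(2014, 6, 27)
orders = [
    {"wo": "WO-101", "due": today, "completed": date(2014, 6, 26)},
    {"wo": "WO-102", "due": today, "completed": today},
    {"wo": "WO-103", "due": today, "completed": today},
    {"wo": "WO-104", "due": today, "completed": None},  # still open
]
print(f"{percent_on_time(orders, today):.0f}% of work orders completed on time")
```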

Sometimes I hear folks tell me this:  We can’t do this because everything in the work center is late, so there’s no way we can hit 100%.  If that’s the case in any of your work centers, you need to replan the work.   That’s important for several reasons, the most significant of which is that the master production schedule should define who needs to do what and by when.  If your master production schedule doesn’t assign the work order completion dates and everything (or nearly everything) in the work center is late, the folks in the work center will decide which jobs they work.   That’s not a formula for success.

MRB Aging

The exhortations about 6 Sigma and other management fads du jour notwithstanding, anyone who’s ever worked in a manufacturing company knows that nonconformances occur. Yes, we want robust processes and we’d like to have zero defects, but I’ve never been in a factory that doesn’t experience rejections (and I’ve been in a lot of factories). What governs our success is how we respond to them.

When items are rejected, they enter a material review process that determines nonconformance disposition:  Should the nonconforming item be scrapped, reworked, repaired, or used as is?

The above is interesting and you could write a book about the nuances associated with managing nonconforming material (I know because I actually did write a book that addresses this topic).   In my experience, strong root cause corrective action is essential for the obvious reasons (please see our Root Cause Failure Analysis training program), and so is rapid nonconformance disposition for a less obvious reason I’ll get to in a second.   Root cause failure analysis means finding out why the nonconformance occurred and taking steps to preclude recurrence.

Nonconformance disposition means what we do with the nonconforming item, and whatever we do, it’s important that we do it quickly. Very quickly, in fact. From a delivery performance perspective here’s a little known fact with a huge impact: Stuff in MRB is invisible to MRP. The MRP system thinks the rejected items in MRB are still available. What that means to us is this: When rejected items languish in MRB, they interfere with on time deliveries. Items in MRB need to be dispositioned rapidly. In the plants I’ve managed, I’ve put that limit at 1 day. I’ve scrapped stuff that was hanging around too long even if it could be reworked. My reasoning was that I was in a better position letting MRP know the material was gone so we could get on with fabricating replacement material. It drove the Materials folks nuts, but I only had to do it a couple of times before they became world-class proponents of dispositioning rejected material in less than a day.

This brings us to the third metric, and that’s a simple list of what’s in MRB, with a requirement that anything in there for more than 24 hours be highlighted in red. You can set up an Excel spreadsheet with conditional formatting to compare the time something entered MRB to the current time, and highlight it automatically if it goes over 1 day. I posted that list outside the MRB bond area so that anyone walking by the area (which always included me at least once daily) could immediately see if things were growing whiskers in there. It worked well.
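
If you would rather script the list than build it in Excel, the same 24-hour rule can be sketched in a few lines of Python. The part numbers and timestamps below are hypothetical, and the "OVERDUE" marker stands in for the red highlighting.

```python
from datetime import datetime, timedelta

MRB_LIMIT = timedelta(hours=24)  # the 1-day disposition limit described above


def mrb_aging(items, now):
    """Return (part, age, overdue) rows, oldest first.

    items: iterable of (part_number, time_entered_mrb) pairs.
    """
    rows = []
    for part, entered in items:
        age = now - entered
        rows.append((part, age, age > MRB_LIMIT))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Hypothetical MRB contents.
now = datetime(2014, 6, 27, 15, 0)
in_mrb = [
    ("PN-1001", datetime(2014, 6, 27, 10, 0)),  # 5 hours in MRB
    ("PN-2002", datetime(2014, 6, 26, 9, 0)),   # 30 hours in MRB
]
for part, age, overdue in mrb_aging(in_mrb, now):
    hours = age.total_seconds() / 3600
    print(f"{part}  {hours:5.1f} h in MRB  {'OVERDUE' if overdue else ''}")
```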

MRB Aging

If your company is not delivering on schedule, the above metrics will have a rapid impact on highlighting where it hurts and where the improvement opportunities lie.   It’s a great start at putting control of the factory in the hands of the folks who can make a difference in how your plant performs.   There’s much more to getting on schedule and staying there, of course, but the above is a good start.

If you’d like to learn more about on time delivery performance, please pick up a copy of Manufacturing Delivery Performance Improvement, available from Amazon.


If you have any questions or suggestions, please give us a call at 909 204 9984; we’d love to hear from you.



Applying Taguchi to Load Development

Posted in Manufacturing Improvement on August 4, 2013 by manufacturingtraining

This blog entry describes the application of the Taguchi design of experiments technique to .45 ACP load development in a Smith and Wesson Model 25 revolver.


Taguchi testing is an ANOVA-based approach that allows evaluating the impact of several variables simultaneously while minimizing sample size.  This is a powerful technique because it allows identifying which factors are statistically significant and which are not.   We are interested in both from the perspective of their influence on an output parameter of concern.

Both categories of factors are good things to know.  If we know which factors are significant, we can control them to achieve a desired output.   If we know which factors are not significant, it means they require less control (thereby offering cost reduction opportunities).

The output parameter of concern in this experiment is accuracy.   When performing a Taguchi test, the output parameter must be quantifiable, and this experiment provides this by measuring group size.   The input factors under evaluation include propellant type, propellant charge, primer type, bullet weight, brass type, bullet seating depth, and bullet crimp.  These factors were arranged in a standard Taguchi L8 orthogonal array as shown below (along with the results):


As the above table shows, three sets of data were collected.  We tested each load configuration three times (Groups A, B, and C) and we measured the group size for each 3-shot group.
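
For readers who have not seen one, the standard two-level L8 orthogonal array has eight test rows and seven columns, so seven factors can be screened in eight load configurations. The Python sketch below lists that standard array and shows how the average group size at each factor level would be computed. Which factor sits in which column is my assumption (any assignment works for a screening experiment like this), and no measured group sizes are included here.

```python
from itertools import combinations

# Standard Taguchi L8 array: 8 trials x 7 two-level columns (levels 1 and 2).
L8 = [
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 2, 2, 2, 2],
    [1, 2, 2, 1, 1, 2, 2],
    [1, 2, 2, 2, 2, 1, 1],
    [2, 1, 2, 1, 2, 1, 2],
    [2, 1, 2, 2, 1, 2, 1],
    [2, 2, 1, 1, 2, 2, 1],
    [2, 2, 1, 2, 1, 1, 2],
]

# Column assignment is an assumption made for illustration.
FACTORS = [
    "propellant type", "propellant charge", "primer type", "bullet weight",
    "brass type", "bullet seating depth", "bullet crimp",
]


def is_orthogonal(array):
    """True if every pair of columns contains each level pair equally often."""
    runs_per_combo = len(array) // 4
    for a, b in combinations(range(len(array[0])), 2):
        counts = {}
        for row in array:
            key = (row[a], row[b])
            counts[key] = counts.get(key, 0) + 1
        if len(counts) != 4 or any(c != runs_per_combo for c in counts.values()):
            return False
    return True


def main_effects(group_sizes):
    """Average response at level 1 and level 2 of each factor.

    group_sizes: 8 lists (one per L8 row) of measured group sizes for that load.
    """
    trial_means = [sum(g) / len(g) for g in group_sizes]
    effects = {}
    for col, factor in enumerate(FACTORS):
        level1 = [m for row, m in zip(L8, trial_means) if row[col] == 1]
        level2 = [m for row, m in zip(L8, trial_means) if row[col] == 2]
        effects[factor] = (sum(level1) / 4, sum(level2) / 4)
    return effects


assert is_orthogonal(L8)
```

Passing the three group sizes for each of the eight loads to `main_effects` gives, for every factor, the average group size at each of its two levels. The factor with the largest gap between its two averages is the one with the largest apparent influence.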

After accomplishing the above, we prepared the standard Taguchi ANOVA evaluation to assess which of the above input factors most influenced accuracy:


The above results suggest that crimp (or lack thereof) has the greatest effect on accuracy.   The results indicate that rounds with no crimp are more accurate than rounds with the bullet crimped.

We can’t simply stop here, though.  We have to assess if the results are statistically significant.   Doing so requires performing an ANOVA on the crimp versus no crimp results.  Using Excel’s data analysis feature (the f-test for two samples) on the crimp-vs-no-crimp results shows the following:


Since the calculated f-ratio (3.817) does not exceed the critical f-ratio (5.391), we cannot conclude that the findings are statistically significant at the 90% confidence level.  If we allow a lower confidence level (80%), the results are statistically significant, but we usually would like at least a 90% confidence level for such conclusions.
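
The decision rule in the paragraph above is simply a comparison of two numbers. The Python sketch below shows a generic one-way ANOVA f-ratio calculation and that comparison. It is not a reproduction of the Excel analysis (whose inputs are not shown here), and the critical value is passed in rather than computed because it depends on the degrees of freedom and the confidence level and would normally come from an F table or a statistics package.

```python
def one_way_anova_f(*groups):
    """Return (f_ratio, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


def is_significant(f_ratio, f_critical):
    """The effect is called significant only if the f-ratio exceeds the critical value."""
    return f_ratio > f_critical


# The two values reported above: calculated 3.817 versus critical 5.391 (90% confidence).
print(is_significant(3.817, 5.391))
```

With the values reported above, the comparison comes out false, which is the "not statistically significant at 90%" conclusion.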

So what does all the above mean?   Here are our conclusions from this experiment:

  • This particular revolver shoots any of the loads tested extremely well.  Many of the groups (all fired at a range of 50 feet) were well under an inch.
  • Shooter error (i.e., inaccuracies resulting from the shooter’s unsteadiness) overpowers any of the factors evaluated in this experiment.

Although the test shows that the results are not statistically significant, this is good information to know.  What it means is that any of the test loads can be used with good accuracy (as stated above, this revolver is accurate with any of the loads tested).  It suggests (but does not confirm to a 90% confidence level) that absence of a bullet crimp will result in greater accuracy.

The parallels to design and process challenges are obvious. We can use the Taguchi technique to identify which factors are critical so that we can control them to achieve desired product or process performance requirements. As significantly, Taguchi testing also shows which factors are not critical. Knowing this offers cost reduction opportunities because we can relax tolerances, controls, and other considerations in these areas without influencing product or process performance.

If you’d like to learn more about Taguchi testing and how it can be applied to your products or processes, please consider purchasing Quality Management for the Technology Sector, a book that includes a detailed discussion of this fascinating technology.

And if you’d like a more in-depth exposure to these technologies, please contact us for a workshop tailored to your needs.

A Couple of Great Books

Posted in Manufacturing Improvement on August 25, 2012 by manufacturingtraining

I’ve recently read a couple of great books that I think should be required reading for anyone working in the manufacturing or engineering world.   One of these is Car Guys versus Bean Counters by Bob Lutz, a book I featured a few weeks ago in the California Scooter blog (it’s a blog I write for CSC Motorcycles, one of my clients).  With your permission, I’ll repeat part of that blog entry here.  The other book is The Gun, by C.J. Chivers.   I’ll get to that one a few paragraphs down.

I bought the Lutz book a few months ago when I saw it in an airport while I was on my way to Thailand to present a Manufacturing Leadership course.  Bob Lutz is a certifiable gearhead with the credentials and experience to back it up…he’s held very senior positions with Ford, BMW, Chrysler, and General Motors.  The book is mostly about GM, a company that rehired Lutz to help the company find its way again…which is another way of saying that Lutz’s new job was to conceive, develop, and make GM cars people would want.

A bit of history on this first…in the 1950s and 1960s, GM was ahead of the world in producing exciting cars. Think 1955 Chevys, the Pontiac GTO, the Corvette, the Olds Toronado and Cadillac Eldorado, the 1959 Coupe de Ville, the SS 396 Chevelle, the El Camino, the Camaro, the Buick Riviera, and, well, you get the idea. It was the golden age for American automobiles and GM was at the top of the heap. Then the company lost its way, and the cars GM cranked out in the mid-70s and beyond were just awful.

Lutz explains that the reason GM fell from glory was not just the financial folks (the “bean counters” of the book’s title), but its pre-occupation with committee-based design efforts that bred a culture of mediocrity.  He makes a strong case for strong-willed leaders who design cars based on their instincts and a connection with the product, not what cost reduction, producibility, and all of the other “ilities” committees will approve.

The good news is that GM is on the way back, and I think you can see that in their new cars. I especially like what’s being offered by Cadillac and Chevy.   I drive a Z-06 and in my opinion there’s nothing more exciting.  It’s American made and it has the right style and sound.

The next book, one I have even stronger feelings about, is The Gun, by C.J. Chivers. I was surprised that I hadn’t heard of this book before when I read a review in the New York Times. The New York Times is about as left-leaning a rag as ever existed and it had high praise for The Gun. I reasoned that if the leftist Bloomberg lackeys liked it, there had to be something there, so I went to Amazon and bought a copy.

The Times was right, but for the wrong reasons.

My impression is that the Times guys did little more than read the press release for The Gun, as all they really mentioned in their review was that the book told the story of the AK-47’s proliferation after the Soviet empire disintegrated.  The AK-47, of course, is the Kalashnikov-designed assault rifle that has become an iconic communist/terrorist/insurgency weapon.  The Gun makes the point that after the Soviet empire fell all eyes were on securing the Soviet nuclear arsenal, yet no Soviet nuclear weapon had ever killed anyone.  AK-47 rifles, however, were all over the world, and they had killed many people.   The production quantities were such that the Soviets could have issued 700 AK-47s to each of their soldiers.  They didn’t do that for obvious reasons…instead, the rifles proliferated and wound up in the hands of terrorists and other low-lifes all over the world.

While the above is interesting, it’s not what The Gun is all about. The book should perhaps have been titled The Guns, because what it focuses on are the differences between the AK-47 and the US weapon designed in response to it…the M-16. That, folks, is a fascinating story, and Chivers’ telling of it is masterful. The producibility, reliability, and engineering tradeoffs made by Colt and Kalashnikov for each of these weapons are fascinating. Colt focused on accuracy and precision, which made the early M-16s unreliable and less battle-worthy. The AK-47 focused on reliability, low cost, easy producibility, and just enough accuracy to make the weapon deadly. In the early Vietnam War days, there’s no question that the AK-47 was a superior rifle. Chivers’ explanations and comparisons of these two rifles make for great reading, and we use The Gun in our failure analysis, cost reduction, manufacturing leadership, and engineering creativity courses for just that reason.

State of the Art?

Posted in Manufacturing Improvement on March 14, 2012 by manufacturingtraining

Back to the photo I showed a week or so ago…

An Apache Rotor Blade Bond Joint

The photo above shows a bonded section of an AH-64A Apache helicopter main rotor blade in the area where you see the blue Dykem. It’s where the blade manufacturer and the Army experienced numerous disbonds, and it’s the problem the blade manufacturer had to solve.

An AH-64A Apache at Fort Knox, Kentucky

Before delving into the failure analysis, let’s consider the Apache rotor blade’s design and its history. The Apache helicopter has what are arguably the most advanced rotor blades in the world. They can take a direct hit from a 23mm ZSU-23/4 high explosive warhead and remain intact. During the Vietnam war, a single rifle bullet striking a Huey blade would take out the helicopter and everyone on board. When the Army wrote the specifications for the Apache, they wanted a much more survivable and much less vulnerable blade.

Vietnam-Era Huey Helicopters 

The Apache helicopter prime contractor designed a composite blade with four redundant load paths running the entire rotor blade length. The blade’s advanced design uses titanium, special stainless steels, and honeycomb, but those four redundant load paths were the key to its survivability. If one section of the blade took a hit with a 23mm warhead detonation, the three remaining load paths held the blade together. That actually happened once during the first Persian Gulf war, and the Apache helicopter made it back to its base. It’s an awesome design, but it had a production weakness.

Apache Rotor Blade Sectional View Showing Four Spars 

Let’s also consider the nature of the Apache production approach. Three entities are important here: The US Army (the Apache customer), the prime contractor (who designed the helicopter and its blade), and the blade manufacturer. The blade manufacturer was a build-to-print manufacturing organization. They built the blade in accordance with the helicopter prime contractor’s technical data package.

The manufacturing process consisted of laying up the blade in a cleanroom environment using special fixturing, bagging the blade components in a sealed environment, pulling a vacuum on the bag, transporting the blade to an autoclave, and then autoclave curing.  The autoclave cure was rigidly controlled in accordance with the prime contractor’s specification.

During production startup, many of the blades had a high rejection rate after the autoclave cure. The bond joint (where the stainless steel longitudinal spars overlapped, as shown in our photo above) frequently disbonded.  Eager to get the blade into production, the blade manufacturer, the prime contractor, and the Army pushed ahead.  They believed that due to the “state of the art” nature of the Apache blade’s design, a less-than-100% yield was inherent to the process.  The disbond failures continued into production.  To cut to the chase, the blade manufacturer continued producing the blade for the next decade with an approximate 50% rejection rate.  To make matters worse, blades in service on Apache helicopters only had about an 800-hour service life (the specification called for a 2,000-hour service life).

By any measure, this was not a good situation. The blade manufacturer had attempted to find the disbond root cause off and on for about 10 years, with essentially no success. While not happy, the Army continued to buy replacement blades, and they continued to send blades back to the prime contractor from the field for depot repairs. The prime contractor sent the blades back to the blade manufacturer. In retrospect, neither the prime contractor nor the blade manufacturer was financially motivated to fix the disbond problem.

After a change in ownership, the blade manufacturer realized the in-house blade disbond rework costs were significant. The new management was serious about finding and correcting the blade disbonds. Using fault-tree-analysis-based root cause analysis techniques, the company identified literally hundreds of potential failure causes. The failure analysis team found and corrected many problems in the production process, but none had induced the blade disbonds.  The failures continued. Surprisingly (or perhaps not surprisingly, considering the lively spares and repair business), the helicopter prime contractor did not seem particularly interested in correcting the problem.

After ruling out hundreds of hypothesized failure causes, one of the remaining suspect causes was the bondline width where the longitudinal spars were bonded together. That’s the distance marked on the macro photo with scribe marks on the blue Dykem (the photo I showed you earlier, and the one at the top of this blog entry).  During a meeting with the helicopter prime contractor, the blade manufacturer asked if the bondline width was critical. The prime contractor, evasive at first, finally admitted that this distance was indeed critical. The prime contractor further admitted that if the distance was allowed to go below 0.440 inch, a disbond was likely.

Armed with this information, the blade manufacturer immediately analyzed the prime contractor’s build-to-print rotor blade drawings. To their surprise, tolerance analysis showed the blade’s design allowed the bondline width to go as low as 0.330 inch. The blade manufacturer inspected all failed blades in house, and found that every one of the failed blades was, in fact, below 0.440 inch. It was an amazing discovery.
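
A worst-case tolerance analysis of this kind adds up the nominal dimensions that set a feature and then subtracts every tolerance to find the smallest value the drawings allow. The Python sketch below shows the arithmetic. The three dimensions and their tolerances are entirely hypothetical (they are not from the Apache blade drawings); only the 0.440 inch minimum comes from the account above.

```python
from dataclasses import dataclass

MIN_BONDLINE_WIDTH = 0.440  # inch; the critical minimum cited by the prime contractor


@dataclass(frozen=True)
class Dim:
    name: str
    nominal: float  # inch
    tol: float      # plus/minus tolerance, inch
    sign: int       # +1 if the dimension adds to the width, -1 if it subtracts


def worst_case(stack):
    """Return (minimum, nominal, maximum) of a linear tolerance stack."""
    nominal = sum(d.sign * d.nominal for d in stack)
    spread = sum(d.tol for d in stack)  # every tolerance at its worst at once
    return nominal - spread, nominal, nominal + spread


# Hypothetical stack, for illustration only.
stack = [
    Dim("spar A flange length", 1.250, 0.050, +1),
    Dim("spar B flange length", 1.250, 0.050, +1),
    Dim("spar locating distance", 2.000, 0.050, -1),
]

low, nominal, high = worst_case(stack)
print(f"bondline width: {low:.3f} min / {nominal:.3f} nominal / {high:.3f} max")
if low < MIN_BONDLINE_WIDTH:
    print("Conforming parts can still fall below the critical bondline width.")
```

In this made-up stack, every part is within its drawing tolerance and the bondline can still land below the critical width. Tightening the component tolerances is what moves the worst-case minimum back above the limit.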

The blade manufacturer immediately asked the prime contractor to change the drawings such that the bondline width would never go below 0.440 inch. The prime contractor refused, most likely fearing a massive claim from the blade manufacturer for a technical data package deficiency spanning several years.  The prime contractor instead accused the blade manufacturer of a quality lapse, stating that this was what allowed the bondline width to go below the 0.440 inch dimension.

The blade manufacturer explained the results of their tolerance analysis again, and once again pointed out that the blade design permitted the disbond-inducing condition. When the prime contractor refused to concede the point (and again accused the blade manufacturer of a quality lapse), the blade manufacturer took a different tack. As the repair facility, the blade manufacturer had blades in house for depot repairs from various points during the Apache program’s life (including the 12th blade ever built, which went back to the first year of production). All of these earlier failed blades had the same problem: They conformed to the technical data package, but their bondline width was below 0.440 inch.

The blade manufacturer, faced with an ongoing 50% rejection rate, decided to hold the blade’s components to much tighter tolerances than required by the prime contractor’s technical data package. By doing so, the blade manufacturer produced conforming blades with bondline widths above 0.440 inch. After implementing this change, the blade disbond rejection rate essentially went to zero.

So what’s the message here?  There are several:

  1. Don’t accept that you have to live with yields less than 100%. You can focus on finding and fixing a failure’s root cause if you are armed with the right tools. Don’t accept the “state of the art” argument as a reason for living with ongoing yield issues.
  2. Don’t think that simply because the product meets the design (i.e., there are no nonconformances) that everything is good. In many cases, the cause of a recurring failure is design related. Finding and addressing these deficiencies is often a key systems failure analysis outcome.
  3. If you are a build-to-print contractor, be wary.  The design agency may not always be completely open to revealing design deficiencies.
  4. It’s easy to become complacent and accept a less-than-100% yield as a necessary fact of life. In some cases, the yield is not just a little below 100%; it’s dramatically less than 100% (as occurred on the Apache rotor blade production program for many years).
  5. There are significant savings associated with finding and fixing recurring nonconformances. You can do it if you want to, and if you have the right tools.

You know, the wild thing about this failure and the Mast Mounted Sight failure mentioned a week or so ago is that the two companies making these different products were literally across the street from each other.  The Mast Mounted Sight was a true show stopper…it stopped production and it probably delayed the start of Operation Desert Storm.  The Apache blade didn’t stop production…it was just a nagging, long-term, expensive rework driver for the Army and the blade manufacturer.  Which one was more expensive?  Beats me, but if I had to guess, I’d guess that the ongoing (but non-show-stopping) nature of the Apache rotor blade failures carried a heftier price tag.

Do you have recurring in-process failures that you’d like to kill? Give us a call at 909 204 9984…we can help you equip your people with the tools you need to address these cost and quality drivers!

A Four-Step Problem Solving Approach

Posted in Manufacturing Improvement on March 8, 2012 by manufacturingtraining

In August 1990 the United States started sending military forces to the Persian Gulf with the intent of expelling Saddam Hussein’s forces from Kuwait. We called the buildup Desert Shield, and when we actually went to war on 16 January 1991, the name transitioned to Desert Storm. When Desert Storm finally started, the engagement was decisive. In short order, Kuwait was free of Iraqi forces. It was the beginning of the end for Saddam Hussein.

Desert Shield (the buildup) lasted a good 6 months. The question in those days was:  Why the delay?  We had our forces and those of allied nations in place relatively quickly. Why did 6 months elapse before we crossed the border into Kuwait to expel Saddam?

The true reasons for the lengthy delay may never be known, but I can tell you that a key component of our smart munitions delivery capability was not ready in August 1990. You all remember the dramatic videos…munitions being dropped directly down chimneys, one-drop hits, etc.  All that was made possible through laser-guided munitions (along with the bravery and skill of our fighting forces).

One of the key laser targeting devices was the Mast Mounted Sight, shown in the photo above.  It’s the thing that looks like a big basketball on top of the helicopter.

The Mast Mounted Sight contained a laser target designator, an infrared sensor, and a television sensor.  All were slaved to the pilot’s helmet.  Wherever the pilot looked, that’s where all three beams were supposed to point.  The Mast Mounted Sight had been in production and deployed on helicopters for years.  Everyone thought everything was fine.

But it wasn’t.

When the Desert Shield buildup started, the Army tested its Mast Mounted Sight systems a bit more rigorously, and it discovered what it and the manufacturer thought was an alignment error in the laser, IR, and television lines of sight.  This could have been disastrous.  It meant that the pilot might launch a missile based on the television or the IR sensor being on target, but the laser beam would guide the munition to the wrong spot.  If a miss occurred, it would alert the bad guys, and they could return fire against the helicopter.  Mind you, this system had been in production and deployed in the field for years.

The manufacturer went into high gear to find and fix the failure cause. The Mast Mounted Sight contains an internal alignment mechanism, which is supposed to align all three instruments (the laser, IR sensor, and the TV sensor).  The manufacturer spent the next 6 months looking for a problem in the MMS alignment subassembly.  They didn’t find anything.

Hold that thought.

Ever hear the joke about the drunk looking for his car keys at night under a street light?

It goes like this: I offered to help the drunk find his keys, and after we both searched for an hour, we came up empty-handed.

“Gee,” I said, “are you sure you dropped them here?”

“Oh, no,” responded the drunk. “I lost them over there, by those bushes in the dark…”

“Then why are you looking here under the street light?” I asked incredulously.

“Because I can see here,” he answered.

Many times when we have a production shutdown, or even a low-level recurring failure, finding the root cause is elusive. Production shutdowns get a lot of attention.  Recurring nonconformances frequently do not, but they can be just as expensive as a line-stopping failure (and sometimes more so).

So how do we go about finding the root cause of a failure?

Many years ago, the smartest man I ever knew once shared a simple four-step problem solving process with me.  It goes like this:

  • Define the problem
  • Define the causes
  • Define the solutions
  • Select the best solution

Where we usually go south when analyzing failures is with that first step: Defining the problem. Frequently, we start jumping to conclusions about potential causes without taking the time to fully understand the problem. The results are predictable: We spend lots of time chasing our tails, and the problem continues.

Need proof?  Try this exercise:  Tell your staff that you walked into a room, flipped the light switch, and the light did not illuminate.   Then ask them what the problem is.  In most cases, folks will immediately start listing potential failure causes: A broken filament, breaks in the wiring, a defective switch, failure to flip the switch properly, etc.  But those are all incorrect answers.

The question should be:  What is the problem?  That should be our first step.  In this case, the problem is that the light bulb does not illuminate.  All of the other suggestions listed above involved jumping to conclusions about potential causes.

Let’s turn back to the Mast Mounted Sight.  After several months of trying to find a failure cause in the MMS alignment mechanism, the failure analysis team finally decided to take a step back. They reviewed the test data again, and to their amazement, they found that the TV and the laser were aligned.  Only the IR sensor was out of alignment.  The failure analysis team had been solving the wrong problem.  Once the problem came into focus, the team looked outside the alignment mechanism, and they found an IR window heater anomaly.  The fix was a simple software patch.  It was implemented on 15 January 1991, and US troops rolled across the Kuwait border on 16 January 1991.

Would you like to know more about our fault-tree-based Root Cause Failure Analysis training program, or perhaps our book on Systems Failure Analysis?   Check out our Root Cause Failure Analysis page, and give us a call at 909 204 9984 if you would like to know more!

Ppass versus Reliability

Posted in Manufacturing Improvement on March 3, 2012 by manufacturingtraining

Many times companies use sampling techniques to assess a production lot’s acceptability.  You know the drill…you pull a specified sample size, and if all of the samples are acceptable you buy the lot.  If any are unacceptable, you reject the lot.  This approach often works for components, assuming the sample represents the rest of the lot.   But what about larger subassemblies or complete systems?  Does it work for them, too?

Here’s the basic question:  Is your acceptance testing approach consistent with your product’s required reliability?

This is an area where a lot of companies (and buying organizations) put themselves in a serious bind without realizing what they are doing.  In the munitions game, for example, it’s pretty common to pull a specified sample and buy the lot if all of the samples go bang.  The problem is that we think if a product’s reliability is high (say, 95%), we ought to be able to pull a sample and have them all work.  That’s not the way it works in the real world, though.   We can’t go with our intuition here; we have to evaluate the probability of passing the acceptance test more rigorously to assure that it is consistent with the required reliability of whatever it is we are testing.

I first ran into this at Aerojet when we were building munition fuzes.   We were failing most of our lot acceptance tests, and we thought we had a pretty good product.   The submunition had a 95% reliability requirement, and in live tests we showed we met that requirement.  We routinely dropped bombs and had more than 95% of the submunitions detonate.

We had a lot acceptance requirement on the fuzes, however, that required firing a sample of 32 with no failures.   We were only passing about one lot out of every five.  What was going on?

What we didn’t realize (at least initially) is that there’s a fundamental difference between demonstrating a product’s reliability and passing a specified-sample-size test with zero failures.   There’s a relationship between a product’s reliability and the probability of passing its acceptance test that can be shown with something called an operating characteristic curve.   For that test I just described (n = 32, acc/rej = 0/1), the x-y plot below shows it clearly:

Check the above plot, and you’ll see that with a product reliability of 95%, you’ll only pass the acceptance test about 20% of the time (and that was exactly what we were experiencing).
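The numbers behind that curve can be sketched in a few lines. For an accept-on-zero plan, and assuming every unit works independently with the same reliability R (the binomial model), the probability of passing is R raised to the sample size. The function name `pass_probability` and the list of reliabilities are illustrative choices, not part of the original analysis.

```python
def pass_probability(reliability: float, sample_size: int) -> float:
    """Probability that all units in the sample work (accept on 0 failures).

    Assumes independent units that each work with probability `reliability`.
    """
    return reliability ** sample_size


SAMPLE_SIZE = 32  # the lot acceptance sample size described in the post

# Tabulate a few points on the operating characteristic curve.
for r in (0.90, 0.95, 0.98, 0.99, 0.999):
    print(f"reliability {r:.3f} -> P(pass) {pass_probability(r, SAMPLE_SIZE):.3f}")
```

At a reliability of 0.95 this works out to roughly 0.19, in line with the one-lot-in-five pass rate described above.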

When we explained this to our Air Force customer, they didn’t like what they were hearing, but they recognized and agreed with the mathematics.  Ultimately, they modified the fuze acceptance test requirement so that it was consistent with the product’s required reliability.  Don’t think that this allowed lower quality munitions to get into the inventory, either.  That particular munition system routinely delivered reliability well in excess of its requirements, and during the 1991 Persian Gulf War, it was the munition that took out the bulk of Saddam Hussein’s Republican Guard tanks.

Another manufacturer was not so lucky.  They manufactured flares for the US Navy, and they encountered precisely the same problem with precisely the same numbers.  The Navy’s reliability requirement was 95% (which the flare met), but they imposed that same lot acceptance requirement (a sample size of 32 flares, accept on 0 failures, reject on 1 or more failures).  Predictably, the company failed 80% of their lot acceptance tests.   Unfortunately, in this case, neither the Navy nor the manufacturer realized what was happening.

I know about that second situation because I was an expert witness when the manufacturer sued the Navy.   When I testified at the Armed Services Board of Contract Appeals, my task was to explain all of the above in a manner that lawyers and the trial judge could understand.  In my experience, lawyers and judges don’t grasp probability and statistics concepts easily, so just stating that the situation was governed by the binomial distribution wasn’t going to cut it.

I went shopping the night before I testified and bought two bags of coffee beans (one with white beans, and one with brown beans).   I put 5 white beans in a bag (representing unreliable product), and 95 brown beans in the same bag (representing product that would work).  I stuck the bag in my pocket the next morning and went to court.

After explaining the binomial distribution, the nature of the relationship between a product’s reliability and the probability of passing a test, and the x-y plot you see above, I could see that the judge (who was a good guy) had glazed over.  When I finished, I told the judge I could demonstrate the concept for him.  I pulled the bag of coffee beans out of my pocket and explained the contents, and I offered to pull out 32 beans.  The lights came on.  The judge smiled.  He told the Navy’s attorney to pull the beans out of the bag.  The 17th bean was a white one, representing a flare that wouldn’t work (and a failed lot acceptance test).
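The bean demonstration can also be checked by counting. A bag of 100 beans is a finite lot sampled without replacement, so the sketch below uses the hypergeometric form rather than the binomial one. That modeling choice, and the function name `pass_probability_finite_lot`, are assumptions made for illustration and are not taken from the testimony.

```python
from math import comb


def pass_probability_finite_lot(lot_size: int, defectives: int, sample_size: int) -> float:
    """Probability of drawing `sample_size` items with no defectives.

    Sampling is without replacement from a lot containing exactly
    `defectives` bad items (the hypergeometric zero-defect case).
    """
    good = lot_size - defectives
    # Ways to pick an all-good sample divided by ways to pick any sample.
    return comb(good, sample_size) / comb(lot_size, sample_size)


# The bag from the courtroom: 95 brown beans, 5 white beans, 32 draws.
p = pass_probability_finite_lot(lot_size=100, defectives=5, sample_size=32)
print(f"P(no white bean in 32 draws) = {p:.3f}")
```

Under these assumptions the chance of pulling 32 beans without hitting a white one is roughly 14%, so a white bean turning up partway through the draw was the likely outcome, not a lucky one.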

It was a cool display, and it was a deciding factor in the manufacturer winning its $25.4 million claim against the Navy.  That little demonstration was cited as one of the best Armed Services Board of Contract Appeals explanations that year.

So, think about this…when you specify (or agree to) a sample-based test, what’s the reliability of the thing you’re testing, and is it consistent with your test?  If you are failing sample-based acceptance tests, you may simply have an overly stringent acceptance test.   These kinds of evaluations sound complicated, but Excel makes it a lot easier than it used to be.  The operating characteristic curve is one of the key concepts we should always consider in such situations, and it’s a key part of the root cause failure analysis training we offer.
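For anyone who would rather script the check than build it in a spreadsheet, here is a minimal sketch of the same evaluation for plans that accept on more than zero failures. It assumes the binomial model (independent units with a common reliability). The function name `acceptance_probability` and the acceptance numbers tried are hypothetical, chosen only to show how the pass probability moves as the plan is relaxed.

```python
from math import comb


def acceptance_probability(reliability: float, sample_size: int, accept_number: int = 0) -> float:
    """Probability of at most `accept_number` failures in the sample.

    Binomial model: each unit fails independently with
    probability 1 - reliability.
    """
    p_fail = 1.0 - reliability
    return sum(
        comb(sample_size, k) * p_fail**k * reliability ** (sample_size - k)
        for k in range(accept_number + 1)
    )


# Compare candidate plans for a product that just meets a 95% requirement.
for c in (0, 1, 2, 3):
    p = acceptance_probability(0.95, 32, accept_number=c)
    print(f"n = 32, accept on {c} or fewer failures: P(pass) = {p:.3f}")
```

Running the same function across a range of reliabilities for a fixed plan traces out that plan's operating characteristic curve.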