Measuring Product Reliability

About the RIAC Blueprints
The RIAC "Blueprints for Product Reliability" are a series of documents published by the Reliability Information Analysis Center (RIAC) to provide insight into, and guidance in applying, sound reliability practices. The RIAC is the Information Analysis Center chartered to be a centralized source of data, information and expertise in the subjects of reliability, maintainability and quality. While sponsored by the US Department of Defense (DoD), RIAC's charter addresses both military and commercial communities with the requirement to disseminate guidance information in these subjects. The Blueprints serve to provide information on those approaches to planning and implementing effective reliability programs based on experience, lessons learned, and state-of-the-art techniques. To make the Blueprints as useful as possible, the approaches and procedures are based on the best practices used by commercial industry and on the concepts documented in many of the now-rescinded military standards. The tree shown in Figure 1 depicts the Blueprints that make up the series.

In the government sector, and in particular the DoD, significant changes have been made regarding the acquisition of new products. Previously, by imposing standards and specifications, a DoD customer would require contractors to use certain analytical tools and methods, perform specific tests in a prescribed manner, use components from an approved list, and so forth. Current policy emphasizes the use of commercial technology as well as specifying "performance-based" requirements only, with suppliers left to determine how to best achieve them.

Figure 1. RIAC Blueprints for Product Reliability


Users of the RIAC Blueprints
The Blueprints are designed for use in both the government and private sectors. They address products ranging from completely new commercial consumer products to highly specialized military systems. The documents are written in a style that is easy to understand and implement whether the reader is a manager, design engineer or reliability specialist. In keeping with the new philosophy of the DoD, which is now similar to that of the private sector, the Blueprints do not provide a cookbook of reliability tasks that should be applied in every situation. Instead, some general principles are cited as the underpinnings of a sound reliability program. Then, many of the tasks and activities that support each principle are highlighted in detail sufficient for the user to determine if a task or activity is appropriate to his or her situation.



SECTION ONE - Introduction

The purpose of this Blueprint, Measuring Product Reliability, is to provide guidance in applying various measurement techniques in a tailored reliability program.

Measurements are an integral part of any program. Reliability, like all other product characteristics, needs to be measured at various times and in various ways, depending on the program. Section Two discusses measurement philosophy. Section Three presents an assortment of measurement techniques which can be applied where they add value.

The discussion of each measurement technique will consider:
  • Purpose (what)
  • Benefit (why)
  • Timing (when)
  • Application guidelines (how)




SECTION TWO - Measurement Philosophy

Reliability is traditionally considered to be a performance attribute that is concerned with the probability of success and frequency of failures, and is defined as:

       The probability that an item will perform its intended function under stated conditions, for either a specified interval or over its useful life.

Reliability is a measure of a product's performance that affects both product function and operating and repair costs. Too often performance is thought of only in terms of speed, capacity, range, and other "normal" measures. If, however, a product has such poor reliability that it is seldom available for use, these other performance measures become meaningless. Reliability is also critical to safety and liability issues.

Superficially, measuring reliability is a simple matter. One merely counts failures and divides by operating time to come up with a mean time between failure (MTBF), the most common reliability measure. However, one would be remiss in accepting this number without considering many factors, such as:
  • Is MTBF the appropriate measure? If not, what should it be?
  • Is the assumption of a constant failure rate reasonable? If not, how should the data be analyzed?
  • Is the definition of failure appropriate?
  • Did reliability change during the test period? If so, what is the current MTBF?
  • Did the test environment simulate the use environment? If not, can a correlation be made?
  • Is the test cost-effective and of sufficient length to meet the risk constraints of both the supplier and the customer?
  • How much confidence does the data provide in the measured results?
  • Does it answer the supplier's needs? If not, what else is needed?
  • What can be done to measure reliability before prototypes of the product, and the ability to test them, are available?
This discussion merely shows that measuring reliability starts with a philosophy. The methods used should be those that meet the customer's needs in accordance with the strategy of the organization making the measurements. For this reason, a variety of measurement methods, including both test methods and specialized analytical techniques, have been developed.

2.1 Calculation of Confidence Limits

As an example, consider the simple case of measuring product MTBF from field data for the purpose of establishing a warranty. Suppose the results indicate an MTBF of 1000 hours. Other factors aside, this figure could represent 1000 hours of operation with one failure or 10,000 hours of operation with 10 failures. The latter case is based on more data than the one-failure case, and therefore should be closer to the "truth". Assume that the manufacturer accepts the test data as representing the MTBF expected in the field (i.e., the assumption of a constant failure rate is valid, the test conditions are reasonable, and the products tested are representative of those to be manufactured), but that he wants a 90% confidence in the MTBF on which to base a warranty (i.e., the manufacturer wants to be 90% sure that the warranty MTBF is not higher than the true MTBF of the product on warranty). The measured value of MTBF would not be used to establish the warranty, but, rather, a lower value of MTBF obtained from the formula:
  MTBF = 2T / χ²(c, d)
where,
  T = Total test time
  χ²(c, d) = Value at the "c" percentage point of the chi-squared distribution with "d" degrees of freedom
  d = 2(N + 1)
  N = Number of failures

The value χ²(c, d) is read from a standard chi-squared distribution table. For the case of one failure in 1000 hours, the MTBF with 90% confidence is:
  MTBF = 2(1000) / χ²(90, 4) = 2000 / 7.78 = 257 hours

For 10 failures in 10,000 hours:
  MTBF = 2(10,000) / χ²(90, 22) = 20,000 / 30.81 ≈ 649 hours

The difference between a warranty based on 257 hours and one based on 649 hours is a competitive advantage to be weighed against the cost of the additional test hours.
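For readers who want to reproduce these figures, the bound is a one-line calculation given any chi-squared quantile function. The sketch below is illustrative only and assumes Python with scipy available; any statistics library with a chi-squared inverse CDF would serve:

  from scipy.stats import chi2

  def mtbf_lower_bound(total_time, failures, confidence=0.90):
      # One-sided lower confidence bound on MTBF for a time-terminated
      # test with a constant failure rate: MTBF = 2T / chi-squared(c, d)
      dof = 2 * (failures + 1)                     # d = 2(N + 1)
      return 2 * total_time / chi2.ppf(confidence, dof)

  print(mtbf_lower_bound(1000, 1))     # ~257 hours
  print(mtbf_lower_bound(10000, 10))   # ~649 hours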

2.2 Constant Failure Rates

Many compendiums of failure rate information for electronic parts have been prepared under the assumption that the failure rates will be reasonably constant over time. This assumes that "burn-in" before delivery or stringent process control has eliminated the period of declining failure rates when "infant mortality" failures due to defects occur, and that the parts will experience no wearout mechanisms within the useful life of the product. Failure rates which are not constant can often be treated as constant, under conditions such as those described below:
  • "Endless burn-in" refers to a theory that the failure rate of a population of parts will improve throughout its useful life as defective parts fail and are replaced. Even if this theory holds true, there usually is a point at which the change is so slight as to permit the use of an assumed constant failure rate.
  • A population of parts with mixed ages will appear to have a constant failure rate. For example, light bulbs display a normal distribution of failures, but if bulbs are replaced as they fail, the population will ultimately show a constant failure rate (see Figure 2).
Figure 2. Light Bulb Failures vs. Time
  • A product made up of many different parts with many different failure distributions will exhibit a random distribution of failures as if it had a constant failure rate.
  • Components subject to wearout mechanisms can be assumed to have constant failure rates for small intervals of their life. For example, automobile transmissions can be assumed to have a constant failure rate between zero and 10,000 miles of use, a higher constant failure rate between 10,000 and 20,000 miles, etc. The results of this assumption will be conservative (i.e. pessimistic), which may lead to unnecessary expense in some cases, but may be preferable to optimistic results.
Failure rates in many compendiums are obtained from field or test data fitted to models which account for part quality, operational stresses and other factors. The models are only as good as the data representing the parts of interest. For example, microcircuit models have become obsolete quite rapidly, due to the explosive changes in the technology. Table 1 lists some of the numerous compendiums of part failure rate models, some of which were developed for specific applications.

Table 1. Failure Rate Compendiums Assuming Constant Failure Rates
MIL-HDBK-217 Reliability Prediction of Electronic Equipment (Parts Count Method)
MIL-HDBK-217 Reliability Prediction of Electronic Equipment (Stress Analysis Method)
Bellcore Reliability Prediction Procedure
British Telecom Handbook of Reliability Data (HRD4)
Nippon T&T Standard Table for Semiconductor Devices
French National Center for Telecommunications (CNET) Stress Model
French National Center for Telecommunications (CNET) Simplified Model
Siemens Procedure
SAE Automotive Electronic Reliability Prediction

A great advantage of models assuming constant failure rates is their ease of use. For example, if a system is composed of 5 components with failure rates of .0010, .0020, .0025, .0030 and .0020 failures per hour, respectively, the system reliability is calculated as:

  R = e^-(0.0010 + 0.0020 + 0.0025 + 0.0030 + 0.0020)t = e^-(0.0105)t

For t = 100 hours:
  R = e^-(0.0105)(100) = e^-1.05 = 0.35

Hence, when failure rates are constant, the system reliability can be calculated directly from the sum of the part failure rates. In contrast to other methods, such as handling a combination of parts with Weibull failure rates, this provides an attractive simplicity.
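A minimal sketch of this parts-count style calculation, assuming only constant failure rates and a series configuration:

  import math

  def series_reliability(failure_rates, t):
      # R(t) = exp(-(sum of part failure rates) * t)
      return math.exp(-sum(failure_rates) * t)

  rates = [0.0010, 0.0020, 0.0025, 0.0030, 0.0020]  # failures per hour
  print(series_reliability(rates, t=100))            # ~0.35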

2.3 Reliability Physics

The reliability physics approach, sometimes called "physics-of-failure", assumes that with all manufacturing processes under tight control, manufacturing and material defects and process variabilities can be eliminated entirely (a zero failure rate during the traditional infant mortality and useful life segments of the bathtub curve), so that only wearout-type failures will occur in fielded equipment. This leads to the concept of a product achieving a failure-free operating period during its useful life.

A reliability physics analysis looks at individual failure mechanisms such as electromigration, solder joint cracking, die bond adhesion, etc., to estimate the probability of device wearout during the product's useful life. The methodology requires detailed knowledge of all material characteristics, geometries and environmental conditions. Specific models for each relevant failure mechanism are available from a variety of reference books. As an example, a typical model for microcircuit bond pad/die shear fatigue is shown below, where the dependent coefficients are determined through the use of published manuals on material characteristics:

  t50 = A2(K2 ΔT)^n2        (1.2)
where,  
  t50 = Median time-to-failure (hours)
A2 = Bond pad material dependent coefficient
K2 = Die material dependent coefficient
n2 = Bond wire material dependent coefficient
ΔT = Temperature change of bond pad and die (°C)
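To illustrate how such a model is exercised, the sketch below simply evaluates the fatigue equation; the coefficient values are placeholders chosen only to show the calculation, not published material constants:

  def bond_fatigue_t50(A2, K2, n2, delta_T):
      # t50 = A2 * (K2 * delta_T) ** n2, with coefficients taken
      # from material-characteristics handbooks
      return A2 * (K2 * delta_T) ** n2

  # Hypothetical coefficient values, for illustration only:
  print(bond_fatigue_t50(A2=1.0e9, K2=0.5, n2=-3.0, delta_T=40.0))  # 1.25e5 hours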

2.4 Product Program Phases

Each product, from the simplest to the most complex, passes through a sequence of phases during its life cycle. The definitions of the phases vary among commercial companies, and within the military. Table 2 describes the sequence of general phases that will be used in this document to describe a product's life, and the appropriate timing of when measurements can be made.

Table 2. Product Life Cycle Phases
Concept/Planning
  • Formulate ideas, estimate resources and financial needs
  • Identify risks & requirements
  • Program objectives
Design/Development
  • Identify and allocate needs and requirements
  • Propose alternate approaches
  • Design and test the product
  • Develop manufacturing, operating, and repair/maintenance tasks
Production/Manufacturing
  • Refine and implement manufacturing procedures
  • Finalize production equipment
  • Establish quality processes
  • Build & distribute the product
Operation/Repair
  • Implement operating, installation and training procedures
  • Provide repair and maintenance service
  • Repair warranty items
  • Provide for performance feedback
Wearout/Disposal
  • Implement refurbishment and disposal tasks
  • Resolve potential wearout issues

What distinguishes one phase from the next is generally a decision milestone, sometimes referred to as a "gate." It represents a point in time where the program can go forward or stop. For many products, the phases may be abbreviated or combined. For example, the Concept/Planning and Design/Development phases may be combined under a compressed schedule for a new product that is simply an update or slightly modified version of an older, proven product. Reliability measurement tasks for this type of program would concentrate only on the differences between the old and the modified product. As a result, the number of engineering tasks would be reduced. It is important to understand that tasks performed in one phase are often the result of the analysis, trade-offs and planning performed in an earlier phase. For example, analytical measurement techniques addressing alternative approaches to the reliability of printed circuit boards would be performed during Design/Development, with reliability testing to measure the results of the process decision following during the Production/Manufacturing phase.

Table 3 provides an overview of those tasks that have been commonly used to measure quantitative and qualitative product reliability, through either analysis or test. In later sections that address the relevance of each task in measuring reliability, the reader should place the emphasis on the value added for the customer, not the nature of the task itself, in tailoring its effectiveness.

Table 3. Reliability Tasks Relevant to Measuring Reliability
Analysis
  • Durability Analysis. Determination of whether or not the mechanical strength of a product will remain adequate for its expected life. (Section 3.2)
  • Failure Modes, Effects & Criticality Analysis (FMECA). Systematically determining the effects of part or software failures on the product's ability to perform its function. This task includes FMEA. (Section 3.3)
  • Failure Reporting, Analysis & Corrective Action System (FRACAS). A closed-loop system of data collection, analysis and dissemination to identify and correct failures of a product or process. (Section 3.4)
  • Fault Tree Analysis (FTA). Using deductive logic to determine the possible causes of a defined undesired operational result. (Section 3.5)
  • Life Cycle Planning. Determining reliability (and other) requirements by considering the impact over the expected useful life of the product. (Section 3.1)
  • Predictions. Estimation of reliability from available design, analysis or test data, or data from similar products. (Section 3.6)
  • Sneak Circuit Analysis (SCA). Investigation to discover the existence of unintended signal paths in a product. (Section 3.7)
  • Worst Case Circuit Analysis (WCCA). Analysis of the effects of variability in the components of a product on the product's performance. (Section 3.8)

Test
  • Accelerated Life Testing. Testing at high stress levels over compressed time periods to draw conclusions about the reliability of a product under expected operating conditions, based on formulated correlation factors. (Section 3.10)
  • Production Reliability Acceptance Test (PRAT). Testing a product during production to assure that its reliability has not degraded (quantitative measurement). (Section 3.13)
  • Reliability Demonstration Test (RDT)/Reliability Qualification Test (RQT). Testing a product to measure, through demonstration, whether its reliability requirement has been achieved. (Section 3.11)
  • Reliability Growth Test (RGT)/Test, Analyze and Fix (TAAF). Testing a product to identify reliability deficiencies in order to eliminate their causes. (Section 3.12)
  • Test Strategy. Determination of the most cost-effective mix of tests for a product. (Section 3.9)




SECTION THREE - Measurement Techniques

3.1 Life Cycle Planning

3.1.1 Purpose. Life cycle planning is the development of design guidance and analysis and test strategy through consideration of the expected conditions impacting the product from its introduction into the marketplace to its eventual obsolescence. It is not a measurement per se, but a systematic and strategic application of a series of measurements to determine the expected stresses on the product and its ability to operate reliably under those stresses. It may call for analysis in lieu of some testing, the performance of specific tests, and combinations of analysis and test measurements.

3.1.2 Benefits. Life cycle planning represents the only possible way to assure that the product will operate reliably and economically for its entire product life cycle.

3.1.3 Timing. Life cycle planning must be done in the early stages of the Concept/Planning phase of the product life cycle, before it is designed, as the inherent design will be the largest contributor to its reliability and longevity. Life cycle planning includes development of a reliability measurement strategy to be used during the Design/Development and Production/Manufacturing product phases. While the planning is done early, the measurements involved are performed at appropriate stages of the product development over its life.

3.1.4 Application Guidelines. The key to effective life cycle planning is the determination of the environments in which the product will be operated. A personal computer for home use will not experience stresses as severe as the engine controls of an airliner. This difference will translate into different design rules and analysis/test measurement strategies under life cycle planning. The transportation, handling and storage stresses expected must also be determined for realistic planning to preclude reliability problems. Besides the physical environment (e.g., temperature, vibration, etc.), the use environment (e.g., speed, cycle rate, miles, etc.) must also be considered, especially for mechanical parts which usually exhibit wearout modes of failure proportional to the severity of the use environment. Environments can be measured or extrapolated from past experience.

The expected environment is an important factor in defining the strategy considered necessary to adequately measure the reliability of the product. These measurements may include:
  • Durability Analysis. Mechanical systems characteristically fail from wearout mechanisms, while electronic systems more closely exhibit a constant failure rate over their useful life. Durability testing may be necessary to establish reliability for mechanical products over the planned life under the expected environmental conditions.
  • Predictions of Reliability. Detailed predictions are based on failure rate models which require environmental inputs. Predictions based on experience with similar products must consider the degree of similarity of the environments.
  • Dedicated Tests. Growth tests and reliability demonstration tests should generally be performed using the expected environment. Accelerated tests need to be correlated to the expected use environment.
If the product will be subjected to significant periods of dormancy during its lifetime, its design and packaging should attempt to minimize the effects of failure mechanisms expected under dormant conditions (e.g., corrosion, condensation, etc.).

Table 4 provides a matrix of analysis and test tasks that may be considered as part of the product life cycle plan. A "plus" sign (+) indicates that the activity offers value to the program under that circumstance. A "minus" sign (-) means that the activity is probably not cost effective for that circumstance. A "question mark" (?) indicates that the activity may or may not add value for that circumstance, depending on the type of product. The circumstances considered are New Development (i.e., a product to be designed and built for the first time), COTS (an item available as a Commercial Off-the-Shelf product), Safety Critical (e.g., a nuclear plant control system), Dormancy (i.e., an item to be subjected to long periods in storage or otherwise unpowered), Long Life (an item likely to be in service for a relatively long time, such as the B-52 Aircraft), Harsh Environment (high shock, rapid thermal cycling, etc.), and S/W (Software) Development. The user should not blindly follow this chart. He should decide whether he agrees with the relationships and what weights should be put on them. This requires some familiarity with the methods. Also, other considerations not on the matrix should be identified and considered. These might include suppliers' reputations, the leverage that the manufacturer has with suppliers, the relative importance of reliability and product cost as competitive factors, and the customer's expectations.

Table 4. Life Cycle Planning Matrix for Measuring Reliability
Reliability Technique      New Dev.   COTS   Safety Critical   Dormancy   Long Life   Harsh Env.   S/W Dev.
Prediction                    +         -          +               +           +           +           +
FMECA                         +         -          +               +           +           +           +
FTA                           +         -          +               +           +           +           +
SCA                           ?         -          +               ?           ?           ?           +
WCCA                          ?         -          +               +           +           +           -
Durability Analysis           ?         -          +               +           +           +           -
FRACAS                        +         +          +               +           +           +           +
Accelerated Life Test         ?         -          +               +           +           +           -
RDT/RQT                       +         -          +               ?           ?           +           +
RGT/TAAF                      +         -          ?               ?           +           +           +
Life Cycle Planning           +         -          +               +           +           +           ?
PRAT                          ?         ?          +               ?           ?           ?           ?
Test Strategy                 +         -          +               +           +           +           +

3.2 Durability Analysis

3.2.1 Purpose. Durability analysis is used to estimate, through analytical measurement, the end of life of a product to ascertain whether or not it would be economical to continue operation beyond its originally intended life. Durability analysis measures the effects of time dependent failure mechanisms (i.e., wearout of the product), and is not concerned with technological obsolescence.

3.2.2 Benefit. Durability analysis provides a basis for a decision to extend the operating life of a product, or provides warning that such a decision would incur excessive risk. While there are many factors to consider, one cannot extend the operational life of a product without considering whether or not the product can endure the additional stress.

3.2.3 Timing. Some form of durability analysis is necessary during the Concept/Planning phase, before the product is built, in order to assure that it will survive its intended life. Further analysis may be performed, as necessary, during product Design/Development or before the product approaches Wearout/Disposal, to support consideration of extending the product's useful life. Results should be available before options are discarded (e.g., a decision to replace a product should be made while there is still time to create a replacement for it).

3.2.4 Application Guidelines. Durability analysis is based on the wearout characteristics of the product, which must be determined. The best way, if possible, is to test the product itself. A manufacturer of automatic transmission systems, for example, may test a sample of products to failure to determine their useful life. A trucking company may do the same by recording the mileage and transmission failures of the vehicles in its fleet. Either may use a Weibull analysis.

A Weibull analysis is one graphical method for determining an equation relating the percentage of products failed against operating time, miles driven, number of cycles, etc. On commercially available Weibull analysis graph paper, the time at which each failure occurs is plotted against the percentile of the population on test which have failed at that time. A straight line plot indicates a good fit to the Weibull distribution:
  F(t) = 1 - e^-(t/θ)^β

where F(t) is the fraction of the population failed at time t, and the parameters θ and β are derived from the graph: θ is the characteristic life, the point at which 63% of the population have failed, and β is the shape parameter, given by the slope of the plotted line.

Once the equation has been formulated, the user can predict the point at which any percentile of the population will have failed, which may be, as an example, the 50th percentile (to determine the time at which half of the products have worn out) or the tenth percentile (to determine the time when ten percent have failed and 90% remain operational). Whatever the criteria, the means would exist to quantify the probability of successful life extension to any given point in time.
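Although the Blueprint describes the graphical (probability paper) method, the same fit can be done numerically by linearizing the Weibull CDF, ln(-ln(1 - F)) = β ln t - β ln θ, and regressing against median ranks. A sketch assuming Python with numpy, using made-up failure times:

  import numpy as np

  def weibull_fit(failure_times):
      # Least-squares fit of the linearized Weibull CDF, using
      # Bernard's median-rank approximation for F at each ordered failure
      t = np.sort(np.asarray(failure_times, dtype=float))
      n = len(t)
      F = (np.arange(1, n + 1) - 0.3) / (n + 0.4)
      beta, intercept = np.polyfit(np.log(t), np.log(-np.log(1.0 - F)), 1)
      theta = np.exp(-intercept / beta)   # characteristic life (63.2% point)
      return beta, theta

  beta, theta = weibull_fit([95, 160, 210, 270, 340, 430])  # hypothetical data
  print(f"shape beta = {beta:.2f}, characteristic life theta = {theta:.0f}")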

Besides testing, the analyst can also take advantage of published information on parts and materials in common use. For example, suppliers of ball bearings often provide a formula for predicting the tenth percentile of bearing life based on load and number of revolutions:

  L10 = (C/P)^K × 10^6 revolutions
where,  
  C = Basic load rating
P = Equivalent radial load
K = 3 for ball bearings
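Worked numerically, with load values that are hypothetical and chosen only to show the arithmetic:

  def bearing_L10(C, P, K=3):
      # L10 = (C / P) ** K * 1e6 revolutions; K = 3 for ball bearings
      return (C / P) ** K * 1e6

  # Hypothetical ratings: basic load rating 5000 N, equivalent radial load 1000 N
  print(bearing_L10(C=5000, P=1000))  # 1.25e8 revolutions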

The durability analysis of structural products can be based on published "S-N Curves," such as shown in Figure 3 for a copper alloy. These curves relate the number of cycles of expected life based on the load applied. Expected life can be defined in many ways, such as the tenth percentile, the 50th percentile, or the characteristic life (63rd percentile). The user must be aware of what value a chart uses. In Figure 3, data is presented as a scatter band showing the spread in a number of measurements.

Figure 3. Copper Alloy S-N Curve (Example)

When the load varies on a material, it is still possible to use an S-N curve by determining an equivalent load. This is a constant load which would result in the same lifetime as the actual variable load. One procedure, based on Miner's rule, is illustrated in Table 5.

Table 5. Example of Equivalent Load Determination
Step 1. For each load, determine the expected lifetime and the number of cycles under that load for a given period.
  Example: A material is exposed to a load consisting of 100 cycles at 230 load units and 50 cycles at 180 load units. From Figure 3, the expected life at 230 units is approximately 10^5 cycles, and at 180 units approximately 10^6 cycles.

Step 2. Set up the equality:
  Σ(ni / Ni) = Σ(ni) / Neq
where ni = number of cycles under load (i) and Ni = expected life under load (i) from the S-N curve.
  Example: n1 = 100, n2 = 50, N1 = 10^5, N2 = 10^6
  100/10^5 + 50/10^6 = (100 + 50)/Neq
  1050/10^6 = 150/Neq

Step 3. Solve the equation for Neq. The equivalent load may then be found from the S-N curve by locating the point on the curve representing Neq and reading its value on the load axis.
  Example: Neq = 150 × 10^6 / 1050 ≈ 1.4 × 10^5 cycles. From Figure 3, the equivalent load would be approximately 200 load units.
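The procedure of Table 5 reduces to a few lines of code. A sketch reproducing the example above:

  def equivalent_life(cycle_blocks):
      # Miner's rule: sum(ni / Ni) = sum(ni) / Neq,
      # so Neq = sum(ni) / sum(ni / Ni)
      # cycle_blocks: list of (cycles applied ni, expected life Ni) pairs
      total = sum(n for n, _ in cycle_blocks)
      damage = sum(n / N for n, N in cycle_blocks)
      return total / damage

  # 100 cycles with life 1e5, 50 cycles with life 1e6 (Table 5 example)
  print(equivalent_life([(100, 1e5), (50, 1e6)]))  # ~1.4e5 cycles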

Another method for measuring the reliability of materials (i.e., the percentile surviving to a desired life) is the mechanical stress/strength interference method. It is based on the premise that both the stresses encountered and the strengths of materials vary according to some distribution. When the distributions of stress and strength overlap, as illustrated in Figure 4, a failure can occur.

Figure 4. Mechanical Stress-Strength Interference
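When stress and strength can both be treated as independent normal distributions, the interference probability has a closed form. A sketch, assuming normality, scipy availability, and hypothetical parameter values:

  from math import sqrt
  from scipy.stats import norm

  def interference_failure_prob(mu_stress, sd_stress, mu_strength, sd_strength):
      # P(stress > strength) for independent normal stress and strength:
      # Phi((mu_stress - mu_strength) / sqrt(sd_stress**2 + sd_strength**2))
      z = (mu_stress - mu_strength) / sqrt(sd_stress**2 + sd_strength**2)
      return norm.cdf(z)

  # Hypothetical values, in consistent load units:
  print(interference_failure_prob(400, 40, 600, 50))  # ~0.0009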

3.3 Failure Modes, Effects and Criticality Analysis (FMECA)

3.3.1 Purpose. Failure Modes and Effects Analysis (FMEA) and Failure Modes, Effects and Criticality Analysis (FMECA) are tools to determine the effects of failures of the parts making up a product, based on a bottom-up approach. The measurement provided is not a number, but a list of end effects stemming from the part failures.

3.3.2 Benefit. The systematic nature of FMEA and FMECA assures that every part in the product is considered in determining the effects of part failures on its performance. This comprehensive knowledge is the basis for recommended product reliability, maintainability and safety improvements. Criticality analysis helps determine the relative priorities of proposed changes.

3.3.3 Timing. Because it is comprehensive, an FMEA or FMECA at the part level can only be done when the parts list is complete (Design/Development phase). However, in order to permit corrective action, the analysis should be done before the design is frozen, or else the measurement is taken for no benefit. Similarly, a process FMEA must await process definition during the Concept/Planning and Design/Development phases but should be done before the process is fixed. A software FMEA should await software flow charting to determine the appropriate modules to be analyzed, but should precede coding.

3.3.4 Application Guidelines. An FMEA is a systematic approach to creating a list of recommended corrective actions from a list of parts or functional block outputs, as shown in Figure 5. This is true whether the analysis covers parts/functions (e.g., resistors, microcircuits, computers or printers), process steps (e.g., soldering and cleaning), or software code modules. FMEA/FMECA can be used for measuring reliability improvements, determining safety hazards, and helping to design troubleshooting procedures and built-in-test features. Its measurements are essential to implementing reliability centered maintenance (RCM). Processes and software can be improved with versions of FMEA developed for these items. Though systematic, FMEA does not include the effects of any factor other than parts/functions (or the process or software equivalents); human errors and external causes of failure are not considered. One approach to an FMEA/FMECA is described in the following list of activities. Each item describes a column of data to be entered on a worksheet. The data needed, and the worksheet organizing it, can change with the user's needs. For this illustration, the worksheet in Figure 6 would be appropriate.

Figure 5. Generic FMEA Approach


Worksheet columns: Parts List | Function | Failure Modes | Effect (Local, End) | Severity | Recommended Action

Figure 6. Generic FMEA Worksheet
  • List the Parts/Functional Blocks. The left hand column of the FMEA worksheet is a complete list of the parts or functional blocks comprising the system of interest.
  • List the Function of Each Part/Functional Block. In order to determine the effects of part/functional failures, it is essential to know what the parts/functions are intended to do.
  • List the Failure Modes of Each Part/Functional Block. For example, a resistor failure might be a short circuit, an open circuit or a change in resistance. Each of these may have a different effect on product performance.
  • Determine the Local Effect of the Failure Mode. For example, one "stuck" microprocessor pin may result in wrong data being put on a signal line, while another "stuck" pin might lock the address bus.
  • Determine the End Effect. Continuing the above example, wrong data on a signal line may result in incorrect sums on checks printed by a payroll system or garbled messages in a communications system, while a locked address bus might result in a system crash.
  • Determine the Severity. In a payroll system, a system crash may be preferable to the issuance of incorrect checks, while having some garbled messages may be preferable to a crash of a communications system. Hence, some measure of product impact severity is useful in setting priorities for corrective action. This can be represented by a ranking on a scale (e.g., a number between one and ten with one representing a negligible effect and ten representing a catastrophe). One procedure defines four categories of severity as follows:
Category 1 - Catastrophic: Causing death or physical damage to the product or other equipment
Category 2 - Critical: Causing severe injury, major property damage, or loss of product performance
Category 3 - Marginal: Causing minor injury, minor property damage, or degradation of product performance
Category 4 - Minor: Causing only unscheduled product replacement or repair
  • Recommend Corrective Action. The last column of an FMEA worksheet should list the recommended actions appropriate to correct or minimize the existence or severity of the failure. This can be as simple as a statement that a part should be purchased from a vendor with a good reputation for quality, or as complex as a recommendation that error detection and correction capability be added to a product.
Criticality Analysis. While severity is useful in setting priorities, it considers only the effect of a failure. Criticality analysis adds one or more dimensions to this. For example, one could consider both the severity and the probability of occurrence in calculating the criticality of an effect. The probability can be calculated if the part failure rate, the relative frequency of the different failure modes and the conditional probability that the failure mode will cause a product failure can be reasonably estimated. Otherwise, the failure mode can be assigned to one of the following levels:

Level A - Frequent: High probability of occurrence (>0.20) during product use
Level B - Reasonably Probable: Moderate probability of occurrence (>0.10, <0.20)
Level C - Occasional: Occasional probability of occurrence (>0.01, <0.10)
Level D - Remote: Unlikely probability of occurrence (>0.001, <0.01)
Level E - Extremely Unlikely: Probability of occurrence essentially zero (<0.001)

The priority of corrective actions can then be set by plotting the severity category against the probability of occurrence on a chart such as the one shown in Figure 7. The left hand scale can be either a calculated probability of occurrence or the level of probability of occurrence defined above. In either case, failure modes closest to the upper right hand corner of the chart are considered the most critical and should receive the highest priority for corrective action.

Figure 7. Determination of Corrective Action Priority

Existing commercial FMECA procedures often calculate criticality by computing a risk priority number (RPN). This is the product of three figures called severity, occurrence and detectability. Severity is defined on a scale of one to ten with one representing a minor effect and ten representing a catastrophe. Occurrence is defined on a similar scale with one representing something that almost never happens and ten representing something that happens quite often. Detectability refers to the likelihood of noticing the onset of the effect. For example, a blowout due to tire wear should never happen because the user can see how much tread he has left. Detectability is also rated on a scale with one representing effects that should never come as a surprise, and ten representing effects that are almost always a complete surprise. The three numbers are multiplied together, giving each failure mode an RPN between one and 1000. The higher the number, the greater the risk associated with the failure mode (and the need to identify/implement corrective action). RPN is used in an FMEA industry guideline published jointly by Ford, General Motors and Chrysler. A worksheet from the standard is shown in Figure 8.

Figure 8. Automotive Industry FMEA Worksheet
Excerpt from "Figure 8. Automotive Industry FMEA Worksheet" See Full Version


3.4 Failure Reporting, Analysis and Corrective Action System (FRACAS)

3.4.1 Purpose. A failure reporting, analysis and corrective action system (FRACAS) qualitatively measures the current reliability of the product. Although its data can be used to compute a numerical reliability estimate, its purpose is to identify problems in the product. It provides the data needed to identify deficiencies for correction, derived directly from product testing or use.

3.4.2 Benefit. FRACAS is one of the most valuable tools of reliability engineering. While it is always better to prevent problems, it is advantageous to identify and correct reliability problems which do occur before the product is released to the customer. FRACAS is the backbone of reliability improvement. It provides information needed for the timely identification and correction of design errors, part problems or workmanship defects. All of these deficiencies preclude the ability to achieve the inherent reliability potential in the design. The losses associated with the inability to achieve inherent product reliability can include significant direct costs in factory rework, scrap, or warranty service, and even greater indirect costs in lost customers.

3.4.3 Timing. FRACAS requires a source of data before it can be implemented. Once hardware/software begin to become available, and definition and implementation of processes has begun, a working FRACAS should be in place and failure data collected by the manufacturer from any tests and operational usage (Design/Development through Production/Manufacturing). The FRACAS should remain in use as long as the product is being supported by the manufacturer (i.e., through the Operation/Repair phases of the product). Customers may, and should, have their own FRACAS to identify operational reliability problems for correction during their use of the product.

3.4.4 Application Guidelines. An ideal FRACAS is shown in Figure 9.

Figure 9. An Ideal FRACAS
  1. Observation of the failure
  2. Complete documentation of the failure, including all significant conditions which existed at the time of the failure
  3. Failure verification, i.e., confirmation of the validity of the initial failure observation
  4. Failure isolation, localization to the lowest replaceable defective item within the product
  5. Replacement of the suspect defective item
  6. Confirmation that the suspect item is defective
  7. Failure analysis of the defective item
  8. Data search to uncover other similar failure occurrences and to determine the previous history of the defective item and similar related items
  9. Establishment of the root cause of the failure
  10. Determination, by an interdisciplinary team, of the necessary corrective action, especially any applicable redesign
  11. Incorporation of the recommended corrective action into development products
  12. Continuation of development tests
  13. Establishment of the effectiveness of the proposed corrective action
  14. Incorporation of effective corrective action into production equipment

The key to a successful FRACAS is its database and the use made of the data (closed-loop communication). This is particularly important in establishing the significance of a failure. For example, the failure of a capacitor in a Reliability Growth Test becomes more important if the database shows similar failures in incoming inspection of the part and in any environmental tests performed. For this reason, all available sources of data should feed the FRACAS. Initial failure reports should document, as applicable:
  • Location of failure
  • Test being performed
  • Date and time
  • Part number and serial number
  • Model number
  • Failure symptom
  • Individual who observed failure
  • Circumstances of interest (e.g., occurred immediately after power outage)
The failure documentation should be augmented with the verification of failure (step 3 in Figure 9), and verification that the suspect part did indeed fail (step 6). The number and formats of the failure reporting forms should be determined by the manufacturer to best meet its needs and any requirements of the customer. Figure 10 provides an example of a typical failure analysis report form.

Figure 10. An Example Failure Analysis Report Form
Excerpt from "Figure 10. An Example Failure Analysis Report Form" See Full Version

Once the failure is isolated, the FRACAS database and failure analysis can be used to determine its root cause. Given the root cause, appropriate corrective action can be defined and implemented. Failure analysis can be performed to various degrees, and usually requires some coordination with and cooperation from the part supplier. The most critical failures (i.e., those that occur most often, are most expensive to repair, or threaten the customer's safety) should receive the most comprehensive analysis, perhaps including X-rays, scanning electron beam probing, etc., which require specialized equipment. Where the manufacturer does not have a comprehensive failure analysis laboratory, outside independent laboratories can be hired to perform these functions.

3.5 Fault Tree Analysis (FTA)

3.5.1 Purpose. Fault Tree Analysis (FTA) is a top-down structured approach to identify the possible causes of an undesired product condition. FTA does not measure a numerical parameter, but qualitatively measures areas for improvement.

3.5.2 Benefit. Unlike FMECA, which considers only the effects of part and functional block failures, FTA considers all factors which can lead to a product failure, such as human factors, induced failures, and combinations of product hardware and software failures. Though more comprehensive than FMECA, it is less systematic, and hence subject to errors of omission.

3.5.3 Timing. A Fault Tree Analysis can be performed at any time from the creation of the product during the Concept/Planning phase, to the replacement of the product during the Wearout/Disposal phase, depending on whether it is used as a design tool, an aid to formulating operational procedures, or a post-mortem analysis to find the cause of an actual failure. The purpose of the measurement dictates the time it should be made. Fault Tree Analysis in the Design/Development phase can identify possible causes of product failure which may then be eliminated. In the Operation/Repair phase, FTA can identify operational factors leading to failure in order to formulate corrective actions. FTA can also be used after a failure occurs to determine the cause for establishing product liability responsibilities.

3.5.4 Application Guidelines. A basic fault tree relates an undesired event to possible causes through a tree-like network branching at "AND gates" and "OR gates." Figure 11 shows a partial fault tree for the event that an automobile will not start. It shows the problem may be due to electrical or fuel factors and that one electrical factor could be the combination of a weak battery and an unheated garage, if it is a cold day. Table 6 explains the symbology used.

Figure 11. Example Fault Tree

Table 6. FTA Symbology

Improving the Product Through FTA. The top event analyzed by the fault tree can be made less likely to happen by corrective actions which eliminate "OR gates" or create "AND gates". For example, Figure 12 is a fault tree for the top event of a fire in a car. Corrective actions based on the tree could include eliminating the possibility of a cigarette ash starting a fire by using non-flammable upholstery (which eliminates an OR gate), or the addition of fuses in the wiring (which creates an AND gate). Either corrective action makes a fire in the car less likely. An FTA after both corrective actions are implemented is shown in Figure 13.

Figure 12. Sample FTA Before Corrective Action

Figure 13. Sample FTA Following Corrective Action

Setting Priorities. In a large fault tree, the most important factors may not be obvious. These can be determined by a qualitative method called cut set analysis, or by a quantitative criticality analysis.

Cut Set Analysis. A cut set is a combination of basic events (the circles in Table 6) that result in the undesired event. When one basic event alone can cause the end event (a cut set of one element), it is referred to as a single point of failure. A minimum cut set is the smallest combination of events that will cause the end event. For example, the basic cut sets of Figure 14a are events 1 and 3; 2 and 4; 3; and 4. Since event 3 is a single point of failure, the cut set (1 and 3) is redundant. Since event 4 is also a single point of failure, the cut set (2 and 4) is also redundant. Hence, the minimum cut sets for Figure 14a are (3) and (4), two single points of failure.

Figure 14a. Example of Minimum Cut Sets

Quantitative Criticality Analysis. When the probability of each basic event can be estimated, it is possible to compute a number, called the criticality, from which the relative importance of the event can be determined. The criticality number is computed by multiplying the probability of the identified basic event by the conditional probability that, given the occurrence of the basic event, the end event will happen. For example, consider the fault tree presented below:

Figure 14b. Fault tree example


The number under each basic event is the probability that it will occur. The conditional probability that the end event will occur is determined from probability theory. For example, to determine the criticality of event 1, its probability of occurrence (.01) is multiplied by the probability that the end event will occur, given that event 1 has happened. The latter is computed as follows:

The end event (H) will occur when both events A and B occur. Its probability is the product of the probability that A will occur and the probability that B will occur. Since event A is connected to its causes (events 1 and 2) by an AND gate, its probability is the product of the probability that event 1 will occur and the probability that event 2 will occur. When calculating the criticality of event 1, however, the event is assumed to have occurred and its probability will be set to 1 so the probability of event A, given event 1 has occurred, is simply the probability that event 2 will occur (.03).

Event B is connected by an OR gate to its causing events, so either event 3 or event 4 will cause event B. To calculate its probability, we note that the probability of B occurring is one minus the probability that it will not occur, and that the probability of B not occurring is the product of the probabilities that events 3 and 4 each do not occur. With the probability of event 3 given as .04 and of event 4 as .05, P(B) = 1 - (1 - .04)(1 - .05) = .088, rounded to .09 in Table 7.

Note that when calculating the criticality of either event 3 or event 4, the probability of event B happening will be one, since either event will cause event B, and the event whose criticality is being computed is assumed to have happened (i.e., has a probability of occurrence of 1.0).

Using these relationships, the criticality of each of the four basic events of the example can be computed. The results are given in Table 7, which shows that events 1 and 2 are the most critical, and event 3 is the least critical.

Table 7. Example Criticality Calculations for FTA
Basic Event   P(Xi)   P(A|Xi)   P(B|Xi)   P(H|Xi)   Criticality = P(Xi) × P(H|Xi)
1             .01     .03       .09       .0027     .000027
2             .03     .01       .09       .0009     .000027
3             .04     .0003     1.0       .0003     .000012
4             .05     .0003     1.0       .0003     .000015
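The calculations in Table 7 can be reproduced directly. The sketch below hard-codes the example tree, H = (1 AND 2) AND (3 OR 4), and computes each event's criticality by forcing its probability to 1.0; note that Table 7 rounds P(B) to .09, while the code carries the exact .088:

  def p_or(*probs):
      # probability that at least one independent event occurs
      q = 1.0
      for p in probs:
          q *= 1.0 - p
      return 1.0 - q

  def p_top(p1, p2, p3, p4):
      # top event H = (1 AND 2) AND (3 OR 4)
      return (p1 * p2) * p_or(p3, p4)

  base = {"1": 0.01, "2": 0.03, "3": 0.04, "4": 0.05}
  for event, p in base.items():
      forced = dict(base, **{event: 1.0})  # assume the event has occurred
      p_h_given = p_top(*forced.values())  # P(H | Xi)
      print(event, p * p_h_given)          # criticality = P(Xi) * P(H | Xi)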

3.6 Prediction

3.6.1 Purpose. A prediction is a means for obtaining a quantified measurement of reliability by analysis, without actually testing the parameter of interest. There are many forms of reliability prediction. Each attempts to use available data to estimate the reliability that a product will experience in use. Table 8 provides an overview of the major prediction types, where "infant mortality" assumes a decreasing product failure rate, "random" assumes a constant product failure rate, and "wearout" assumes an increasing product failure rate.

Table 8. Overview of Major Reliability Prediction Methodologies
  • Empirical. Typically relies on observed failure data to quantify part-level empirical model variables. Premise is that valid failure rate data is available.
  • Translation. Translates a reliability prediction based on an empirical model to an estimated field reliability value. Implicitly accounts for some factors affecting field reliability not explicitly accounted for in the empirical model.
  • Physics-of-Failure. Models each failure mechanism for each component individually. Component reliability is determined by combining the probability density functions associated with each failure mechanism.
  • Similar Item Data. Based on empirical reliability design data from products similar to the one being analyzed. Product similarity should include complexity, maturity, manufacturing processes, design processes, function, and intended use environment. Uses specific product predecessor data.
  • Generic System Level Models. Based on empirical reliability field failure rate data on similar products operating in similar environments. Uses generic data from other organizations.
  • Test or Field Data. Product in-house test data is used to extrapolate estimated field reliability of the product.

3.6.2 Benefit. Predictions provide a quantitative measurement when testing is impractical or impossible. They can be the basis of selecting a design option to pursue further, or of measuring life cycle costs and probability of mission success for planning and budgeting before actual data is available.

3.6.3 Timing. A prediction can be made as soon as conceptual or real design data is available. This can occur as early as the Concept/Planning phase of some products. A prediction should be made early enough in the Design/Development phase to support the decision process, but late enough to use data that is as current as possible. For example, a prediction made to optimize a spare parts list should be made as late as possible to be sure the latest design changes are included. Reliability measurement by prediction should be made as needed until measurement by test is possible.

3.6.4 Application Guidelines. There are many ways to predict reliability, depending on the data available and the assumptions the user is willing to make. However, the method used should be based on the best available data to support that method.

In the Concept/Planning phase, little data is available for measuring reliability. One possible approach is to use data from similar products in service. The predictions will, of course, be only as good as the degree of similarity. Variations on this approach include the adjustment of the field data to account for differences in technology or use conditions between the field units and the proposed product, which, in turn, requires the formulation of adjustment factors from available data. Another variation is the collection of data by function (e.g., signal processing, power supply, etc.) rather than by total product. This can be very effective when the same functional modules are used to create many different products. Successful use of this method requires the creation of a database of field measurements. Warranty data and customer records are potential sources. Table 9 lists several of the advantages and disadvantages of each of the major reliability prediction methodologies.

Table 9. Reliability Prediction Methodologies - Advantages and Disadvantages
Methodology Advantages Disadvantages Comments
Empirical • Based on empirical data representing observed failure rates
• Product complexity implicitly accounted for in parts count
• Accounts for primary factors affecting reliability
• Difficult to keep up to date
• Assumes validity of a predefined model form
• Assumed component effects may actually be product level effects
• Correlated variables mask the effects of individual variables
• Methodology may be industry specific
The basic premise is the availability of valid failure rate data. Relevance of a particular methodology is based on (1) the phase of the product life cycle over which data is collected, (2) the quality of the data (non-relevant failures eliminated), (3) the product repair/replacement philosophy, and (4) the accuracy of the recorded failure data.
Translation
• Provides reasonable estimates of projected product field reliability
• Based on empirical relationships
• Accounts for non-inherent failures
• Requires the prior performance of a reliability prediction
• Valid only for the product types on which the models are based
The premise of this methodology is that the inherent product reliability is different from the achieved field reliability due to non-inherent field failure mechanisms (induced failures, design defects, etc.).
Physics-of-Failure • More accurate than generic models for wearout mechanisms
• Based on fundamental reliability parameters
• Based on component fabrication data
• Can be used only by those with access to detailed fabrication and materials data
• Relatively complex and difficult to use
• Does not address infant mortality and random failures
Identifies potential component failure mechanisms, appropriate time-to-failure distributions for each, and models the appropriate statistical parameters of the selected distribution as a function of component and stress variables.
Similar Systems Data • Based on empirical data
• Reflects actual performance of similar products
• Little analysis is needed
• Use of methodology in non-similar situations is difficult
• Difficult to obtain data to make reasonable estimates
• Requires data on a product of similar complexity, maturity, manufacturing process and design
Useful when (1) similar product exists, (2) there is a sufficient amount (quantity and quality) of empirical reliability data, and (3) a data reporting system exists to capture accurate reliability and maintenance data.
Generic System Level Models • Based on empirical data
• Quick, easy to use
• Account for non-inherent failures
• Cannot perform part trade-off analysis
• Do not provide insight into reliability drivers
• Cannot be used as a tool to enhance product design
Accuracy is typically less than those models which include the predicted MTBF as an input (predictions account for additional factors affecting system reliability).
Test Data • Based on empirical data
• Based on specific product of interest
• Requires translation from test to field conditions
• Test may not reflect actual end-use field stresses
• May not accurately represent the expected total time period of field use
Can include data from product reliability demonstration, reliability growth, and life tests, as well as environmental stress screening and product yield. The use of appropriate test data should consider (1) translation of the observed data in accordance with the stresses during test vs. those in actual use and (2) identification of the appropriate failure type (defect driven (infant mortality), event driven (random), or wearout) that is addressed by each test type.

The reliability data needed for prediction models can be derived in several ways. For example, engine reliability predictions can be based on experience on in-service aircraft, or on a Weibull plot of a set of engines on test. For components such as engines, where wearout modes of failures predominate, the age of the component at the start of product use is important. Where preventive maintenance is performed to preclude wearout failures, it may be possible to assume the failure rate will be close to constant with time, which will greatly simplify the calculations.

If reliability is defined as the probability of successful operation for a defined period of time in a defined use environment, the reliability of a product (any combination of components) can be analytically measured from the reliability of each component. Table 10 provides an illustration of the format and application of several product reliability measurement models.

More complex models than those presented in Table 10 consider standby redundancy (where "backup" components within the product are not turned on until one of the operating components fails) and cases where failed components can be repaired while the product is operating (see references).
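To make the application of these models concrete, the following minimal Python sketch computes series and active-redundant (parallel) product reliability from component reliabilities. The component values are hypothetical; the formulas are the standard series and parallel models of the kind illustrated in Table 10.

def series_reliability(component_rels):
    """Series model: the product fails if any component fails,
    so R = R1 * R2 * ... * Rn."""
    result = 1.0
    for r in component_rels:
        result *= r
    return result

def parallel_reliability(component_rels):
    """Active-redundant (parallel) model: the product fails only if
    all components fail, so R = 1 - (1 - R1)(1 - R2)...(1 - Rn)."""
    prob_all_fail = 1.0
    for r in component_rels:
        prob_all_fail *= (1.0 - r)
    return 1.0 - prob_all_fail

# Hypothetical example: three components, each with R = 0.95
# for the mission time of interest
print(series_reliability([0.95, 0.95, 0.95]))    # 0.857375
print(parallel_reliability([0.95, 0.95, 0.95]))  # 0.999875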

3.7 Sneak Circuit Analysis (SCA)

3.7.1 Purpose. Sneak circuit analysis (SCA) is a measurement tool which yields opportunities for improvement rather than numerical parameters. It is a means of identifying design flaws which result in an electronic product performing unintended functions, or failing to perform intended functions, because of unintended signal paths (sneak circuits).

3.7.2 Benefit. Once a product is in use, the correction of a sneak circuit is costly and could be impractical. In complex products, potential sneak circuits are not obvious and may not reveal themselves until the product is in use. SCA provides a means of identifying problems while the design can still be changed. Because it has been an expensive procedure, it has typically been applied only to products in critical applications, such as ballistic missiles and medical electronics.

3.7.3 Timing. SCA is best performed when the design is complete (near the end of the Design/Development phase) but not yet in fabrication. It may be performed earlier in design/development but should be updated as the design changes in order to remain effective. The later SCA is performed after the design has been "finalized", the more difficult and costly the implementation of changes becomes.

3.7.4 Application Guidelines. SCA is performed by determining the topology of a circuit and applying appropriate SCA evaluation rules.

Topology. The effects of topology are best illustrated with an example. Figure 15a shows the simplest topology, a single line of elements from power to ground. In this case, the elements are a switch to lower the landing gear of an airplane and the landing gear itself. There is no chance of an unintended signal path and no timing problem, so SCA would only consider sneak labels and indications (e.g., is it obvious which position of the switch makes the gear go down?). Figure 15b shows a more complex topology. Here, a switch is added to open the cargo door. The door cannot be opened unless the gear is down, as the door would normally only be opened after landing. This "ground dome" could present some other potential problems, such as the possibility that if the door switch is closed before the gear switch, the combined load could overload the power supply.

Table 10. Product Reliability Prediction Measurements


Figure 15a. Simple SCA Topology (Example)


Figure 15b. Moderately Complex SCA Topology (Example)

Figure 15c presents a still more complex topology. Another switch has been installed so that the cargo door can be opened in emergency situations without lowering the gear. The resulting topology presents the greatest risk for sneak circuit failures. If the emergency switch and the normal door open switch are both closed, the landing gear will come down. One possible fix to this sneak path would be to add a device which would prevent current flow from right to left through the normal door open switch. If the power is DC, a diode in series with the switch would eliminate the potential for the sneak circuit.

Figure 15c. Complex SCA Topology

3.8 Worst Case Circuit Analysis (WCCA)

3.8.1 Purpose. Worst Case Circuit Analysis (WCCA) measures the performance of an electronic circuit when part parameters vary from nominal values due to normal statistical variation, aging, operational stresses and environmental effects.

3.8.2 Benefit. WCCA provides assurance that a circuit will meet requirements under the worst combination of conditions in expected usage, or provides the information needed for corrective action. WCCA can also identify parts whose established derating limits can be exceeded under worst possible combinations of the expected operational conditions and the variations in the parameters of other parts in the circuit.

3.8.3 Timing. WCCA is perhaps best used as one of the tools of the designer in "measuring" design options during the product Design/Development phase. Later use, for example by a safety analyst or chief designer at the end of product development, will require complete schematics, and the analysis should be redone if any changes are made. As usual, the farther along the product development, the more difficult and expensive it will be to make changes. Hence, WCCA should be applied as early as possible after a proposed circuit design is complete. WCCA can also have value in determining the cause of a field failure, but it is far more effective to use it earlier to prevent the failure.

3.8.4 Application Guidelines. There are various ways to perform a WCCA, with differing advantages and penalties. Most WCCA is done using computer tools, and a circuit analysis program compatible with Computer Aided Design software has a decided advantage.

Extreme Value Analysis. Extreme value analysis (EVA) analyzes a circuit output with all variables set to the worst possible values. As an example, the output frequency of an electronic filter will vary as the parameters of its components vary away from nominal. To perform an EVA, the worst expected values of each component, in both directions from nominal, should be determined. The output frequency is then calculated (1) with all the components at their extreme values in the direction which would increase the output frequency and (2) with all components at their extreme values in the direction which would decrease the output frequency. The calculated values are then compared to the specified limits to evaluate the robustness of the circuit. If the frequency is within specified limits when the components are at extreme values, part variation should be no problem in normal operation.
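As an illustration, the following Python sketch performs an EVA on a hypothetical first-order RC low-pass filter, whose cutoff frequency is f = 1/(2πRC). The nominal values and tolerances are assumed for the example, not taken from any particular design.

import math

def cutoff_hz(r_ohms, c_farads):
    """Cutoff frequency of a first-order RC low-pass filter."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

# Hypothetical nominal values and worst-case tolerances
r_nom, r_tol = 10e3, 0.05     # 10 kilohm, +/- 5%
c_nom, c_tol = 100e-9, 0.20   # 100 nF, +/- 20%

# Highest cutoff: all parameters at the extreme that raises frequency
f_high = cutoff_hz(r_nom * (1 - r_tol), c_nom * (1 - c_tol))
# Lowest cutoff: all parameters at the extreme that lowers frequency
f_low = cutoff_hz(r_nom * (1 + r_tol), c_nom * (1 + c_tol))

# Compare both extremes against the specified limits
print(f_low, cutoff_hz(r_nom, c_nom), f_high)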

Root-Sum-Squared. Root-Sum-Squared (RSS) analysis recognizes that it is rare for all parameters of a part to simultaneously drift to extreme values. While some variation is biased in a single direction, other changes vary randomly in direction, sometimes helping to compensate for bias variations and sometimes adding to the bias. As an example, the initial value of a capacitor will likely vary in a manner described by a normal curve whose mean is the nominal value. The extreme values of this distribution are ordinarily taken as the values at plus and minus three standard deviations from the mean (the points between which 99.7% of the values will lie). In RSS analysis, the extreme value of each random variation is squared, these values added, and the square root taken of the total. The resulting value is the maximum variation expected due to random factors. This is added to the bias variations to calculate the maximum and minimum worst cases. The process is illustrated in Table 11.

Table 11. RSS Calculation for a Capacitor (Example)
Parameter (Capacitance)            Bias Neg. (%)   Bias Pos. (%)   Random (%)
Initial Tolerance at 25°C               --              --             20
Low Temperature (-20°C)                 28              --             --
High Temperature (+80°C)                --              17             --
Other Environments (Hard Vacuum)        20              --             --
Radiation (10 KR, 10^13 N/cm²)          --              12             --
Aging                                   --              --             10
TOTAL VARIATION                         48              29      √[(20)² + (10)²] = 22.4

The worst case minimum value of capacitance would be the nominal value minus the negative bias variations, minus the random variation, or:

Worst case minimum = Nominal (1 - bias variation - random variation)
  = Nominal (1 - .48 - .224) = Nominal (1 - .704)

The worst case maximum value would be the nominal value plus the positive bias variation, plus the random variation, or:

Worst case maximum = Nominal (1 + bias variation + random variation)
  = Nominal (1 + .29 + .224) = Nominal (1 + .514)
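The arithmetic of Table 11 can be expressed in a few lines of code. The following Python sketch reproduces the RSS combination of the random variations and the worst case factors derived above; all percentages are taken from Table 11 and expressed as fractions of nominal.

import math

# Variations from Table 11, as fractions of the nominal capacitance
neg_bias = [0.28, 0.20]      # low temperature, hard vacuum
pos_bias = [0.17, 0.12]      # high temperature, radiation
random_vars = [0.20, 0.10]   # initial tolerance, aging

# Root-sum-squared combination of the random variations
rss = math.sqrt(sum(v ** 2 for v in random_vars))  # 0.224

worst_case_min = 1 - sum(neg_bias) - rss   # 1 - .48 - .224 = 0.296
worst_case_max = 1 + sum(pos_bias) + rss   # 1 + .29 + .224 = 1.514

print(rss, worst_case_min, worst_case_max)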

Monte Carlo. Monte Carlo analysis requires a probability density function for all variations in parameters. Through random selection, values are assigned to each part in the circuit and the output parameter computed. This is repeated many times and the distribution of the results represents the expected distribution of circuit performance in the field. The use of computers is extremely effective for performing the number of iterations required for a thorough Monte Carlo analysis.
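A minimal Monte Carlo sketch in Python is shown below, again using the hypothetical RC filter as the circuit under analysis; each tolerance band is assumed to represent plus and minus three standard deviations of a normal distribution.

import math
import random

def cutoff_hz(r_ohms, c_farads):
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)

# Assumed part distributions: tolerance band taken as +/- 3 sigma
r_nom, r_sigma = 10e3, (10e3 * 0.05) / 3
c_nom, c_sigma = 100e-9, (100e-9 * 0.20) / 3

samples = sorted(
    cutoff_hz(random.gauss(r_nom, r_sigma), random.gauss(c_nom, c_sigma))
    for _ in range(10_000)
)

# The sample distribution approximates the expected distribution
# of circuit performance in the field
print("median cutoff:", samples[len(samples) // 2])
print("0.1% / 99.9% points:", samples[10], samples[-10])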

3.9 Test Strategy

3.9.1 Purpose. Test strategy (a subset of Life Cycle Planning) is the established strategic plan for cost effectively performing test measurements that add value to a particular product for its customers. Test strategy typically includes all testing done on a product, but this discussion will be limited to reliability test strategy.

3.9.2 Benefits. A test strategy is intended to verify the achievement of product goals, determine shortcomings needing corrective action, and identify opportunities for improvement in an efficient and cost effective manner. A product-specific test strategy is needed to assure adequate confidence in reliability performance, and to avoid unnecessary expenses resulting from excessive scrap or non-value added tasks resulting from inappropriate tests. Measuring is not a trivial expense. On the other hand, should corrective action be needed, a timely measurement can make the difference between an economical fix and one that is expensive or impossible. Program budgets and schedules cannot ignore measurement costs. Hence, a test strategy must be an integral part of program planning, management, and competitive strategy.

3.9.3 Timing. Initial program planning during the Concept/Planning phase should include a test strategy. As the program progresses into Design/Development, changes in the program (e.g., a decision to develop an item rather than buy it off-the-shelf) should be reflected in changes to the test strategy. In turn, changes in the test strategy may need to result from changes in the product budget and schedule. A test strategy, then, is needed at the start of a project, and is subject to change with technical or business circumstances. Every product design or process review should include a conscious decision to review or revise the test strategy.

3.9.4 Application Guidelines. The specific measurements and the means for acquiring them depend on the circumstances of the program. The matrix of Table 12 relates program and product circumstances to their expected impact on the test measurement techniques discussed in this Blueprint.

A "plus" sign (+) indicates that the activity offers value to the program under that circumstance. A "minus" sign (-) means that the activity is probably not cost effective for that circumstance. A "question mark" (?) indicates that the activity may or may not add value for that circumstance, depending on the type of product. The circumstances considered are New Development (i.e., a product to be designed and built for the first time), COTS (an item available as a Commercial Off-the-Shelf product), Safety Critical (e.g., a nuclear plant control system), Dormancy (i.e., an item to be subjected to long periods in storage or otherwise unpowered), Long Life (an item likely to be in service for a relatively long time, such as the B-52 Aircraft), Harsh Environment (high shock,rapid thermal cycling, et. al.), and S/W (Software) Development. As with Life Cycle Planning, the user should not blindly follow this chart. He should decide whether he agrees with the relationships and what weights should be put on them. This requires some familiarity with the methods. Also, other considerations not on the matrix should be identified and considered. These might include suppliers' reputations, the leverage that the manufacturer has with suppliers, the relative importance of reliability and product cost as competitive factors, and the customer's expectations.

Table 12. Test Strategy Planning Matrix
Reliability Test Technique   New Dev.   COTS   Safety Critical   Dormancy   Long Life   Harsh Env.   S/W Dev.
Accelerated Life Tests ? - + + + - -
RDT/RQT + - + ? ? + +
RGT/TAAF + - ? ? ? + +
PRAT ? ? + ? ? ? ?

Before testing for reliability, the definition of failure should be understood. For example, is a computer crash that is corrected by a re-boot considered a failure? Questions like these should be resolved before the test, or resources will be wasted in arguments when "failures" occur. All intended functions of the product should be tested in some realistic way, and the measurement procedures (what is measured, how, when and how often, where, and what limits are considered acceptable) documented and made part of a test plan, if one exists. A realistic simulation of expected product use (e.g., message rate, signal levels, preventive maintenance schedule, etc.) should also be integrated into the test, where feasible.

3.10 Accelerated Life Tests

3.10.1 Purpose. Accelerated life tests are intended to measure the life of a component by testing it at high stress levels and extrapolating the results to normal operating conditions. Accelerated testing is most often done at the part level, or at the component level under appropriate conditions.

3.10.2 Benefit. Determining the life of highly reliable components under normal conditions often takes so long as to be impractical. Since reliability decreases under stress, testing such a component at high stress levels may provide results in a more reasonable period of time. If these results can be correlated to the life expected under normal conditions, it is possible to establish the life of the component within a practical test program.

3.10.3 Timing. Accelerated testing is best done on prototypes of a part or component early in the Design/Development product phase to establish its life characteristic before a significant number are produced, permitting any necessary changes to be made without a large amount of scrap. After a product enters the Production/Manufacturing phase, accelerated testing can be used for lot acceptance testing of parts and components.

3.10.4 Application Guidelines. Accelerated testing is based on the following assumptions:
  • The same failure mechanism will dominate the failures at normal and accelerated conditions. This assumption tends to limit the application to parts rather than assemblies. Accelerated testing of equipment has historically been unsuccessful because of difficulties in deriving correlation factors, most likely because this assumption is not usually valid at the product level.
  • The stress used will accelerate the action of the dominant failure mechanism so that component life will decrease as the stress increases. Obviously, the stress with greatest impact on component life should be the one used in accelerated testing, if possible.
  • The same shape probability distribution of failure will hold at both normal and accelerated conditions, except for a displacement in time. A part with a Weibull distribution of failures, for example, has a probability of failure of:

    F(t) = 1 - e^[-(t/θ)^β]

    To successfully derive an accelerated test, the same equation would apply under both normal and accelerated stress conditions, with the characteristic life (θ) changing with stress, but not the value of the shape parameter (β).
Deriving Correlation Factors. The first step in determining correlation factors is to collect life data under different stress conditions. Figure 16 shows a plot of cumulative failures against time for three different stress levels. The plots could represent Weibull plots, or plots on log-log graph paper.

Figure 16. Plot of Cumulative Failures vs. Time

If the plots at different stress conditions are parallel, as they are in Figure 16, the assumptions are satisfied, and a correlation factor can be formulated. The next step is to select a desired life measurement. This could be the median life (the time at which 50% of the population on test has failed), the tenth percentile (often used when a minimum operating life is the parameter of interest) or the characteristic life (θ) of a Weibull plot (the point at which 63% of the population has failed). The time at which the selected life measurement point appears on each graph is then plotted against the value of stress represented by the plot, as shown in Figure 17.

Figure 17. Plot of Selected Life Measurement Points

If the stress had been temperature and the Arrhenius model was assumed to describe the relationship of life to stress, Figure 17 could have been plotted on special Arrhenius graph paper (commercially available). Otherwise, a graph paper with a log-log scale could be used. The life expected under normal stress conditions can be read by extending the graph from the three measured points as shown in Figure 17. To derive a correlation factor relating life at normal stress to life at any accelerated stress level, the inverse power law can be used:

Life(At Normal Stress) / Life(At Accelerated Stress) = (Accelerated Stress / Normal Stress)^N

For example, if a product has a life of 1,000 hours at 100°C and a life of 10,000 hours at its normal operating temperature of 50°C, then:

10000 /1000 = (100 / 50)N
or
10 = 2N

Therefore: N ≈ 3.32

Once the value of N is derived, life tests can be run at any stress level and the results translated to life at normal stress (multiply the life observed at accelerated stress by the acceleration factor, i.e., the stress ratio raised to the power N).
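The worked example above can be reproduced with the short Python sketch below; the stress and life values are those of the example, and the inverse power law is assumed to hold over the extrapolated range.

import math

def ipl_exponent(life_normal, life_accel, stress_accel, stress_normal):
    """Solve Life(normal) / Life(accel) = (S_accel / S_normal)^N for N."""
    return (math.log(life_normal / life_accel)
            / math.log(stress_accel / stress_normal))

# Values from the worked example: 1,000 h at 100 deg C, 10,000 h at 50 deg C
n = ipl_exponent(10000.0, 1000.0, 100.0, 50.0)
print(n)                       # ~3.32

# Translate an accelerated test result back to normal stress
accel_factor = (100.0 / 50.0) ** n
print(1000.0 * accel_factor)   # ~10,000 hours at normal stress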

3.11 Reliability Demonstration Testing (RDT)/Reliability Qualification Testing (RQT)

3.11.1 Purpose. Performing dedicated tests to measure the achievement of a reliability goal/objective is called Reliability Demonstration Testing (RDT) or Reliability Qualification Testing (RQT). The latter title is often used to designate testing imposed by a customer. In either case, the purpose is to provide a specified degree of confidence that the desired reliability has been achieved.

3.11.2 Benefits. Requiring the product to pass a RDT means that products which pass the test may be considered to have achieved their specified reliability with an acceptable (and known) risk of error. This permits the manufacturer to market the product, or the customer to begin using it, with confidence that projected performance reliability and estimated replacement or repair costs will be met. The tests also provide data on reliability problems, which can help to correct a product which fails the test, or to improve a product which passes.

3.11.3 Timing. RDT/RQT should be run on the first product prototypes just prior to the Production/Manufacturing product phase, when the available hardware (and software) represents the intended final configuration, but there is still time to correct problems before full-scale production is initiated.

3.11.4 Application Guidelines. In planning RDT/RQT, consideration must be given to assuring that the configuration of the products on test are representative of the production units; establishing appropriate accept/reject criteria based on acceptable risks of error; and providing a test environment that simulates the expected operating conditions.

Selecting the Test Sample. The units on test are intended to be a sample of the production units and should, as far as possible, be identical to production units. The test samples should be subjected to the same assembly processes, acceptance tests and ESS that the production units were exposed to. The RDT/RQT test items should also be randomly selected from the available units to minimize the effects of unknown factors affecting reliability (such as differences in personnel skills, tool parameter drift, etc.).

Establishing Accept/Reject Criteria. In general, a test time and an acceptable number of failures is established. The product passes the test when the established test time elapses before the acceptable number of failures has been exceeded. Parts, especially mechanical parts, can be tested for their characteristic life using a Weibull failure distribution model, while systems are most often tested for mean time between failures (MTBF) using an exponential distribution of failures (i.e., a constant failure rate is assumed). The procedure for creating test criteria will be illustrated using the simplest method, zero failure testing, with the caveat that more complex methods are usually required and that there are compendiums of test plans available which generally make it unnecessary to manually perform the calculations.

For a test where zero failures are expected, suppose that a manufacturer wants to be 90% sure of rejecting a part with a given characteristic life that is considered too low (i.e., a 90% confidence in the test, or, in other words, a 10% risk that a "bad" part would pass the test). The following reliability model can be used:

R = e^[-(t/θ)^β]

Reliability (R) is the probability of no failure of a part operated for a time (t), assuming the part exhibits a Weibull distribution of failures with a characteristic life of (θ). When several parts are tested for the same period of time the model becomes:

R = e^[-n(t/θ)^β]

where (n) is the number of parts on test. Assuming the value of the Weibull shape parameter (β) is known, the value of (R) can be set equal to 0.10 and the equation solved for (t). This gives a value of test time for which (n) parts with the undesired characteristic life (θ) have a probability of 10% of having no failures. Hence, the "bad" parts would be accepted in only 10% of the tests. This test is said to have a 10% "consumer's risk," because the consumer has only a 10% chance of accepting the bad parts (undesired characteristic life). Parts having higher values of (θ) are more likely to pass the test and parts with lower values are less likely to pass. For example, if 100 parts are tested and the estimated value of (β) is 2.0:

R = e^[-100(t/θ)²]

To reject parts with an undesired characteristic life of (θ), i.e., the "bad" parts, and a consumer's risk of 10%, the following equation would be set up and solved for (t):

.10 = e^[-100(t/θ)²]

which gives:

      ln(.10) = -100(t/θ)² ; -2.3 = -100(t/θ)² ; (t/θ)² = 2.3/100 = .023 ;

      (t/θ) = √.023 = .15 ; t = .15(θ)

Thus, each part should be tested for a time (t) equal to .15(θ). If there are no failures, the parts are accepted.

When the value of (β) is one (i.e., the parts exhibit a constant failure rate), the same procedure can be used to determine a test providing a 10% consumer's risk of accepting an undesired mean time between failure for a product. Because the failure rate does not change with time, it is unnecessary for each unit on test to be tested for the same amount of time. The equation then becomes:

R = e^(-t/MTBF)

where (t) is the cumulative operating time of all units on test and (MTBF) is the mean time between failure that is considered unacceptable.
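A short Python sketch of both zero-failure calculations follows. The parameter values simply repeat the examples above (100 parts, β = 2.0, 10% consumer's risk); the MTBF in the exponential case is a hypothetical value, and the function returns the required cumulative test time.

import math

def weibull_zero_failure_time(theta_bad, beta, n_units, consumer_risk=0.10):
    """Per-unit test time for a zero-failure Weibull test:
    solve consumer_risk = exp(-n * (t / theta)^beta) for t."""
    return theta_bad * (-math.log(consumer_risk) / n_units) ** (1.0 / beta)

# Worked example above: 100 parts, beta = 2.0 -> t = .15 theta
print(weibull_zero_failure_time(theta_bad=1.0, beta=2.0, n_units=100))

def exponential_zero_failure_time(mtbf_bad, consumer_risk=0.10):
    """Total cumulative test time (all units combined) for the
    constant failure rate case: t = -ln(risk) * MTBF."""
    return -math.log(consumer_risk) * mtbf_bad

# Hypothetical undesired MTBF of 1,000 hours -> ~2,303 cumulative test hours
print(exponential_zero_failure_time(mtbf_bad=1000.0))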

It is possible that many satisfactory values of life (or MTBF) will have a high probability of rejection, which obviously hurts the producer, and is not in the best interest of the consumer. In the worst case, the test may require the product to have an unrealistically high reliability before it has a reasonable chance of passing. For this reason, tests have been devised which consider both the consumer's risk and a "producer's risk", defined as the probability that the test will reject a given value of life or MTBF that is considered good. To balance the consumer's and producer's risks, it is typically necessary to devise tests permitting a number of failures, and requiring much more test time than a zero failure test.

Balancing the Consumer's and Producer's Risks. Devising a test with both the consumer's and producer's risks at acceptable levels is an iterative process, as illustrated in Figure 18.

Figure 18. Devising an RDT/RQT Test

To illustrate the application to reliability testing, assume a constant failure rate applies. The expected number of events (n) then equals (1/MTBF)(t) where (1/MTBF) is the failure rate of one unit and (t) is the cumulative operating time of all units on test. The probability of passing the test is simply the probability of having the allowable number of failures or less, which is calculated using the cumulative Poisson distribution. Table 13 provides an overview of these calculations.

Table 13. Calculation of Consumer Risk, Test Length and Producer's Risk
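As a sketch of these calculations, the Python fragment below computes the probability of passing a fixed-length test from the cumulative Poisson distribution. The MTBF and test-length values are hypothetical, chosen to match the 10%/20%, 2.0 discrimination ratio plan of Table 14 below (12.4 multiples of θ1, accept on 9 or fewer failures).

import math

def prob_accept(true_mtbf, total_test_time, max_failures):
    """Probability of passing a fixed-length test: the cumulative
    Poisson probability of max_failures or fewer, with the expected
    number of failures = t / MTBF."""
    mean = total_test_time / true_mtbf
    return sum(math.exp(-mean) * mean ** k / math.factorial(k)
               for k in range(max_failures + 1))

# Hypothetical plan: theta0 = 2,000 h (good), theta1 = 1,000 h (bad),
# test length = 12.4 * theta1, accept on 9 or fewer failures
print(prob_accept(2000.0, 12400.0, 9))  # ~0.90 -> producer's risk ~10%
print(prob_accept(1000.0, 12400.0, 9))  # ~0.21 -> consumer's risk ~20%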

Fixed-Length Tests. A number of reliability tests based on different risks and different discrimination ratios (i.e., the ratio of the "good" MTBF to the "bad" MTBF) are compiled in Table 14 and are based on an assumption of constant failure rate. It should be noted that some references use (θ0) for the unacceptable MTBF and (θ1) for the acceptable MTBF.

Table 14. Fixed Length Test Plans
Producer's Risk   Consumer's Risk   Discrimination Ratio (θ0/θ1)   Test Duration (Multiples of θ1)   Reject (Equal or More Failures)   Accept (Equal or Fewer Failures)
10% 10% 1.5 45.0 37 36
10% 20% 1.5 29.9 26 25
20% 20% 1.5 21.5 18 17
10% 10% 2.0 18.8 14 13
10% 20% 2.0 12.4 10 9
20% 20% 2.0 7.8 6 5
10% 10% 3.0 9.3 6 5
10% 20% 3.0 5.4 4 3
20% 20% 3.0 4.3 3 2
30% 30% 1.5 8.0 7 6
30% 30% 2.0 3.7 3 2
30% 30% 3.0 1.1 1 0

Sequential Tests. For low risks and/or small discrimination ratios, the test time indicated in Table 14 may be excessively long. To reduce the time needed for a decision, more products can be added to the test, or, where applicable, accelerated test methods may be employed. Another solution is the sequential test. Using this test, it is not necessary to wait for the allowable number of failures to occur or the scheduled test time to elapse before making a decision. Instead, if a combination of failures and elapsed time is more likely to occur when the test units have the unacceptable MTBF, compared to those having the acceptable MTBF, a reject decision can be made. Where a combination of failures and elapsed time is more likely to occur when the test units have the acceptable MTBF, as opposed to the unacceptable MTBF, an accept decision is made. Where neither condition is satisfied, the test continues until an arbitrary truncation point is reached (used to assure it will never run significantly longer than a fixed time test based on the same risks and discrimination ratio.) Figure 19 illustrates a sequential test. Failures are plotted against time, and when the plot "escapes" the continue test region, a decision is made to accept or reject the product, as appropriate. A number of sequential tests are summarized in Table 15.

Figure 19. Typical Sequential Test

Table 15. Sequential Test Plans
Producer's Risk   Consumer's Risk   Discrimination Ratio (θ0/θ1)   Min   Exp.(1)   Max(2)
(Times to an accept decision, expressed in multiples of θ1)
10% 10% 1.5 6.6 25.95 49.5
20% 20% 1.5 4.19 11.4 21.9
10% 10% 2.0 4.40 10.2 20.6
20% 20% 2.0 2.80 4.8 9.74
10% 10% 3.0 3.75 6.0 10.35
20% 20% 3.0 2.67 3.42 4.5
30% 30% 1.5 3.15 5.1 6.8
30% 30% 2.0 1.72 2.6 4.5
Notes:
  1. Expected test time assumes the true MTBF is equal to θ0 (the acceptable MTBF)
  2. Arbitrary truncation point
Test Environment. It is recommended that products being tested be subjected to an environment that represents the expected use environment as closely as possible. Considerations include:
  • Ambient temperature (normal and extreme)
  • Temperature cycling (limits and rate of change)
  • Hot/cold soaks
  • Power cycling (on-off cycles)
  • Power fluctuations
  • Vibration (sine, swept or random; levels; duration; axis)
  • Mechanical shock (waveform; levels; duration; axis)
  • Special conditions as applicable (e.g., high/low humidity, air pressure, corrosive fumes, salt spray, radiation, etc.)
Corrective Action. When planning an RDT/RQT, another factor to consider is what happens if the product fails. Obviously, the reliability problems identified must be fixed, but there are other questions, such as, does production stop until the problems are fixed, and is a retest required? All cognizant personnel should know what needs to be done when a failure occurs.

3.12 Reliability Growth Testing (RGT)/Test, Analyze and Fix (TAAF)

3.12.1 Purpose. A test conducted specifically to measure improvements in reliability by finding and fixing deficiencies is called a Reliability Growth Test (RGT), which is the basis of a Test, Analyze and Fix (TAAF) program. A growth test also provides an estimate of the current product reliability.

3.12.2 Benefit. RGT/TAAF can be used to prevent reliability problems on new products, and to improve existing products with inadequate reliability. Dedicated reliability growth tests can prevent the delivery of unsatisfactory products to the customer, saving repair/replacement costs and customer dissatisfaction. They differ from demonstration and qualification tests in that their purpose is to gather information rather than to provide confidence about reliability.

3.12.3 Timing. Growth tests require prototype samples to test and time to formulate and implement changes based on the test results, so they should be considered in the latter stages of Design/Development. They should precede any qualification tests, which, if performed, should serve to demonstrate that the growth program was satisfactory. When reliability is inadequate, and a growth test has not preceded a demonstration test, the demonstration test becomes a growth test, as corrective action is taken to improve products which failed the demonstration tests. This unplanned improvement program is usually expensive in terms of time and money. Many manufacturers perform growth testing in lieu of demonstration testing, letting the measurements from the growth test provide assurance in the achievement of adequate reliability. This is possible, but can be counterproductive if the philosophy changes from gathering information (where failures are welcome and will be analyzed for cause) to selling the product (where failures are unwelcome and may be ignored).

3.12.4 Application Guidelines. It is expected that testing a product will result in failures and that finding and correcting the cause of each failure will result in improved reliability. The question of how long a growth test is required to meet a desired reliability goal is addressed by reliability growth theories. The two most widely implemented methodologies are the Duane and the AMSAA growth models.

Duane Model. The first theory of reliability growth was developed by James T. Duane, who noted that the reliability of products in development tests, as measured by failure rate, plotted as a straight line against cumulative test time (the total test time obtained by adding the time on all units) on log-log paper. The characteristics of the cumulative and instantaneous failure rates of the Duane model are presented in Table 16.

Table 16. Duane Model - Cumulative and Instantaneous Failure Rates
Characteristics General Form Example
Cumulative Failure Rate
• Includes effects of all failures, including those whose root cause has been eliminated through corrective action implementation and verification

• Pessimistic indicator of the current product failure rate.
λcum = KT^(-α)

where,

α = Growth Rate
K = Initial Failure Rate
T = Test Time
Assume the initial failure rate (K) is 0.01 failures per hour, the growth rate (α) is equal to 0.5, and the elapsed test time (T) is 1,000 hours.

The cumulative failure rate at 1,000 hours is:

λcum = (.01)(1000)^(-0.5)
             = (.01)(.0316)
             ≈ .0003 failures per hour
Instantaneous Failure Rate
• Represents the failure rate expected at a particular time

• Defined as the rate of change of the number of failures as a function of time
λinst = K(1 - α)T^(-α)

where,

α = Growth Rate
K = Initial Failure Rate
T = Test Time
Assume the initial failure rate (K) is 0.01 failures per hour, the growth rate (α) is equal to 0.5, and the elapsed test time (T) is 1,000 hours.

The instantaneous failure rate at 1,000 hours is:

λinst = (.01)(1 - 0.5)(1000)^(-0.5)
            = (.01)(.5)(.0316)
            ≈ .00016 failures per hour

Instantaneous failure rate plotted against cumulative test time is also a straight line on log-log paper, parallel to the cumulative failure rate plot. An example is shown in Figure 20.

Figure 20. Example Duane Growth Plot

To predict how long a growth test is required to achieve a desired failure rate (or MTBF), the plot of the instantaneous failure rate (or MTBF) can be extended until it intersects the desired value, with the corresponding cumulative test time read from the x-axis. Alternatively, the data points can be fitted to a straight line and the intersection point calculated. The equations required to apply this methodology are shown in Table 17.

Table 17. Equations for Calculating Duane Growth Parameters of Reliability
Y = C1 + C2X (equation for a straight line), where,
Y = log of the cumulative failure rate
C1 = log of K (initial failure rate)
C2 = -α (slope of line)
X = log of the cumulative test time
C2 = [∑(Xi·Yi) - (∑Xi)(∑Yi)/n] / [∑Xi² - n(X̄)²] (equation to compute the slope), where,
Xi = log of the cumulative test time at each failure
Yi = log of the cumulative failure rate at Xi
n = number of recorded failures
X̄ = mean value of the Xi
C1 = Ȳ - C2X̄ (equation to compute the intercept), where,
Ȳ = mean value of the Yi
X̄ = mean value of the Xi

The three equations define a line fitting the data with least square deviation from the data points. Table 18 shows the calculation of the needed parameters from a sample set of data.

Table 18. Calculation of Duane Reliability Growth Parameters
Cumulative Failure Count   Cumulative Test Time   Log Test Time (X)   X²   λ   Log λ (Y)   X·Y
1 1 0 0 1.0 0 0
2 4 .60206 .36248 0.500 -.30102 -.1812
3 8 .90309 .81557 0.375 -.42597 -.3847
4 13 1.1139 1.2408 0.308 -.51145 -.5697
5 20 1.3010 1.6926 0.250 -.60206 -.7833
6 30 1.4771 2.1819 0.200 -.69897 -1.032
7 42 1.6232 2.6348 0.167 -.77728 -1.262
8 57 1.7559 3.0832 0.140 -.85387 -1.499
9 78 1.8921 3.5797 0.115 -.93930 -1.777
10 104 2.0170 4.0683 0.0962 -1.0168 -2.051
11 136 2.1335 4.5518 0.0809 -1.0921 -2.330
12 177 2.2480 5.0535 0.0678 -1.1688 -2.627
13 228 2.3579 5.5597 0.0570 -1.2441 -2.933
14 292 2.4654 6.0782 0.0479 -1.3197 -3.253
15 372 2.5705 6.6075 0.0403 -1.3947 -3.585
16 473 2.6749 7.1551 0.0338 -1.4711 -3.935
17 599 2.7774 7.7140 0.0284 -1.5467 -4.296
18 757 2.8791 8.2892 0.0238 -1.6234 -4.674
19 956 2.9805 8.8834 0.0199 -1.7011 -5.070
20 1205 3.0810 9.4926 0.0166 -1.7799 -5.484
21 1518 3.1813 10.121 0.0138 -1.8601 -5.918
22 1879 3.2739 10.718 0.0117 -1.9318 -6.325
23 2262 3.3545 11.253 0.0102 -1.9914 -6.680
24 2668 3.4262 11.739 0.00899 -2.0462 -7.011
25 3099 3.4912 12.188 0.00807 -2.0931 -7.307

      ∑X = 55.58

      ∑Y = -30.39

      X̄ = (∑X) / 25 = 2.223

      Ȳ = (∑Y) / 25 = -1.2156

      ∑X² = 145.1

      ∑(X·Y) = -80.97

Using the values from Table 18,
      C2 = -α = [ -80.97 - ( (55.58)(-30.39) / 25 ) ] / [ 145.1 - 25(2.223)² ] = -0.62, or α = 0.62

      C1 = log(K) = -1.2156 - (-.62)(2.223) = 0.16, or K = 10^0.16 = 1.45

At the last failure, 3,099 hours of test time had been accumulated. At that time:
      λcum = (1.45)(3099)^(-.62) = .0099
      λinst = (1.45)(1 - .62)(3099)^(-.62) = .00377
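The least squares fit of Tables 17 and 18 can be verified with a few lines of Python; the failure times are those of Table 18, and no assumptions are needed beyond the equations above.

import math

# Cumulative test time at each failure (Table 18)
times = [1, 4, 8, 13, 20, 30, 42, 57, 78, 104, 136, 177, 228, 292, 372,
         473, 599, 757, 956, 1205, 1518, 1879, 2262, 2668, 3099]

n = len(times)
xs = [math.log10(t) for t in times]                          # X = log test time
ys = [math.log10((i + 1) / t) for i, t in enumerate(times)]  # Y = log cum. failure rate

x_bar, y_bar = sum(xs) / n, sum(ys) / n
c2 = ((sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n)
      / (sum(x * x for x in xs) - n * x_bar ** 2))
c1 = y_bar - c2 * x_bar

alpha = -c2        # ~0.62
k = 10.0 ** c1     # ~1.45
print(alpha, k)
print(k * times[-1] ** (-alpha))                # cumulative rate, ~.0099
print(k * (1 - alpha) * times[-1] ** (-alpha))  # instantaneous rate, ~.0038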

Planning the length of a growth test before data is available requires the estimation of (K) and (α). These are best obtained from the experience of the manufacturer in past growth programs. Historically, (α) has ranged upward from about 0.3, with 0.6 being a reasonable estimate of the maximum growth that could be realistically expected. The value of (K) has been observed to be as low as 10% of predicted reliability, but this does not account for current technology, such as computer aided design techniques, which effectively start the growth process when the product has only a conceptual existence.

Duane plots can be made using MTBF rather than failure rate as the parameter of interest. Since MTBF = 1/(λ), the log of the reciprocal of the failure rate is used for the Y-axis, and the plot goes up with time (slope is positive).

AMSAA Growth Model. Dr. Larry Crow, while at the U.S. Army Materiel Systems Analysis Activity (AMSAA), modeled growth as a non-homogeneous Poisson process with the equations given in Table 19.

Table 19. AMSAA Growth Model Characteristics
General Form Example
Cumulative Failure Rate
λcum = λT^(β-1)

where,

β = Growth Rate
λ = Initial Failure Rate
T = Test Time
Assume the initial failure rate (λ) is 0.01 failures per hour, the growth rate (β) is equal to 0.5, and the elapsed test time is 1,000 hours.

The cumulative failure rate at 1,000 hours is:

λcum = (.01)(1000)^(.5-1) = (.01)(1000)^(-.5)
            ≈ .0003 failures per hour
Instantaneous Failure Rate
λinst = λβT^(β-1)

where,

β = Growth Rate
λ = Initial Failure Rate
T = Test Time
Assume the initial failure rate (λ) is 0.01 failures per hour, the growth rate (β) is equal to 0.5, and the elapsed test time is 1,000 hours.

The instantaneous failure rate at 1,000 hours is:

λinst = (.01)(.5)(1000)^(.5-1)
           = (.01)(.5)(.0316)
           ≈ .00016 failures per hour
The parameters (λ) and (β) are estimated from the maximum likelihood formulas:
β = N / ∑ ln(T/Xi), with the sum taken over the first N-1 failures (equation to compute the slope), where,

N = number of recorded failures
T = total test time (the time of the last failure in a failure-truncated test)
Xi = time at which an individual failure occurs
λ = N / T^β (equation to compute the intercept), where,

N = number of recorded failures
T = total test time
β = computed slope

Given these two parameters, the instantaneous failure rate equation can be used to estimate the time required to achieve a given failure rate. The AMSAA model also plots as a straight line on log-log paper for both cumulative and instantaneous failure rates.

Table 20 shows calculations of (λ) and (β) using the same set of data that was used in the discussion of the Duane model (Table 18).

Table 20. Calculation of AMSAA (λ) and (β) Parameters
Cumulative Failure Count   Cumulative Test Time (Xi)   Xn/Xi   ln(Xn/Xi)
1 1 3099 8.0388
2 4 774.75 6.6525
3 8 387.38 5.9594
4 13 238.38 5.4739
5 20 154.95 5.0431
6 30 103.30 4.6376
7 42 73.786 4.3012
8 57 54.368 3.9958
9 78 39.731 3.6821
10 104 29.798 3.3944
11 136 22.787 3.1262
12 177 17.508 2.8627
13 228 13.592 2.6095
14 292 10.613 2.3621
15 372 8.3306 2.1199
16 473 6.5518 1.8797
17 599 5.1736 1.6436
18 757 4.0938 1.4095
19 956 3.2416 1.1761
20 1205 2.5718 0.94460
21 1518 2.0415 0.71369
22 1879 1.6493 0.50034
23 2262 1.3700 0.31483
24 2668 1.1615 0.14975
25 3099  
      ∑ ln(Xn/Xi) = 72.99
Using the values from Table 20:
      β = 25 / 72.99 = .34 ; λ = 25 / (3099)^.34 = 1.625

At the end of the test
      λcum = (1.625)(3099)^(.34-1) = .008066 failures/hour

      λinst = (.34)(1.625)(3099)^(.34-1) = .0027 failures/hour
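These maximum likelihood estimates can likewise be checked in a few lines of Python, using the same failure times as Tables 18 and 20 (a failure-truncated test, so the sum runs over the first N-1 failures).

import math

# Cumulative test time at each failure (Tables 18 and 20)
times = [1, 4, 8, 13, 20, 30, 42, 57, 78, 104, 136, 177, 228, 292, 372,
         473, 599, 757, 956, 1205, 1518, 1879, 2262, 2668, 3099]

n = len(times)
t_end = times[-1]  # test ends at the last failure

# Maximum likelihood estimates used in Table 20
beta = n / sum(math.log(t_end / x) for x in times[:-1])  # ~0.34
lam = n / t_end ** beta                                  # ~1.625

print(beta, lam)
print(lam * t_end ** (beta - 1))         # cumulative rate, ~.0081
print(lam * beta * t_end ** (beta - 1))  # instantaneous rate, ~.0027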


3.13 Production Reliability Acceptance Test

3.13.1 Purpose. A Production Reliability Acceptance Test (PRAT) is performed to measure any degradation in the reliability of a product over the course of production or to assure that products being delivered meet customers' reliability requirements and/or expectations.

3.13.2 Benefits. When a product is available in the marketplace, any delay in finding a solution to a reliability problem results in a proportionate number of dissatisfied customers. This is usually costly, and can often be disastrous. Companies have gone out of business because products were sold with a serious undiscovered and unsolved reliability problem that became evident during customer use. PRAT is intended to minimize the impact of production reliability problems by providing timely warning and the data needed for corrective action.

3.13.3 Timing. PRAT only takes place during the Production/Manufacturing phase of the product life cycle. Depending on the method used, PRAT can be periodic or continuous during production, and can be done on a sample basis (for high volume production) or at 100% (for low volume, complex, expensive products).

3.13.4 Application Guidelines. There are at least four different approaches to testing during production, each with certain advantages and disadvantages.

Periodic Repetition of the RQT. The simplest approach to PRAT, assuming that a Reliability Qualification Test (RQT) has been performed, is to repeat the RQT at intervals during the production run. Advantages include the use of a familiar test procedure which the product is known to have already passed. A variation is to use an RQT with higher producer/consumer risk percentages, which has the advantage of shorter test times. The disadvantage of this approach is somewhat subtle: the repetition of a test increases the risk of failure. For example, assume a product has an inherent mean time between failures (MTBF) that gives it a 90% probability of passing a defined RQT. If two tests are scheduled during production, the probability that it will pass both, without any change in its MTBF, is (.90)², or 81%. If six tests are performed over a long production run, the probability that the product will pass all of the tests would be (.90)⁶, or 53%. Thus, a product which would pass an RQT 90% of the time would have only an even chance of passing six PRATs, with no real change in its inherent MTBF.

The All-Equipment Production Reliability Acceptance Test. As the name implies, every production unit is subjected to a specified number of hours on test, with the aggregate test time and number of failures used to determine rejection or continued acceptance of the product for shipment. The accept-reject criteria are a modification of the sequential test method described for RDT/RQT. The all-equipment test derived from one sequential test type is illustrated in Figure 21.

Figure 21. Sample Sequential Test Accept-Reject Criteria

During the test, time and failures are plotted. As long as the plot remains within the accept and continue test region, the product is considered acceptable for shipment to customers. Should the plot enter the reject region, shipments would typically be stopped until the reliability problem is rectified. The plot is not allowed to enter the region below the boundary line (if the plot contacts the boundary line, it stays in place until the next failure occurs). Thus, the plot is never farther away from rejection than the distance between the boundary and reject lines. This is to provide fast response to the appearance of a reliability problem.

One problem with the all-equipment test is that the probability of failure is a function of the total test time. If the total test time is short relative to the desired MTBF, the test is easy to pass and even poor values of MTBF will pass too often. If the total test time is long, even acceptable values of MTBF will be rejected. For the test of Figure 21, the probability of acceptance for various test lengths is shown in Figure 22.

Figure 22. Probability of Acceptance for Various Test Lengths


Bayesian Reliability Testing. Bayesian reliability tests are based on the premise that product data available before testing should be used in conjunction with the test data to decide acceptability. The advantages of such tests are that favorable data before the test (the prior) permit a shorter test than less favorable data, and that a new prior is computed after each test based on the test results. Disadvantages are that, despite academic interest, there have been few practical applications of Bayesian reliability tests, and there are no standard references.

Statistical Process Control. Statistical process control (SPC) has long been used as a means for controlling critical parameters of a product during manufacture. Reliability can be handled by SPC as well as any other parameter.

The theory behind SPC is that measurements on samples of a population will follow a normal distribution whose mean is equal to the mean of the population. As a consequence, sample measurements from a stable process will vary randomly about the mean, with 68% of all measurements within plus or minus one standard deviation from the mean, 95% within plus or minus two standard deviations, and 99.7% within plus or minus three standard deviations. Only three measurements in a thousand are expected to exceed plus or minus three standard deviations, so any measurement outside this range is considered an indication that some change has occurred, and investigation is warranted (plus or minus four sigma is often used in the automotive industry). Also, non-random patterns (e.g., six consecutive samples measuring above the mean) are considered evidence of a change in the process. Figure 23 illustrates a generic SPC chart.

Figure 23. Generic SPC Chart

To apply SPC to PRAT, a parameter must be selected that represents reliability, and a periodic measurement made on a sample of the product. One of the most common measures is defects per unit, but failure rate per hour could also be measured. To do this, a sample of the product would be operated and the number of failures found would be divided by the total hours accrued among the units in the sample. The standard deviation of the sample can be shown to be the square root of the expected population failure rate divided by the square root of the number of units in the sample. Using plus or minus three standard deviations to define the control limits, we have:

      Upper Control Limit = u + (3√u) / √n ;
      Lower Control Limit = u - (3√u) / √n

where,
      u = mean failure rate
      n = number of units in sample

If the lower control limit is calculated as a negative number, it is set equal to zero, since negative failure rates are meaningless (and a measured value below the lower limit would in any case represent better product performance). It is not necessary to have the same sample size for each measurement, but unequal sample sizes mean the control limits will change for each measurement.

To illustrate, assume monthly tests are run to measure the failure rate of a product whose expected failure rate is 0.5 failures per thousand hours. Some results are shown in Table 21.

Table 21. Sample Test Results
Month   Failures   Total Hours (1,000 Hrs.)   Failure Rate
May 4 10 .40
June 4 7 .57
July 6 8 .75
August 3 7 .43

Since the sample sizes are not equal, separate control limits have to be computed for each month. For example, the control limits for May are:

UCL = .5 + (3√.5) / √10 = 1.17 ;
LCL = .5 - (3√.5) / √10 = -.17 (set to 0)

The control limits for the remaining data are computed similarly and are shown in Table 22.

Table 22. Calculated Control Limits
Month Sample Size UCL LCL
May 10 1.17 0
June 7 1.30 0
July 8 1.25 0
August 7 1.30 0
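A minimal Python sketch of these control limit calculations follows; it reproduces Table 22, with (n) taken from the Sample Size column and a negative lower limit set to zero as described above.

import math

u = 0.5  # expected failure rate (failures per thousand hours)

def control_limits(u, n):
    """Three-sigma control limits for a failure rate chart."""
    delta = 3.0 * math.sqrt(u) / math.sqrt(n)
    return max(u - delta, 0.0), u + delta  # negative LCL is set to zero

# Sample sizes from Table 22
for month, n in [("May", 10), ("June", 7), ("July", 8), ("August", 7)]:
    lcl, ucl = control_limits(u, n)
    print(month, round(ucl, 2), round(lcl, 2))  # matches Table 22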

The control chart based on this data is shown in Figure 24, which shows no evidence of cause for concern (i.e., no special cause or assignable cause failures).

Figure 24. Process Control Chart

The SPC approach does not have the statistical complexity of other methods for production reliability measurement. However, when using it to measure failure rates, it may have a serious disadvantage. This is because the sample size should be large enough so that any significant deviation from expected reliability will cause at least one failure in the sample population. For products with very low failure rates (very high MTBF), it may be impractical to use a large enough sample size. There is, however, no problem in tracking measures such as defects per unit for large items produced in large numbers, such as automobiles or aircraft.



SECTION FOUR - REFERENCES

The references in Table 23 provide additional information on the subjects discussed in this Blueprint. The relationship between each reference and the sections within the Blueprint is indicated in the table.

Excerpt from "Table 23. References for Measuring Product Reliability Blueprint Sections"

Table 23. References for Measuring Product Reliability Blueprint Sections
See Full Version
START 2002-4
Statistical Confidence
Journal Article V9, N3
Statistical Analysis of Reliability Data, Part 2: On Estimation and Testing
START 2003-7
Reliability Estimations for the Exponential Life
START 2002-2
Statistical Assumptions of an Exponential Distribution
Journal Article V14, N4
Electronic Component Failure Rate Prediction
START 96-1
Creating Robust Designs
Journal Article V14, N3
Developing Highly Reliable and Safe Devices
Journal Article V13, N2
Risk Management and Reliability
START 00-3
Environmental Stress Screening
START 96-1
Creating Robust Designs
START 99-4
Accelerated Testing
Journal Article V11, N3
A Beginners Guide to HALT
START 96-1
Creating Robust Designs
Journal Article V14, N4
Electronic Component Failure Rate Prediction
START 97-2
Electronic Reliability Prediction
Journal Article V14, N3
Developing Highly Reliable and Safe Devices
START 2003-8
Use of Bayesian Techniques for Reliability
START 2005-2
Understanding Binomial Sequential Testing
Journal Article V8, N4
Tutorial: Test Risks, Confidence and OC Curves
START 2005-1
Operating Characteristic (OC) Functions and Acceptance Sampling Plans
START 2004-3
Censored Data
Journal Article V12, N1
Multivariable Testing (MVT)
START 2002-4
Statistical Confidence
START 2003-7
Reliability Estimations for the Exponential Life
START 2002-2
Statistical Assumptions of an Exponential Distribution
START 2002-6
Empirical Assessment of Normal and Lognormal Distribution Assumptions
START 2003-3
Empirical Assessment of Weibull Distribution
START 2002-5
Graphical Comparisons of Two Populations
Journal Article V9, N4
Statistical Analysis of Reliability Data, Part 3: On Statistical Modeling of Reliability Data
Journal Article V9, N3
Statistical Analysis of Reliability Data, Part 2: On Estimation and Testing
START 2003-5
Anderson-Darling: A Goodness of Fit Test for Small Samples Assumptions
START 2003-6
Kolmogorov-Smirnov: A Goodness of Fit Test for Small Samples
START 2003-4
The Chi-Square: a Large-Sample Goodness of Fit Test
Journal Article V10, N2
Statistics - A Reliability Engineer's Tool, Not Reliability Engineering
Journal Article V7, N3
Reliability Prediction Methods
Journal Article V6, N3
Reliability Growth
START 99-3
Reliability Growth
START 97-2
Electronic Reliability Prediction
START 2005-2
Understanding Binomial Sequential Testing
Journal Article V8, N4
Tutorial: Test Risks, Confidence and OC Curves
START 2002-4
Statistical Confidence
START 2005-1
Operating Characteristic (OC) Functions and Acceptance Sampling Plans
Journal Article V9, N3
Statistical Analysis of Reliability Data, Part 2: On Estimation and Testing
START 2003-3
Empirical Assessment of Weibull Distribution
Journal Article V11, N2
Reliability Testing of Printed Wiring Boards with Interconnect Stress Testing Technology (IST)
Journal Article V11, N4
Applying RCM Analysis to EA-6B Corrosion Failure Modes
Journal Article V9, N3
Markov vs. FTA
Journal Article V14, N4
Electronic Component Failure Rate Prediction
Journal Article V14, N1
Information Management for Systems Design for RMQSI
Journal Article V7, N4
Engineering Information Assurance into Information Systems
Journal Article V12, N4
Improving Mission Performance & Reducing Total Ownership Cost
START 2004-2
The RMQSI Case - A Reasoned, Auditable Argument Supporting the Contention that a System Satisfies...
Journal Article V11, N2
Methods for Reducing the Cost to Maintain a Fleet of Repairable System
START 00-1
Sustained Maintenance Planning
Journal Article V9, N3
Markov vs. FTA
Journal Article V11, N3
Markov Analysis
Journal Article V12, N1
Petri Nets: An Alternative to Markov Chains
START 2003-2
The Applicability of Markov Analysis Methods
Journal Article V12, N3
Hazardous Events
Journal Article V11, N4
Biomedical Survival Analysis vs. Reliability: Comparison, Crossover, and Advances
Journal Article V7, N3
Reliability Prediction Methods
START 2003-7
Reliability Estimations for the Exponential Life
Journal Article V13, N2
Practical Considerations in Calculating Reliability of Fielded Products
Journal Article V14, N4
Electronic Component Failure Rate Prediction
START 2004-6
Availability
START 2004-3
Censored Data
START 2002-4
Statistical Confidence
START 2004-1
Combining Data
Journal Article V9, N4
Statistical Analysis of Reliability Data, Part 3: On Statistical Modeling of Reliability Data
START 97-2
Electronic Reliability Prediction
START 2004-5
Understanding Series and Parallel Systems Reliability
Journal Article V14, N1
PROTOCOL to Provide AMRDEC a New Reliability Environment
START 96-1
Creating Robust Designs
Journal Article V11, N2
Reliability Testing of Printed Wiring Boards with Interconnect Stress Testing Technology (IST)
Journal Article V12, N4
Random Vibration & Mechanical Shock Excite All Resonances
START 00-3
Environmental Stress Screening
START 2004-7
Derating
START 2002-6
Empirical Assessment of Normal and Lognormal Distribution Assumptions
START 2003-3
Empirical Assessment of Weibull Distribution
START 2002-2
Statistical Assumptions of an Exponential Distribution
START 2003-5
Anderson-Darling: A Goodness of Fit Test for Small Samples Assumptions
START 2003-4
The Chi-Square: a Large-Sample Goodness of Fit Test
Journal Article V8, N1
Testing for MTBF
Journal Article V14, N3
Developing Highly Reliable and Safe Devices
Journal Article V9, N3
Statistical Analysis of Reliability Data, Part 2: On Estimation and Testing
START 00-4
Analysis of "One-Shot" Devices
START 2004-3
Censored Data
START 2004-7
Derating
START 99-4
Accelerated Testing
START 2003-7
Reliability Estimations for the Exponential Life
START 2005-1
Operating Characteristic (OC) Functions and Acceptance Sampling Plans