Training and consultancy for testing laboratories.

Archive for April, 2017

Measurement Uncertainty – Comparing the GUM and ‘top down’ approaches

By request, I have tabulated the differences, advantages and disadvantages of the two broad approaches to measurement uncertainty (MU) evaluation for easier appreciation.

GUM (bottom up) approach versus top down approaches

  • GUM: Component-by-component evaluation using Gauss’ error propagation law for uncorrelated errors.
  • Top down: Component-by-component evaluation using Gauss’ error propagation law for uncorrelated errors.

Which components?

  • GUM: Studies the uncertainty contributions in each step of the test method as far as possible.
  • Top down: Uses the repeatability, reproducibility and trueness of the test method, according to the basic principle: accuracy = trueness (estimates of bias) + precision (estimates of random variability).

  • GUM: The “modeling approach” or “bottom up approach”, based on a comprehensive mathematical model of the measurement procedure, evaluating each individual uncertainty contribution as a dedicated input quantity.
  • Top down: The “empirical approach” or “top down approach”, based on whole method performance, which captures the effects of as many relevant uncertainty sources as possible using the method bias and precision data. Such approaches are fully compliant with the GUM, provided that the GUM principles are observed.

  • GUM: Acknowledged as the master document on the subject of measurement uncertainty.
  • Top down: There are a few alternative top down approaches, which are receiving greater attention from the global testing community today.

  • GUM: Classifies uncertainty components according to their method of determination into Type A (obtained by statistical analysis) and Type B (obtained by means other than statistical analysis, such as transforming a given uncertainty, e.g. that of a CRM, or past experience).
  • Top down: Considers mainly Type A data from the laboratory’s own statistical analysis of within-lab method validation and inter-laboratory comparison studies.

  • GUM: Assumes that systematic errors are either eliminated by technical means or corrected by calculation.
  • Top down: Allows for method bias in the uncertainty budget.

  • GUM: When calculating the combined standard uncertainty of the final test result, all uncertainty components are treated equally.
  • Top down: Combines the use of existing data from validation studies with the flexibility of additional model-based evaluation of the uncertainty contributions of individual residual effects.

Advantages

GUM (bottom up) approach:

  1. Demands critical assessment and full understanding of the analytical steps in a test method.
  2. Consistent with other fields of measurement, such as calibration.
  3. The MU result generated is relevant to the particular laboratory that produces it.

Top down approaches:

  1. Quality data from method validation and inter-lab comparison studies are readily available in a well run accredited laboratory.
  2. A very much simpler MU evaluation process.
  3. The MU data of a test method are dynamic and current, because they use existing, experimentally determined quality control checks and method validation results.
  4. The approach is based on statistical analysis of data generated in intra- and inter-laboratory collaborative studies on the use of a method to analyze a diversity of sample matrices.

Disadvantages

GUM (bottom up) approach:

  1. The process is tedious and time consuming.
  2. The methodology may underestimate the measurement uncertainty, partly because it is hard to include all possible uncertainty contributions.
  3. GUM may unrealistically assume that certain errors are random (i.e. normally distributed) and independent.
  4. GUM provides a broad indication of the possible level of uncertainty associated with the method rather than with a measurement.
  5. It does not take into account either matrix-associated errors or the actual day-to-day variation seen in a laboratory.
  6. GUM does not apply well when the test method has no mathematical model.

Top down approaches:

  1. The top down approach may not by itself identify where the major errors could be occurring in the process, and the results generated are the products of the technical competence of the laboratory concerned.
  2. The inter-lab reproducibility data considered in certain instances may not be fully representative of the variability of results on actual samples, unless the method is standardized.
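To make the “Gauss’ error propagation law for uncorrelated errors” in the comparison above more concrete, here is a minimal Python sketch. The measurement model (a concentration calculated as mass divided by volume), the input values and the function names are illustrative assumptions only, not taken from any particular test method.

```python
import math

def combined_standard_uncertainty(model, values, uncertainties, rel_step=1e-6):
    """Combine uncorrelated standard uncertainties through a measurement model.

    Uses u_c(y)^2 = sum_i (df/dx_i)^2 * u(x_i)^2, with each sensitivity
    coefficient df/dx_i estimated by a central finite difference.
    """
    variance = 0.0
    for i, (x, u) in enumerate(zip(values, uncertainties)):
        step = abs(x) * rel_step or rel_step  # avoid a zero step when x == 0
        upper = list(values)
        lower = list(values)
        upper[i] = x + step
        lower[i] = x - step
        sensitivity = (model(*upper) - model(*lower)) / (2 * step)
        variance += (sensitivity * u) ** 2
    return math.sqrt(variance)

# Hypothetical model: concentration (mg/L) = mass (mg) / volume (L)
def concentration(mass_mg, volume_l):
    return mass_mg / volume_l

values = [100.0, 0.5]         # assumed mass and volume
uncertainties = [0.2, 0.001]  # assumed standard uncertainties of each input

u_c = combined_standard_uncertainty(concentration, values, uncertainties)
expanded_u = 2 * u_c          # coverage factor k = 2 (about 95 % confidence)
print(f"u_c = {u_c:.3f} mg/L, U (k=2) = {expanded_u:.3f} mg/L")
```

In a real GUM budget every step of the method would appear as an input quantity, and correlated inputs would need covariance terms that this sketch leaves out.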






Measurement Uncertainty – Top Down Approaches



With regard to measurement uncertainty (MU) evaluation for chemical, microbiological and medical testing laboratories, I have been strongly advocating the holistic “top-down” approaches. These consider the test method performance as a whole, making use of routine quality control (QC) and method validation data, instead of the time consuming and clumsy step-by-step ISO GUM “bottom-up” approach, which studies the uncertainty contributions in each and every step of the laboratory analysis before combining them into the expanded uncertainty of the method.

Indeed, the top-down MU approach evaluates the overall variability of the analytical process in question.

Your ISO 17025 accredited laboratory should have a robust laboratory quality management system in place and carry out regular control check sample analyses on each and every test method. This is your routine practice to ensure the reliability and accuracy of the test results reported to the end users.

Over time, you will have collected a wealth of such QC data on the performance of all your test methods. They are sitting around in your LIMS or other computer system. Furthermore, you will have carried out method validation or verification as part of your laboratory practice. What you really need to do is re-organize all these QC data to your benefit with the help of a PC or laptop computer.

Of course, the top-down approaches, like GUM, have certain limitations, such as a possible mismatch between your sample matrices and the laboratory reference check samples, or relatively unstable reference materials. But these issues have been addressed by various top-down methods by adding an uncertainty contribution from actual sample analysis.
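As a rough illustration of how routine QC data can be re-organized into an uncertainty estimate, the Python sketch below combines a within-laboratory reproducibility component with a bias component. This is one simplified formulation; the function name, the control-sample results and the reference value are all assumptions, and the exact formulas should be taken from whichever guidance document your laboratory follows.

```python
import math
import statistics

def top_down_uncertainty(qc_results, reference_value, u_reference):
    """Estimate expanded uncertainty from control-sample data (simplified).

    u(Rw)   : within-laboratory reproducibility, taken as the standard
              deviation of control-sample results collected over time.
    u(bias) : combines the observed bias, the standard error of the mean
              and the standard uncertainty of the reference value.
    """
    n = len(qc_results)
    mean = statistics.mean(qc_results)
    u_rw = statistics.stdev(qc_results)
    bias = mean - reference_value
    u_bias = math.sqrt(bias**2 + (u_rw / math.sqrt(n))**2 + u_reference**2)
    u_combined = math.sqrt(u_rw**2 + u_bias**2)
    return {"u_Rw": u_rw, "bias": bias, "u_bias": u_bias,
            "u_c": u_combined, "U_k2": 2 * u_combined}

# Assumed control-sample results (mg/L) on a reference material
# with an assumed certified value of 10.0 mg/L
qc = [10.2, 9.9, 10.1, 10.3, 9.8, 10.0, 10.2, 10.1]
result = top_down_uncertainty(qc, reference_value=10.0, u_reference=0.05)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The same calculation can be re-run whenever new control results are added, which is what keeps a top-down estimate current.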

I list below some good references of the top-down approaches which are widely practiced by our US and EU peers. In Asia, I reckon China CNAS is probably the only national accreditation body leading the local testing industry in adopting these approaches. The other national accreditation bodies seem to be slow in promoting this noble subject for whatever reason.

  • ISO 21748:2010 Guidance for the use of repeatability, reproducibility and trueness estimates in measurement uncertainty estimation
  • ISO 11352:2012 Water quality – Estimation of measurement uncertainty based on validation and quality control data
  • ISO 11095:1996 (2012) Linear calibration using reference materials
  • ISO 19036:2006 Microbiology of food and animal feeding stuffs – Guidelines for the estimation of measurement uncertainty for quantitative determinations; Amd 1:2009 Measurement uncertainty for low counts
  • BS 8496:2007 Water quality. Enumeration of micro-organisms in water samples
  • ASTM D6299-08 Applying statistical quality assurance and control charting techniques to evaluate analytical measurement system performance
  • ASTM E2554-07 Estimating and monitoring the uncertainty of test results of a test method in a single laboratory using a control sample program
  • ASTM E2093-05 Optimizing, controlling and reporting test method uncertainties from multiple workstations in the same laboratory organization
  • EuroLab Technical Report No. 1/2006 Guide to the Evaluation of Measurement Uncertainty for Quantitative Test Results
  • Nordtest Technical Report TR 537 Edition 3.1 Handbook for Calculation of Measurement Uncertainty in Environmental Laboratories
  • A2LA G108 – Guidelines for Estimating Uncertainty for Microbiological Counting Methods
  • CNAS-GL34:2013 Guidance for measurement uncertainty evaluation based on quality control data in environmental testing (基于质控数据环境检测测量不确定度评定指南)


DOE – Why choosing 2-level factorial design?


An ideal experiment has a design with relatively few runs that covers the many factors affecting the desired result. But it is impossible to accurately control and manipulate “too many” experimental factors at a time. Worse still, if we consider the number of levels for each factor, the number of experimental runs required grows exponentially.

We can illustrate this point easily with the following discussion.

Assume that we have a 2-factor experiment with factor A having 3 levels (say, temperature at 30 °C, 60 °C and 90 °C) and factor B having 2 levels (say, catalyst X and Y). Then there are 3 x 2 = 6 possible combinations of these two factors:

A(30) X

A(60) X

A(90) X

A(30) Y

A(60) Y

A(90) Y

Similarly, if there were a third experimental factor C with 4 levels (say, Pressure, 1 bar, 2 bar, 3 bar, 4 bar), then there would be 3 x 2 x 4 = 24 possible combinations:

A(30) X 1bar

A(60) X 1bar

A(90) X 1bar

A(30) Y 1bar

A(60) Y 1bar

A(90) Y 1bar


A(30) X 2bar

A(60) X 2bar

A(90) X 2bar

A(30) Y 2bar

A(60) Y 2bar

A(90) Y 2bar


A(30) X 3bar

A(60) X 3bar

A(90) X 3bar

A(30) Y 3bar

A(60) Y 3bar

A(90) Y 3bar


A(30) X 4bar

A(60) X 4bar

A(90) X 4bar

A(30) Y 4bar

A(60) Y 4bar

A(90) Y 4bar

Therefore, the general pattern is obvious: if we have m factors F1, F2, …, Fm with k1, k2, …, km levels respectively, then there are k1 x k2 x … x km possible runs in total. Note that if the number of levels is the same k for each factor, this product is just k x k x … x k (m times), or k^m.
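The counting rule above can be checked by enumerating the runs directly. The short Python sketch below uses the temperature, catalyst and pressure levels from the example; the variable names are my own.

```python
from itertools import product
from math import prod

# Factors and levels from the example above
factors = {
    "temperature_C": [30, 60, 90],
    "catalyst": ["X", "Y"],
    "pressure_bar": [1, 2, 3, 4],
}

# Every combination of one level per factor is one run of the full factorial
runs = list(product(*factors.values()))

# The number of runs is the product of the number of levels of each factor
expected_runs = prod(len(levels) for levels in factors.values())

print(len(runs), expected_runs)
for run in runs[:3]:
    print(dict(zip(factors, run)))
```

Adding a factor to the dictionary multiplies the run count by its number of levels, which is the growth the rule describes.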

Let’s see how serious exponential growth of the number of experimental runs is when the number of levels and factors increase:

  • If level k = 2, a 3 factor experiment has 2^3 = 8 possible runs
  • If level k = 2, a 4 factor experiment has 2^4 = 16 possible runs
  • If level k = 2, a 5 factor experiment has 2^5 = 32 possible runs
  • If level k = 3, a 3 factor experiment has 3^3 = 27 possible runs
  • If level k = 3, a 4 factor experiment has 3^4 = 81 possible runs
  • If level k = 3, a 5 factor experiment has 3^5 = 243 possible runs

Clearly, if we want to run all possible combinations in an experimental study, a so-called full-factorial experiment, the number of runs gets too large to be practical for more than 4 or 5 factors with 2 levels, and much more so for 5 factors with 3 levels.

You may then ask: if I were to stick to 2 level designs, could I find ways to control the number of experimental runs?

The answer is yes, provided you are able to select a subset of the possibilities in some clever way, so that “most” of the important information that could be obtained by running all the possible combinations of factor settings is still gained, but with a drastically reduced number of runs. One may do so on the basis of previous scientific knowledge, theoretical inferences and/or experience. But this will not be easy.

Indeed, there is a way to do so: follow the Pareto Principle, or 80-20 rule, which is defined as follows:

The Pareto Principle: For processes with many possible causes of variation, adequately controlling just the few most important is all that is required to produce consistent results. Or, to express it more quantitatively (but only as a rough approximation), controlling the vital 20% of the causes achieves 80% of the desired effects.

In other words, in order to get the most bang for your buck, you should focus on the important stuff and ignore the unimportant. Of course, the challenge is determining what is important and what is not.

If you can do so, then even with, say, 10 candidate factors you can happily run a 2 level factorial design on the 3 important factors only, in 2^3 = 8 runs instead of 2^10 = 1024 runs!
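As a small illustration, a 2 level full factorial design on three factors can be generated in a few lines of Python. The factor names are assumptions, and the levels are written in the common coded form of -1 (low) and +1 (high).

```python
from itertools import product

def two_level_full_factorial(factor_names):
    """Return every run of a 2-level full factorial design in coded units."""
    return [dict(zip(factor_names, levels))
            for levels in product((-1, +1), repeat=len(factor_names))]

# Assumed "vital few" factors picked out from a longer list of candidates
vital_few = ["temperature", "catalyst", "pressure"]
design = two_level_full_factorial(vital_few)

for run_number, run in enumerate(design, start=1):
    print(run_number, run)

# Runs needed for the 3 vital factors versus all 10 candidate factors
print(len(design), 2 ** 10)
```

Whether the other seven factors can really be ignored is a judgment that has to come from subject knowledge or from a screening experiment, not from the design itself.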


Some essential terminology and elements in DOE


Let’s look into some key operational characteristics that we think a good experiment should have in order to be effective. Among these, we would certainly include, but not be limited to, subject matter expertise, “GLP” (Good Laboratory Practices), proper equipment maintenance and calibration, and so forth, none of which have anything to do with statistics.

But, there are other operational aspects of experimental practice that do intersect statistical design and analysis, and which, if not followed, can also lead to inconsistent, irreproducible results.  Among these are randomization, blinding, proper replication, blocking and split plotting.  We shall discuss them in due course.

To begin with, we need to define and understand some terminology. It should be emphasized that the DOE terminology is not uniform across disciplines and even across textbooks within a discipline.  Common terms are described below.

  1. A response variable is an outcome of an experiment. It may be a quantitative measurement such as percentage by volume of mercury in a sample of river water, or it may be a qualitative result such as absence/presence of urine sugar.
  2. A factor is an experimental variable that is being investigated to determine its effect on a response. It is important to realize that a factor is considered controllable by the experimenter, i.e. the values, or “levels”, of the factor can be determined prior to the beginning of the test program and can be executed as stipulated in the experimental design. Examples of factors are temperature, pressure and catalyst.
  3. We use “level” to refer to the value of both qualitative and quantitative factors. Examples of “levels” of a temperature factor are the control temperature CTRL and CTRL + 20°.
  4. Additional variables that may affect the response but cannot be controlled in an experiment are called covariates. Covariates are not additional responses, that is, their values are not affected by the factors in the experiment. Rather, covariates and the experimental factors jointly influence the response.

For example, in an experiment involving temperature and humidity, suppose we can control the temperature of the laboratory equipment, whereas the humidity can be measured but not controlled. In such an experiment, temperature would be regarded as an experimental factor and humidity as a covariate.

  5. A test run is a single factor-level combination for which an observation (response) is obtained.
  6. Repeat tests are two or more observations that are obtained for a specified combination of levels of the factors. Repeat tests are actually distinct test runs, conducted under as identical experimental conditions as possible, but they need not be obtained in back-to-back test runs.
  7. Replications are repetitions of a portion of the experiment (or the entire experiment) under two or more different conditions, for example, on two or more different days.

It is to be noted that repeat tests and replication increase precision by reducing the standard deviation of the statistics used to estimate effects (see point 12 below).

  8. Experimental responses are only comparable when they result from observations taken on homogeneous samples or experimental units. Homogeneous samples do not differ from one another in any systematic manner and are as alike as possible on all characteristics that might affect the response.
  9. If experimental units (or samples) produced by one manufacturer are compared with the same experimental units produced by a second manufacturer, any differences noted in the responses for one level of a factor could be due to the different levels of the factor, to the different manufacturers, or to both. In this situation, the effect of the factor is said to be confounded with the effect due to the manufacturers.
  10. When a satisfactory number of homogeneous experimental units cannot be obtained, statistically designed experiments are often blocked so that homogeneous experimental units receive each level of the factor(s). Blocking divides the total number of samples into two or more groups or blocks (e.g. manufacturers) of homogeneous experimental units, so that the units within each block are more homogeneous than the units in different blocks. Hence, blocking increases the precision of the response (decreases variability) by controlling the systematic variation attributable to non-homogeneous experimental units or test conditions.

A good example is separating different ethnic groups into different blocks in a social behavior study.

  11. The terms design and layout are often used interchangeably when referring to experimental designs. The layout or design of the experiment includes the choice of factor-level combinations to be examined, the number of repeat tests or replications (if any), blocking (if any), the assignment of the factor-level combinations to the experimental units, and the sequence of the test runs.
  12. An effect of the design factors on the response is measured by a change in the average response under two or more factor-level combinations. In its simplest form, the effect of a single two-level factor on a response is measured as the difference in the average response for the two levels of the factor; that is

Factor effect = average response at one level – average response at a second level

In other words, factor effects measure the influence of different levels of a factor on the value of the response. An observed effect is then said to be sufficiently precise if the standard deviation (or equivalently, the variance) of this statistic is sufficiently small through repeat testing or replication. There may be the presence of joint factor effects. We will discuss this in future blogs too.
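The factor effect formula above is easy to compute by hand, but a short Python sketch may help. The two factors, the coded levels (-1 for low, +1 for high) and the yield figures are made-up illustrative data.

```python
from statistics import mean

# Each run: coded factor levels and the observed response (assumed data)
runs = [
    {"temperature": -1, "catalyst": -1, "yield": 60.0},
    {"temperature": +1, "catalyst": -1, "yield": 72.0},
    {"temperature": -1, "catalyst": +1, "yield": 54.0},
    {"temperature": +1, "catalyst": +1, "yield": 68.0},
]

def main_effect(runs, factor, response="yield"):
    """Average response at the high level minus average at the low level."""
    high = mean(r[response] for r in runs if r[factor] == +1)
    low = mean(r[response] for r in runs if r[factor] == -1)
    return high - low

temperature_effect = main_effect(runs, "temperature")
catalyst_effect = main_effect(runs, "catalyst")
print(temperature_effect, catalyst_effect)
```

With these numbers, raising the temperature increases the average yield by 13 units, while changing the catalyst lowers it by 5.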

  13. Randomization of the sequence of test runs or of the assignment of factor-level combinations to experimental units protects against unknown or unmeasured sources of possible bias. Randomization helps validate the assumptions needed to apply certain statistical techniques.

For example, analytical instrument drift is a common problem. If instrument drift builds up over time during a series of experiments, later tests will be biased. If all tests involving one level of a factor are run first and all tests involving the second level are run last, comparisons of the factor levels will be biased by this instrument drift and will not provide a true measure of the effect of the factor.

In fact, randomization of the test runs cannot prevent instrument drift, but it can help ensure that all levels of a factor have an equal chance of being affected by the drift. If so, differences in the response for pairs of factor levels will likely reflect the effects of the factor levels and not the effect of the drift.
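A randomized run order is simple to generate. The Python sketch below shuffles six planned runs (three repeats at each of two levels of a hypothetical factor); the seed value is an arbitrary choice, fixed only so that the order can be documented and reproduced.

```python
import random

# Planned runs: (factor level, repeat number), listed in systematic order
planned_runs = [("low", 1), ("low", 2), ("low", 3),
                ("high", 1), ("high", 2), ("high", 3)]

# A fixed seed makes the randomized order reproducible for the lab record
rng = random.Random(2017)
run_order = rng.sample(planned_runs, k=len(planned_runs))

for position, (level, repeat) in enumerate(run_order, start=1):
    print(position, level, repeat)
```

Every planned run still appears exactly once; only the sequence changes, so that drift is not systematically tied to one factor level.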

ANOVA Variance Testing: An Important Statistical Tool to Know

Analysis of variance (ANOVA) is a way to test the means of several groups, and the procedures that produced them, against each other. Because its design lends itself to multiple groups, it is a popular method for testing. ANOVA tests the null hypothesis against the alternative hypothesis in one test, just like a T-test (which is meant for only two means).

What Does a T-test do?

Since ANOVA testing is a larger scale T-test, it is important to grasp what a T-test does. A T-test, simply put, is a statistical analysis of the means of two sample groups. It looks at the two samples and determines whether the populations they come from differ significantly from each other. A T-test compares only two groups at a time. Because of this constraint, comparing more than two groups takes multiple T-tests. While a T-test provides useful data that can help support a hypothesis, this comes at a risk. Increasing the number of tests increases the calculations and the time it takes to test a hypothesis. It also leaves room for more error, because every additional test is another chance of a false positive. This risk leads to the larger format of the T-test: the ANOVA test.

ANOVA: The Basics

ANOVA’s statistical model compares the means of more than two groups of data. Just like a T-test, it analyzes the groups’ procedures against each other to gauge whether they are significantly different. Because it is designed for multiple groups, it is preferred over conducting multiple two-group T-tests. An ANOVA test produces fewer errors with these larger numbers of groups and usually takes less time. It is a general test, though: ANOVA tests a general hypothesis (whether any of the group means differ) and does not by itself answer a narrower question, such as which particular group differs from which.

Interpreting ANOVA Statistics

When an ANOVA test is conducted, it produces values that indicate whether the null hypothesis can be rejected. The null hypothesis being tested is that the population means are equal. The populations are subjected to different variables (treatments) to see whether their means remain equal. A p-value at or below the chosen significance level, commonly 5%, indicates that a difference exists between the populations.

The p-value is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. If that value is less than the significance level, such as 0.05 for 95% confidence, the null hypothesis is rejected. That means there is a difference between the populations based on the variables. If the p-value is greater than the significance level, the null hypothesis cannot be rejected. This means that there is not enough evidence to suggest any difference.

With larger sets of groups, it is important to look at the groups individually to determine which of them contribute to rejecting the null hypothesis. Confidence intervals can be used to determine the difference between groups. A confidence interval is, simply put, a range that expresses the uncertainty in an estimate of a population value.
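To show where the ANOVA test statistic comes from, here is a minimal one-way ANOVA sketch in plain Python. The three groups of results are made-up illustrative data, the function name is my own, and the critical value quoted in the comment is the tabulated F value for a 5% significance level with 2 and 6 degrees of freedom.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the individual results around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Assumed results (same units) from three procedures, three repeats each
method_a = [10.1, 10.2, 10.3]
method_b = [10.2, 10.3, 10.4]
method_c = [10.5, 10.6, 10.7]

f_stat, df_between, df_within = one_way_anova_f([method_a, method_b, method_c])
f_critical = 5.14  # tabulated F(0.05; 2, 6), rounded
print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) degrees of freedom")
print("Reject null hypothesis" if f_stat > f_critical else "Cannot reject null hypothesis")
```

In practice a statistics package would also report the p-value; for example, `scipy.stats.f_oneway` returns both the F statistic and the p-value for the same kind of input.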


Since ANOVA tests can compare population means based on variances, they have many practical uses in a laboratory setting. There are many standards that a laboratory must meet to be verified as being in working order. An ANOVA test can help reveal issues that might arise from implementing certain procedures. It can support scientifically appropriate predictions based on the variables given. An ANOVA test can detect deficiencies in a practice or procedure. Eliminating those practices or changing how they are implemented can contribute to the overall professionalism of the lab. With the changes made following an ANOVA test, a lab can be more productive and efficient.

Example of Applying an ANOVA Test in Lab Setting

ANOVA is designed to compare multiple populations, making it much more user friendly than using a T-test in this setting. For instance, suppose the analysis of an analyte that has been done for quite some time now has 4 different test methods. For a lab to determine with T-tests which method works best and will produce the most acceptable results, multiple T-tests would have to be done.

Let us say a lab is asked to perform that analysis while it is still trying to figure out the best way to do it. It will have to use the method it has been using, whether or not that is the most efficient way. If the method is less efficient, it might take too much time, which could turn into loss of business. Using the wrong procedure could also produce a faulty outcome.

With an ANOVA test, only one test is needed to compare all of the ways a test method can be done. Having to do only one test eliminates the time wasted on multiple T-tests. This helps the lab decide which procedure to use, eliminate unnecessary procedures and use the most efficient and current method possible.
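The cost of the multiple T-test route for the 4 test methods can be put into numbers. The sketch below counts the pairwise comparisons and estimates the chance of at least one false positive among them, under the simplifying assumption that the tests are independent and each uses a 5% significance level.

```python
from math import comb

methods = 4
alpha = 0.05  # significance level assumed for each individual T-test

# Number of two-method comparisons needed to cover every pair
pairwise_tests = comb(methods, 2)

# Chance of at least one false positive across all the tests,
# assuming the tests are independent
familywise_error = 1 - (1 - alpha) ** pairwise_tests

print(pairwise_tests, round(familywise_error, 3))
```

Six separate tests push the overall false positive risk to roughly 26%, whereas a single ANOVA keeps it at the chosen 5%.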




What is statistical design of experiments – DOE


You can collect scientific data in two fundamentally different ways: first, through passive observation, such as making astronomical observations, weather measurements or earthquake seismometer readings, and second, through active experimentation.

Passively awaited events and data are essential for determining how Nature works. They serve both as the raw material from which new models, that is, new theories, are built and as evidence for or against existing ones.

Active experimentation, on the other hand, is powerful in establishing a physical relationship between observed phenomena and their contributing causes. You are able to manipulate and evaluate the effects of possibly many different inputs, commonly referred to as experimental factors. But experimentation can be expensive and time consuming, and experimental ‘noise’ can make it difficult to clearly understand and interpret the results.

The classical experimental approach is to study each experimental factor separately. This one-factor-at-a-time (OFAT) strategy is easy to handle and widely employed.  But, we all know, this is not the most efficient way to approach an experimental problem when you have many variable factors to be considered.


Remember that the goal of any experiment is to obtain clear, reproducible results and establish general scientific validity. When a study involves a large number of variables, experimenting on one variable at a time while holding the other variables constant can be time consuming. Moreover, we cannot afford to run large numbers of trials due to budget constraints and other considerations.

Hence, it is important to have an approach with a statistical plan that gives experimenters or researchers an efficient research strategy, helping them to obtain better information faster and with less experimental effort. Design of experiments (DOE) is therefore a planned approach for determining cause and effect relationships. It can be applied to any process or experiment with measurable inputs and outputs.

To appreciate the power of DOE, let’s learn some of its historical background.

DOE was first developed for agricultural purposes. Agronomists were the first scientists to confront the problem of organizing their experiments to reduce the number of trials in the field. Their studies invariably include a large number of factors (parameters), such as soil composition, fertilizers’ effect, sunlight available, ambient temperature, wind exposure, rainfall rate, species studied, etc., and each experiment tends to last a long time before seeing and evaluating the results.

At the beginning of the 20th century, Fisher first proposed methods for organizing trials so that a combination of factors could be studied at the same time. These were the Latin Square, analysis of variance, etc. The ideas of Fisher were subsequently taken up by many well known agronomists such as Yates and Cochran and by statisticians such as Plackett and Burman, Youden, etc.

During World War II and thereafter, DOE became a tool for quality improvement, along with statistical process control (SPC). The concepts of DOE were used to develop powerful methods which proved useful to major industrial companies such as Du Pont de Nemours, ICI in England and TOTAL in France. They began using experimental designs in their laboratories to speed up their research on the products investigated.

Until 1980, DOE was used mainly in the process industries (i.e. chemical, food, pharmaceutical) because of the ease with which the engineers could manipulate factors such as time, temperature, pressure and flow rate. Then, stimulated by the tremendous success of Japanese electronics and automobiles, SPC and DOE underwent a renaissance. Today, the advent of personal computers has further catalyzed the use of these numerically intense methods.