Training and consultancy for testing laboratories.

Archive for the ‘Basic statistics’ Category

Outlier test statistics


It can be quite difficult to identify suspect values as outliers simply by inspecting a set of collected data. Very often, an outlier test statistic should be calculated before any further action, such as deleting the data point or carrying out further testing, because we do not wish to discard data without sound statistical justification. Read on … The outlier test statistic

What is an outlier?


In repeated chemical analysis, which assumes a normal probability distribution, we may find that an extreme value (i.e. the largest or smallest result) is a suspect that seems quite different from the rest of the data set.  In other words, this result does not seem to belong to the distribution of the rest of the data.  Such a suspect value is called an outlier.  Read more …. What is an outlier ?
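One common outlier test for a single suspect value in a normally distributed data set is Grubbs' test. The sketch below is only an illustration, not necessarily the test discussed in the linked article, and the replicate results are hypothetical:

```python
import numpy as np
from scipy import stats

def grubbs_statistic(data, alpha=0.05):
    """Two-sided Grubbs' test: returns (G, G_critical)."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    # G is the largest absolute deviation from the mean, in units of s
    g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value derived from the t-distribution (two-sided test)
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit

# Hypothetical replicate results (% w/w); 10.90 is the suspect value
results = [10.05, 10.10, 10.02, 10.08, 10.90, 10.04, 10.07]
g, g_crit = grubbs_statistic(results)
print(g > g_crit)  # → True: the suspect value is flagged as an outlier
```

Only when G exceeds the critical value is there statistical justification to reject the suspect value at the chosen significance level.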


Assumptions of using ANOVA


Analysis of variance (ANOVA) is useful in laboratory data analysis for significance testing.  It, however, has certain assumptions that must be met for the technique to be used appropriately.  These assumptions are somewhat similar to those of regression, because both linear regression and ANOVA are really just two ways of analyzing data using the general linear model.  Departures from these assumptions can seriously affect inferences made from the analysis of variance.

The assumptions are:

  1. Appropriateness of data

The outcome variables should be continuous, measured at the interval or ratio level, and unbounded or valid over a wide range.  The factors (grouping variables) should be categorical (i.e. objects such as Analyst, Laboratory, Temperature, etc.);

  2. Randomness and independence

Each value of the outcome variable should be independent of every other value to avoid bias. There should be no outside influence on the data collected. That means the samples of the groups under comparison must be randomly and independently drawn from their populations.

  3. Distribution

The continuous variable is approximately normally distributed within each group. This distribution of the continuous variable can be checked by creating a histogram and by a statistical test for normality such as the Anderson-Darling or the Kolmogorov-Smirnov.  However, the one-way ANOVA F-test is fairly robust against departures from the normal distribution.  As long as the distributions are not extremely different from a normal distribution, the level of significance of the ANOVA F-test is usually not greatly affected by lack of normality, particularly for large samples.

  4. Homogeneity of variance

The variance of each of the groups should be approximately equal. This assumption is needed in order to combine or pool the variances within the groups into a single within-group source of variation, SSW. Levene's test can be used to check variance homogeneity.  The null hypothesis is that the variances are homogeneous, so if the Levene statistic is not statistically significant (normally at alpha = 0.05), the variances are assumed to be sufficiently homogeneous to proceed with the data analysis.
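The homogeneity-of-variance check above can be run in a few lines with `scipy.stats.levene`. A minimal sketch, using hypothetical replicate results from three analysts:

```python
import numpy as np
from scipy import stats

# Hypothetical results (mg/kg) from three analysts testing the same sample
analyst_a = [5.2, 5.4, 5.1, 5.3, 5.2]
analyst_b = [5.0, 5.5, 5.3, 5.1, 5.4]
analyst_c = [5.3, 5.2, 5.4, 5.2, 5.3]

# center='median' gives the Brown-Forsythe variant, robust to non-normality
stat, p = stats.levene(analyst_a, analyst_b, analyst_c, center='median')

# If p >= 0.05, do not reject the null hypothesis of homogeneous variances;
# the groups may then be pooled and the ANOVA carried out
print(f"Levene W = {stat:.3f}, p = {p:.3f}")
```

A non-significant result here is a green light for pooling the within-group variances into SSW as described above.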


The variance ratio test


The Fisher F-test statistic is based on the ratio of two experimentally observed variances, which are squared standard deviations.  It is therefore useful for testing whether two standard deviations s1 and s2, calculated from two independent data sets, differ significantly from each other in terms of precision.  Read more …The variance ratio F-test statistic

Techniques used in combining uncertainties

From the facial expressions of participants attending my measurement uncertainty courses, I could tell that some of them had yet to grasp the important point of calculating the combined uncertainty from a series of uncertainty components.  I hope the following notes can bring more clarity to this issue….

Calculation techniques in combining uncertainties
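The usual technique for independent components is to combine the standard uncertainties in quadrature (root sum of squares), per the GUM. A minimal sketch, with hypothetical component values; the linked notes cover the full calculation techniques:

```python
import math

def combined_standard_uncertainty(components):
    """Combine independent standard uncertainties in quadrature."""
    return math.sqrt(sum(u**2 for u in components))

# Hypothetical uncertainty components for a volume measurement, all in mL
u_repeatability = 0.012
u_calibration = 0.015
u_temperature = 0.004

u_c = combined_standard_uncertainty([u_repeatability, u_calibration, u_temperature])
U = 2 * u_c  # expanded uncertainty at coverage factor k = 2 (~95 % confidence)
```

Note that this simple quadrature only applies when all components are expressed in the same units (or as relative uncertainties) and are independent of one another.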



What is least significant difference?


Application of Least Significant Difference



Microbiological data transformation

Microbiological counting is normally done after incubating a portion of a prepared sample on a sterilized culture medium at a stipulated temperature over time.  Very often, microbial growth data are heteroscedastic, or non-normally distributed. Heteroscedastic data tend to have unequal variability (scatter) across a set of independent or predictor variables.  The presence of such data can be demonstrated graphically as a cone-like shape on a scatter plot, as in Figure 1 below:


The outcome of the data analysis would be seriously flawed if we were to take the counts directly for statistical evaluation, as we would normally do for a set of chemical analysis data.

To overcome this, we may consider microbial growth data as being log-normally distributed, to cater for the physiological or biochemical mechanisms involved.

Many microbiologists will have noticed in their recovery studies that the % recoveries never seem satisfactory when the experimental colony counts are divided by the known inoculated number of bacteria; they tend to be in the region of 70% or so.  However, once the data are log-transformed to base 10, the relative standard deviations (RSDs) obtained are much more acceptable, as shown in the figure below:

Figure 2:  The % recoveries of colony forming units (cfu)/ml

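The effect of the log10 transformation on the spread of colony counts can be seen in a few lines. A minimal sketch, using hypothetical counts from replicate platings of one sample:

```python
import numpy as np

# Hypothetical colony counts (cfu/mL) from replicate platings of one sample
counts = np.array([180000, 95000, 210000, 150000, 120000])

def rsd_percent(x):
    """Relative standard deviation in %."""
    return 100 * np.std(x, ddof=1) / np.mean(x)

rsd_raw = rsd_percent(counts)            # RSD of the raw counts (large)
rsd_log = rsd_percent(np.log10(counts))  # RSD after log10 transformation (small)
```

For these hypothetical counts the raw RSD is around 30 %, while the RSD of the log10-transformed counts drops to a few per cent, which is the behaviour described above.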

Controversy of LOD (Detection Limit)

The limit of detection (LOD) is an important characteristic of a test method involving trace analysis but its concept has been, and still is, one of the most controversial in analytical chemistry.   Read more …  Controversies on Limit of Detection

What is MAD?


MAD stands for the median absolute deviation.

It is one way to estimate the variability in a set of data, particularly when the data set has some extreme values as outliers.  The general approach is to take the absolute values of deviations of individual values from a measure of their central tendency (i.e., the median of the data set).

Why do we choose to use the median instead of the arithmetic mean?

The median, by definition, is the middle value in an ordered sequence of data. Thus, it is unaffected by extreme values in the data set.

Therefore, we calculate the deviation of each data point from the median of the original data set and then find the median of the absolute values of these deviations; this is the MAD.  It is widely used in robust statistics.

We often wish to use the MAD as an estimator of the population standard deviation, but it cannot be adopted directly: it must first be multiplied by a constant, 1.4826, to become a consistent estimate of the population standard deviation.

It may be noted that an important desirable property of a statistic is consistency.  A consistent statistic comes nearer to the population parameter as the size of the sample on which it is based increases. For example, analysis of 3 samples may give a sample mean that is quite different from the expected population mean, but when 30 samples are analyzed, the mean will be found much closer to the population parameter, as the mean is known to be a consistent statistic.

You may wish to read our previous article, which demonstrates how the constant 1.4826 is derived.
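The calculation described above takes only a few lines. A minimal sketch, using a hypothetical data set containing one outlier:

```python
import numpy as np

def mad_sd(data):
    """Robust standard-deviation estimate: 1.4826 * median absolute deviation."""
    x = np.asarray(data, dtype=float)
    med = np.median(x)                    # median of the data
    mad = np.median(np.abs(x - med))      # median of absolute deviations
    return 1.4826 * mad                   # scaled to estimate sigma

# Hypothetical replicate results with one outlier (25.9)
results = [20.1, 20.3, 19.9, 20.2, 20.0, 25.9, 20.2]
robust_s = mad_sd(results)               # barely affected by the outlier
classical_s = np.std(results, ddof=1)    # inflated by the outlier
```

For this data set the classical standard deviation is over 2, dominated by the single outlier, while the MAD-based estimate stays near 0.15, reflecting the spread of the well-behaved values.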




Estimation of both sampling and measurement uncertainties by Excel ANOVA Data Analysis tool

Sampling and analysis

Estimation of sampling and analytical uncertainties using Excel Data Analysis toolpak

In the previous blog, we used basic ANOVA principles to analyze total chromium (Cr) data for the estimation of measurement uncertainty, covering both sampling and analytical uncertainties….
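The same variance decomposition can be computed without the toolpak. The sketch below uses hypothetical duplicate analyses on several sampling increments and the standard one-way ANOVA identities (it is an illustration, not the Cr data from the linked post):

```python
import numpy as np

# Hypothetical duplicate analyses (mg/kg) on 4 sampling increments
data = np.array([
    [4.2, 4.4],
    [3.8, 3.7],
    [4.6, 4.5],
    [4.0, 4.3],
])
k, n = data.shape   # k sampling targets, n replicate analyses each

grand = data.mean()
# Between-group mean square: variation between sampling increments
ms_between = n * ((data.mean(axis=1) - grand) ** 2).sum() / (k - 1)
# Within-group mean square: variation between replicate analyses
ms_within = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum() / (k * (n - 1))

s_analysis = np.sqrt(ms_within)                           # analytical std uncertainty
s_sampling = np.sqrt(max(ms_between - ms_within, 0) / n)  # sampling std uncertainty
s_measurement = np.sqrt(s_analysis**2 + s_sampling**2)    # combined (quadrature)
```

Excel's one-way ANOVA output supplies the same MS(between) and MS(within) figures, so the last three lines are all that is needed once the table is in hand.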