
In statistics, the average (mean) and sample standard deviation are known as “estimators” of the population mean and standard deviation. These estimates improve as the amount of data collected increases. As we know, the use of these statistics requires data that are normally distributed, and for confidence intervals employing the standard deviation of the mean this tends to be so, because means of repeated measurements are more nearly normal than the individual values.
However, even when real experimental data are broadly normal, the distribution often contains values that are seriously flawed, typically extremely low or extremely high ones. If we can identify such data and remove them from further consideration, then all is well and good.
Sometimes this is possible, but not always. This is a problem, as a single rogue value can seriously upset our calculations of the mean and standard deviation.
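To see how much damage one rogue value can do, here is a minimal Python sketch. The five results are made-up illustrative numbers, not data from any real study; the last one is deliberately mis-recorded as 102.0 instead of 10.2.

```python
import statistics

# Hypothetical replicate results (illustrative values only).
clean = [10.1, 10.3, 9.9, 10.0, 10.2]
# Same data, but the last value has been mis-recorded (10.2 -> 102.0).
contaminated = clean[:-1] + [102.0]

for label, data in (("clean", clean), ("contaminated", contaminated)):
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    print(f"{label:>12}: mean = {mean:.2f}, sd = {sd:.2f}")
```

With the rogue value included, both the mean and the standard deviation move far away from anything that describes the four good results.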
Estimators that can tolerate a certain amount of ‘bad’ data are called robust estimators, and can be used when it is not possible to ensure that the data being processed have the correct characteristics.
For example, we can use the middle value of a set of ascending data (called the median) as a robust estimator of the mean, and a scaled version of the range of the middle 50% of the data (called the normalized interquartile range, or normalized IQR) as a robust estimator of the standard deviation.
By definition, the median is the middle value of a set of data when arranged in ascending order. If there is an odd number of data, then the median is the unique middle datum. If there is an even number, then the median is the average of the middle two data.
The median is robust because, no matter how outrageous one or more extreme values are, they are only individual values at the end of a list. Their magnitude is immaterial.
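The definition of the median can be sketched in a few lines of Python. The function below is only an illustration (the standard library's `statistics.median` does the same job), and the data values are hypothetical.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: unique middle datum
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average of middle two

# Hypothetical results: the second set has one wildly wrong value.
data = [10.1, 10.3, 9.9, 10.0, 10.2]
spoiled = [10.1, 10.3, 9.9, 10.0, 102.0]

print(median(data), median(spoiled))
```

Both calls return the same middle value, because the rogue 102.0 simply sits at the end of the sorted list, whatever its size.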
The interquartile range (IQR) is a measure of the spread of the “middle fifty” in a data set, i.e. the range of values that spans the middle 50% of the data. About three quarters of the IQR (more precisely 0.7413 × IQR, since the IQR of normally distributed data is about 1.349 standard deviations), known as the normalized IQR, is an estimate of the standard deviation. The IQR itself is the first quartile Q1 subtracted from the third quartile Q3:
IQR = Q3 – Q1
A problem with the IQR is that it is not practical for small data sets, as we must have sufficient data to define the quartiles (the sections of the ordered data that each contain one-quarter of the data).
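As a rough sketch, the normalized IQR can be computed with Python's standard library. The twelve values are hypothetical, and the factor 0.7413 assumes the good data are normally distributed. Note that `statistics.quantiles` uses its default "exclusive" interpolation here; spreadsheets and other packages define quartiles slightly differently, so small differences in Q1 and Q3 are to be expected.

```python
import statistics

NIQR_FACTOR = 0.7413  # 1 / 1.349; for normal data the IQR is about 1.349 sigma

def normalized_iqr(values):
    """Scaled interquartile range, a robust estimate of the standard deviation."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return NIQR_FACTOR * (q3 - q1)

# Hypothetical data set with one outlying value (14.9).
data = [9.8, 9.9, 10.0, 10.0, 10.1, 10.1, 10.2, 10.2, 10.3, 10.4, 10.5, 14.9]

print(f"sample sd      = {statistics.stdev(data):.3f}")
print(f"normalized IQR = {normalized_iqr(data):.3f}")
```

The sample standard deviation is inflated by the single outlier, while the normalized IQR stays close to the spread of the remaining eleven values.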
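Another robust estimator of standard deviation is based on the median absolute deviation (MAD). It is a fairly simple estimate that can be implemented easily in a spreadsheet. The MAD about the data set median is calculated by:
MAD = median(|x_i – median(x)|), i = 1, 2, …, n
For normally distributed data, the MAD is multiplied by 1.4826 (the result is often written MADe) to put it on the same scale as the standard deviation.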
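The MAD calculation is short enough to sketch directly in Python. The data are hypothetical, and the scaling constant 1.4826 (which converts the MAD into an estimate of the standard deviation) is valid only on the assumption that the good data are normally distributed.

```python
import statistics

MADE_FACTOR = 1.4826  # scales MAD to estimate sigma, assuming normal data

def mad(values):
    """Median of the absolute deviations from the data set median."""
    centre = statistics.median(values)
    return statistics.median([abs(x - centre) for x in values])

# Hypothetical results with one rogue value (102.0).
data = [10.1, 10.3, 9.9, 10.0, 102.0]

print(f"MAD  = {mad(data):.3f}")
print(f"MADe = {MADE_FACTOR * mad(data):.3f}")
print(f"sd   = {statistics.stdev(data):.3f}")
```

The rogue value contributes only one very large absolute deviation, which ends up at the end of the sorted list and so has no influence on the MAD.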
Robust methods have their place, particularly when we must keep all the data together, for example in an interlaboratory comparison study where an outlying result from a laboratory cannot simply be ignored. In such cases they are useful because they are less strongly affected by extreme values.
However, robust estimators are not always the best statistics, and wherever possible the statistics appropriate to the distribution of the data should be used.
So, when can we use these robust estimators?
Robust estimators can be considered to provide good estimates of the parameters for the ‘good’ data in an outlier-contaminated data set. They are appropriate when:
- The data are expected to be normally distributed. In this case, robust statistics give answers very close to ordinary statistics.
- The data are expected to be normally distributed, but contaminated with occasional spurious values which are regarded as unrepresentative or erroneous and which are approximately symmetrically distributed around the population mean. In this case, robust estimators are less affected by these extreme values and hence are useful.
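The first case in the list above can be checked with a small sketch. Instead of random numbers, it uses an idealized "sample" of 200 evenly spaced quantiles of a standard normal distribution (mean 0, standard deviation 1). This construction, and the scaling constants 1.4826 and 0.7413, are assumptions made for illustration.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

n_points = 200
std_normal = NormalDist(mu=0.0, sigma=1.0)
# Idealized normal sample: quantiles at probabilities 0.0025, 0.0075, ..., 0.9975.
ideal = [std_normal.inv_cdf((i - 0.5) / n_points) for i in range(1, n_points + 1)]

centre = median(ideal)
mad = median([abs(x - centre) for x in ideal])
q1, _q2, q3 = quantiles(ideal, n=4)

classical_mean, classical_sd = mean(ideal), stdev(ideal)
made = 1.4826 * mad          # MAD-based estimate of the standard deviation
niqr = 0.7413 * (q3 - q1)    # normalized IQR

print(f"mean = {classical_mean:.3f}, median = {centre:.3f}")
print(f"sd = {classical_sd:.3f}, MADe = {made:.3f}, normalized IQR = {niqr:.3f}")
```

On clean, normally distributed data of this kind, the classical and robust estimates agree closely, so little is lost by using the robust versions.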
Remember that robust estimators are not recommended where the data set shows evidence of multi-modality or heavy skewing, especially when it is expected to follow a non-normal or skewed distribution (such as binomial or Poisson with low counts, or chi-squared) that generates extreme values with reasonable likelihood.