What do we know about robust estimators?

In statistics, the sample mean (average) and sample standard deviation are known as “estimators” of the population mean and standard deviation. These estimates improve as the number of data points collected increases. Strictly, confidence intervals and similar inferences based on these statistics assume that the data are normally distributed. For confidence intervals employing the standard deviation of the mean, this assumption tends to hold, because the means of repeated samples are approximately normally distributed.

However, although real experimental data may be normally distributed, a data set will often contain values that are seriously flawed, typically extremely low or extremely high values. If we can identify such data and remove them from further consideration, then all is well and good.

Sometimes this is possible, but not always. This is a problem, as a single rogue value can seriously upset our calculations of the mean and standard deviation.

Estimators that can tolerate a certain amount of ‘bad’ data are called robust estimators. They can be used when it is not possible to ensure that the data being processed have the expected characteristics.

For example, we can use the middle value of a set of data arranged in ascending order (called the median) as a robust estimator of the mean, and a scaled version of the range spanned by the middle 50% of the data (called the normalized interquartile range, or normalized IQR) as a robust estimator of the standard deviation.

By definition, the median is the middle value of a set of data when arranged in ascending order. If there is an odd number of data, then the median is the unique middle datum. If there is an even number, then the median is the average of the middle two data.

The median is robust because, no matter how outrageous one or more extreme values are, they are only individual values at the end of an ordered list. Their magnitude is immaterial.
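The two cases of the definition, and the claim that the magnitude of an extreme value is immaterial, can be sketched in a few lines of Python. The data values below are hypothetical and chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical replicate results (illustrative values only).
odd_set = [10.1, 10.3, 10.2, 10.4, 10.0]         # 5 values: unique middle datum
even_set = [10.1, 10.3, 10.2, 10.4, 10.0, 10.5]  # 6 values: average of middle two

median_odd = median(odd_set)
median_even = median(even_set)
print("median of 5 values:", median_odd)
print("median of 6 values:", median_even)

# The same set with a mild and then a wild extreme value in last place.
mild = [10.0, 10.1, 10.2, 10.3, 15.0]
wild = [10.0, 10.1, 10.2, 10.3, 1500.0]

# The median ignores how extreme the last value is; the mean does not.
print("medians:", median(mild), median(wild))
print("means:  ", mean(mild), mean(wild))
```

The median is the same for both contaminated sets, whereas the mean is dragged towards the extreme value.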

The interquartile range (IQR) is a measure of the spread of the “middle fifty” in a data set, i.e. the range of values that spans the middle 50% of the data. It is the difference between the third quartile Q3 (the median of the upper half of the ordered data) and the first quartile Q1 (the median of the lower half):

IQR = Q3 – Q1

Roughly three quarters of the IQR (more precisely, 0.7413 × IQR for normally distributed data), known as the normalized IQR, is an estimate of the standard deviation.

A problem with the IQR is that it is unrealistic to calculate it for small data sets, as we must have sufficient data to define the quartiles (the points that divide the ordered data into sections each containing one-quarter of the data).
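As a rough sketch, the IQR and normalized IQR can be computed with Python's standard library. The data are hypothetical, the factor 0.7413 is the conventional scaling for normally distributed data, and the "inclusive" quartile method is only one of several conventions in use, so other software may return slightly different quartiles.

```python
from statistics import quantiles, stdev

# Hypothetical ordered results; the last value is a deliberate outlier.
data = [9.8, 9.9, 10.0, 10.0, 10.1, 10.1,
        10.2, 10.2, 10.3, 10.4, 10.5, 14.9]

# Quartile conventions differ between packages; "inclusive" is an assumption here.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                  # IQR = Q3 - Q1
normalized_iqr = 0.7413 * iqr  # robust estimate of the standard deviation

print(f"Q1 = {q1:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")
print(f"normalized IQR     = {normalized_iqr:.3f}")
print(f"sample std. dev.   = {stdev(data):.3f}")
```

With the outlier present, the ordinary standard deviation is inflated, while the normalized IQR still reflects the spread of the bulk of the data.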

Another robust estimator of the standard deviation is the median absolute deviation (MAD). It is a fairly simple estimate that can be implemented easily in a spreadsheet. The MAD about the data set median is calculated by:

MAD = median( | xi – median value | ), for i = 1, 2, …, n
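The calculation can be laid out step by step, much as it would be in spreadsheet columns. This is a minimal sketch with hypothetical data; the function name mad is an illustrative choice, not something taken from a library.

```python
from statistics import median

def mad(values):
    """Median absolute deviation about the median (unscaled)."""
    centre = median(values)                          # step 1: median of the data
    deviations = [abs(x - centre) for x in values]   # step 2: |xi - median|
    return median(deviations)                        # step 3: median of those

# Hypothetical results; 9.9 is a deliberate outlier.
data = [4.9, 5.0, 5.1, 5.2, 5.3, 9.9]

mad_value = mad(data)
print("median:", median(data))
print("MAD:   ", round(mad_value, 4))
```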

Robust methods are less strongly affected by extreme values. They have their place particularly when we must keep all the data together, for example in an interlaboratory comparison study where an outlying result from a laboratory cannot simply be ignored.

However, robust estimators are not the most efficient statistics when the data do follow the assumed distribution, and wherever possible the statistics appropriate to the distribution of the data should be used.

So, when can we use these robust estimators?

Robust estimators can be considered to provide good estimates of the parameters for the ‘good’ data in an outlier-contaminated data set. They are appropriate when:

  • The data are expected to be normally distributed. Here, robust statistics give answers very close to ordinary statistics.
  • The data are expected to be normally distributed, but contaminated with occasional spurious values which are regarded as unrepresentative or erroneous and are approximately symmetrically distributed around the population mean. Here, robust estimators are less affected by these extreme values and hence are useful.
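Both situations can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population mean of 100, standard deviation of 5, sample size, seed and the spurious values at 40 and 160 are all hypothetical, and the MAD is scaled by the constant 1.4826 discussed later in this text.

```python
import random
from statistics import mean, median, stdev

def scaled_mad(values):
    """MAD multiplied by 1.4826 to estimate the standard deviation."""
    centre = median(values)
    return 1.4826 * median([abs(x - centre) for x in values])

random.seed(42)  # fixed seed so the sketch is repeatable

# Case 1: simulated normally distributed data (hypothetical mean 100, SD 5).
clean = [random.gauss(100.0, 5.0) for _ in range(500)]

# Case 2: the same data plus a few spurious values placed symmetrically.
contaminated = clean + [160.0] * 5 + [40.0] * 5

for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(f"{label:>12}: mean={mean(data):.2f}  median={median(data):.2f}  "
          f"sd={stdev(data):.2f}  scaled MAD={scaled_mad(data):.2f}")
```

For the clean data the classical and robust estimates should agree closely; after contamination the standard deviation is inflated noticeably while the scaled MAD changes little.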

Remember that robust estimators are not recommended where the data set shows evidence of multi-modality or heavy skewing, especially when the data are expected to follow non-normal or skewed distributions (such as the binomial and Poisson distributions with low counts, or the chi-squared distribution), which generate extreme values with reasonable likelihood.

What is MAD?

MAD stands for the median absolute deviation.

It is one way to estimate the variability in a set of data, particularly when the data set has some extreme values as outliers. The general approach is to take the absolute values of the deviations of individual values from a measure of their central tendency (here, the median of the data set).

Why do we choose to use the median instead of the arithmetic mean?

The median, by definition, is the middle value in an ordered sequence of data. Thus, it is unaffected by extreme values in the data set.

Therefore, we calculate the deviation of each datum from the median of the original data set and then find the median of the absolute values of these deviations. This second median is the MAD, which is widely used in robust statistics.

We often wish to use the MAD as an estimator of the population standard deviation. It cannot be adopted directly, however: it must first be multiplied by a constant, 1.4826, to become a consistent estimate of the population standard deviation for normally distributed data.
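The scaling step can be sketched as follows. The data are hypothetical. The code also checks the constant numerically, assuming the standard result that for normally distributed data the MAD equals the standard deviation multiplied by the 75th-percentile point of the standard normal distribution (about 0.6745), so that the scaling constant is its reciprocal.

```python
from statistics import NormalDist, median, stdev

# Reciprocal of the 0.75 quantile of the standard normal distribution.
K = 1 / NormalDist().inv_cdf(0.75)
print(f"scaling constant: {K:.4f}")

def scaled_mad(values, k=K):
    """MAD multiplied by k, as an estimate of the standard deviation."""
    centre = median(values)
    return k * median([abs(x - centre) for x in values])

# Hypothetical results; 15.0 is a deliberate outlier.
data = [10.0, 10.1, 10.2, 10.3, 15.0]

robust_sd = scaled_mad(data)
classical_sd = stdev(data)
print(f"scaled MAD       = {robust_sd:.3f}")
print(f"sample std. dev. = {classical_sd:.3f}")
```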

It may be noted that an important desirable property of a statistic is consistency. A consistent statistic comes nearer to the population parameter as the size of the sample on which it is based gets larger. For example, analysis of 3 samples may give a sample mean that differs considerably from the expected population mean, but when 30 samples are analyzed, the mean will usually be found much closer to the population parameter, as the mean is known to be a consistent statistic.

You may wish to read our previous article https://consultglp.com/2015/02/02/robust-statistics-mad-method/ which demonstrates how the constant 1.4826 is derived.
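Consistency can be illustrated by simulation. The sketch below assumes a hypothetical normal population with mean 50 and standard deviation 2, and compares the typical error of the sample mean for 3 and for 30 results over many repeated trials. The function name, seed and number of trials are illustrative choices.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

TRUE_MEAN, TRUE_SD = 50.0, 2.0  # hypothetical population parameters

def average_error(sample_size, trials=1000):
    """Average absolute error of the sample mean over repeated samples."""
    total = 0.0
    for _ in range(trials):
        sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size)]
        sample_mean = sum(sample) / sample_size
        total += abs(sample_mean - TRUE_MEAN)
    return total / trials

error_n3 = average_error(3)
error_n30 = average_error(30)
print(f"typical error of the mean, n = 3 : {error_n3:.3f}")
print(f"typical error of the mean, n = 30: {error_n30:.3f}")
```

The typical error for 30 results should come out clearly smaller than for 3, which is what consistency leads us to expect.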

Mean or Median?


The mean and median are two of the three kinds of “averages”. The third one is called the mode. Both the mean and the median are measures of the central tendency of a dataset, but they have different meanings, with different advantages and disadvantages in applications.

The sample mean is calculated as the average of all the data, i.e., we add up all the observations and divide by the number of observations. The median, on the other hand, partitions the data into two parts such that there is an equal number of observations on either side of the median. So, if we have a set of 5 data arranged in ascending order, the middle value in 3rd place is the median. However, if we have a set of 6 data in ascending order, then the average of the 3rd and 4th data is the median.

One important advantage of the median is that it is not influenced by extreme values (or outliers, statistically speaking) in the dataset. Only the middle observation, or the average of the two middle observations, is used in the calculation, whilst the actual values of the remaining data are not considered. The median is commonly used in proficiency testing programs to assess inter-laboratory comparison data. Robust statistics also uses the MAD (median absolute deviation) as a measure of the variability of a univariate quantitative data set.

The mean, on the other hand, is sensitive to all values in the dataset because every observation affects the mean value, and extreme observations can have a substantial influence on the calculated mean.

Generally speaking, the mean has some very important mathematical properties that make it possible to prove theorems, such as the Central Limit Theorem. Many useful results in statistics and inference methods naturally give rise to the mean as a parameter estimate.

It is much more difficult to prove mathematical results related to the median, even though it is more robust to extreme observations. So, the mean is generally used for quantitative data, unless extreme values are present, in which case the median is used instead.

Therefore, you have to make a professional judgement on which one is to be used to suit your purpose.

Robust statistics – MAD method
