Archive for the ‘Basic statistics’ Category

Outlier test statistics in analytical data

When an analytical method is repeated several times on a given sample, measured values near the mean (or average) of the data set tend to occur more often than values further away from it. This is characteristic of analytical data that follow the normal probability distribution, and the phenomenon is known as central tendency.

However, time and again we notice some extremely low or high value(s) that are visibly distant from the remainder of the data. Such values are suspected outliers, which may be defined as observations in a set of data that appear to be inconsistent with the remainder of that set.

It is obvious that outlying values generally have an appreciable influence on the calculated mean value, and an even greater influence on the calculated standard deviation, if they are not examined carefully and removed where necessary.

However, we must remember that the random variation inherent in analysis does generate occasional extreme values purely by chance. If so, these values are indeed part of the valid data and should generally be included in any statistical calculations. On the other hand, undesirable human error or other deviations in the analytical process, such as instrument failure, may cause outliers to arise from a faulty procedure. Hence, it is important to have the effect of outliers minimized.

To minimize such effects, we have to find ways to identify outliers and distinguish them from chance variation. Many outlier tests are available that allow analysts to inspect suspect data and, if necessary, correct or remove erroneous values. These test statistics assume an underlying normal distribution and a relatively homogeneous test sample.

Furthermore, outlier testing needs careful consideration where the population characteristics are not known, or, worse, known to be non-normal.  For example, if the data were Poisson distributed, many valid high values might be incorrectly rejected because they appear inconsistent with a normal distribution. It is also crucial to consider whether outlying values might represent genuine features of the population.

Another approach is to use robust statistics which are not greatly affected by the presence of occasional extreme values and will still perform well when no outliers are present.

Outlier tests aplenty are at your disposal: Dixon’s, Grubbs’, Levene’s, Cochran’s, Thompson’s, Bartlett’s, Hartley’s, Brown-Forsythe’s, etc. They are quite simple to apply to a set of analytical data. However, for the outcome to be meaningful, the number of data examined should be reasonably large rather than just a few.

Therefore, outlier tests only provide us with an objective criterion, or signal, to investigate the cause; usually, outliers should not be removed from the data set solely because of the result of a statistical test. Instead, the tests highlight the need to inspect the data more closely in the first instance.
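
As an illustration of how such a test works, here is a minimal sketch of a two-sided Grubbs’ test in Python; the data set and the 5% significance level are hypothetical, and SciPy is used only to obtain the t-distribution critical value.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier in normally distributed data."""
    x = np.asarray(values, dtype=float)
    n = x.size
    mean, sd = x.mean(), x.std(ddof=1)
    G = np.max(np.abs(x - mean)) / sd                   # largest deviation in units of s
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)    # t critical value
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return G, G_crit, G > G_crit

# Hypothetical replicate results (mg/L) with one suspiciously high value
data = [10.1, 10.3, 9.9, 10.2, 10.0, 11.6]
G, G_crit, suspect = grubbs_test(data)
print(f"G = {G:.3f}, G_crit = {G_crit:.3f}, outlier suspected: {suspect}")
```

As stressed above, a significant result is a signal to investigate the suspect value, not a licence to discard it automatically.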

The general guidelines for acting on outlier tests on analytical data, based on the outlier testing and inspection procedure in ISO 5725-2 (Accuracy (trueness and precision) of measurement methods and results — Part 2: Basic method for the determination of repeatability and reproducibility of a standard measurement method), are as follows:

  • Test at both the 95% and the 99% confidence levels
  • All outliers should be investigated and any errors corrected
  • Outliers significant at the 99% level may be rejected unless there is a technical reason to retain them
  • Outliers significant only at the 95% level (normally called ‘stragglers’) should be rejected only if there is an additional technical reason to do so
  • Successive testing and rejection are permissible, but not to the extent of rejecting a large proportion of the data.

The above procedure leads to results that are not seriously biased by the rejection of chance extreme values, yet remain relatively insensitive to outliers at the frequency commonly encountered in measurement work. The application of robust statistics might be an even better choice.

What do we know about robust estimators?

In statistics, the average (mean) and the sample standard deviation are known as “estimators” of the population mean and standard deviation. These estimates improve as the number of data collected increases. As we know, the use of these statistics strictly requires data that are normally distributed, and for confidence intervals based on the standard deviation of the mean this tends to hold in practice (a consequence of the central limit theorem).

However, while real experimental data may well be so distributed, the data set will often contain values that are seriously flawed. These can be extremely low or high values. If we can identify such data and remove them from further consideration, then all is well and good.

Sometimes this is possible, but not always. This is a problem, as a single rogue value can seriously upset our calculations of the mean and standard deviation.

Estimators that can tolerate a certain amount of ‘bad’ data are called robust estimators, and can be used when it is not possible to ensure that the data being processed has the correct characteristics.

For example, we can use the middle value of a set of data arranged in ascending order (the median) as a robust estimator of the mean, and a suitably scaled interquartile range (the normalized IQR) as a robust estimator of the standard deviation.

By definition, the median is the middle value of a set of data arranged in ascending order. If there is an odd number of data, the median is the unique middle datum. If there is an even number, the median is the average of the middle two data.

The median is robust because, no matter how outrageous one or more extreme values are, they remain individual values at the ends of the ordered list; their magnitude is immaterial.

The interquartile range (IQR) is a measure of where the “middle fifty” lies in a data set, i.e. the range of values that spans the middle 50% of the data. For normally distributed data, the normalized IQR, approximately three-quarters of the IQR (more precisely IQR/1.349), is an estimate of the standard deviation. The IQR itself is the first quartile Q1 subtracted from the third quartile Q3:

                        IQR = Q3 – Q1

A problem with the IQR is that it is unrealistic for small data sets, as we must have sufficient data to define the quartiles (sections of the ordered data that each contain one-quarter of the data).

Another robust estimator of the standard deviation is the median absolute deviation (MAD); for normally distributed data, 1.4826 × MAD provides an estimate of the standard deviation. It is a fairly simple estimate that can be implemented easily in a spreadsheet. The MAD about the median of the data set is calculated by:

                        MAD = median( |xi – median(x)| ),   i = 1, 2, …, n
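
To make these robust estimators concrete, the short Python sketch below computes the median, the normalized IQR and the scaled MAD for a hypothetical data set containing one rogue value, and compares them with the classical mean and standard deviation; the scaling constants 1.349 and 1.4826 apply to normally distributed data.

```python
import numpy as np

def robust_estimates(values):
    """Median, normalized IQR and scaled MAD as robust estimates of location and spread."""
    x = np.asarray(values, dtype=float)
    med = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    niqr = (q3 - q1) / 1.349                           # normalized IQR estimates sigma
    sigma_mad = 1.4826 * np.median(np.abs(x - med))    # scaled MAD estimates sigma
    return med, niqr, sigma_mad

# Hypothetical data set with one rogue high value
data = [5.2, 5.3, 5.1, 5.4, 5.2, 9.8]
print(robust_estimates(data))                 # robust location and spread
print(np.mean(data), np.std(data, ddof=1))    # classical mean and standard deviation
```

The rogue value inflates the classical mean and standard deviation considerably, while the robust estimates barely move.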

Robust methods have their place, particularly when we must keep all the data together in, for example, an interlaboratory comparison study where an outlying result from a laboratory cannot simply be ignored.  They are less strongly affected by extreme values.

However, robust estimators are not really the best statistics, and wherever possible the statistics appropriate to the distribution of the data should be used.

So, when can we use these robust estimators?

Robust estimators can be considered to provide good estimates of the parameters for the ‘good’ data in an outlier-contaminated data set.  They are appropriate when:

  • The data are expected to be normally distributed. Here, robust statistics give answers very close to the ordinary statistics.
  • The data are expected to be normally distributed but contaminated with occasional spurious values, which are regarded as unrepresentative or erroneous and are roughly symmetrically distributed around the population mean. Robust estimators are less affected by these extreme values and hence are useful here.

Remember that robust estimators are not recommended where the data set shows evidence of multi-modality or heavy skewing, especially when the data are expected to follow non-normal or skewed distributions, such as the binomial and Poisson distributions with low counts, the chi-squared distribution, etc., which generate extreme values with reasonable likelihood.

A few words about Measurement Bias

In metrology, error is defined as “the result of measurement minus a given true value of the measurand”. 

What is ‘true value’?

ISO 3534-2:2006 (3.2.5) defines the true value as the “value which characterizes a quantity or quantitative characteristic perfectly defined in the conditions which exist when that quantity or quantitative characteristic is considered”, and the Note 1 that follows suggests that this true value is a theoretical concept and generally cannot be known exactly.

In other words, when you are asked to analyze a certain analyte concentration in a given sample, the analyte present has a definite value in the sample, and what we do in the experiment is merely try to determine that particular value. No matter how accurate your method is and how many repeats you have done on the sample to get an average value, we can never be 100% sure at the end that this average value is exactly the true value in the sample. We are bound to have a measurement error!

Actually, in our routine analytical work, we encounter three types of error, known as gross, random and systematic errors.

A gross error, leading to a serious outcome with an unacceptable measurement, is committed through a serious mistake in the analysis process, such as using a titrant of the wrong concentration in a titration. It is so serious that there is no alternative but to abandon the experiment and make a completely fresh start.

Such blunders, however, are easily recognized if there is a robust QA/QC program in place, as the laboratory quality check samples with known or reference values (i.e. “true” values) will produce erratic results.

Secondly, when the analysis of a test method is repeated a large number of times, we get a set of variable data spreading around the average value of the results. It is interesting to see that the frequency of occurrence decreases for data further away from the average value. This is the characteristic of random error.

There are many factors that can contribute to random error: the ability of the analyst to exactly reproduce the testing conditions, fluctuations in the environment (temperature, pressure, humidity, etc.), rounding of arithmetic calculations, electronic noise in the instrument detector, and so on. The variation of these repeated results is referred to as the precision of the method.

Systematic error, on the other hand, is a consistent deviation from the true result; no matter how many times the analysis is repeated, the situation does not improve. It is also known as bias.

A technician with a color-vision deficiency might persistently overestimate the end point in a titration, the extraction of an analyte from a sample may be only 90% efficient, or the on-line derivatization step before analysis by gas chromatography may not be complete. In each of these cases, if the results were not corrected for the problem, they would always be wrong, and always wrong by about the same amount for a particular experiment.

How do we know that we have a systematic error in our measurement?  

It can be estimated by measuring a reference material a large number of times. The difference between the average of the measurements and the certified value of the reference material is the systematic error. It is important to know the sources of systematic error in an experiment and to minimize and/or correct for them as far as possible.

If you have tried your very best and the final average result is still significantly different from the reference or true value, you have to correct the reported result by multiplying it by a correction factor. If R is the recovery factor, calculated by dividing your average test result by the reference or true value, the correction factor is 1/R.
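
As a simple numerical illustration of this correction, the sketch below uses hypothetical figures for a certified reference value, five replicate results on the reference material, and a routine sample result to be corrected.

```python
# Hypothetical figures: certified value of the reference material and replicate results
reference_value = 50.0                      # e.g. mg/kg
results = [46.8, 47.5, 47.1, 46.9, 47.3]    # repeated measurements on the reference material

mean_result = sum(results) / len(results)
bias = mean_result - reference_value        # systematic error (bias)
R = mean_result / reference_value           # recovery factor
correction_factor = 1 / R

sample_result = 32.4                        # a routine sample result to be corrected
corrected_result = sample_result * correction_factor
print(f"bias = {bias:.2f}, R = {R:.3f}, corrected result = {corrected_result:.2f}")
```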

Today, there is another statistical term in use.  It is ‘trueness’.  

The measure of trueness is usually expressed in terms of bias.

Trueness in ISO 3534-2:2006 is defined as “the closeness of agreement between the expectation of a test result or a measurement result and a true value”, whilst ISO 15195:2018 defines trueness as “closeness of agreement between the average value obtained from a large series of results of measurements and a true value”. The ISO 15195 definition is quite similar to those of ISO 15971:2008 and ISO 19003:2006. The ISO 3534-2 definition includes a note that, in practice, an “accepted reference value” can be substituted for the true value.

Is there a difference between ‘accuracy’ and ‘trueness’?

The difference between ‘accuracy’ and ‘trueness’ is shown in their respective ISO definitions.

ISO 3534-2:2006 (3.3.1) defines ‘accuracy’ as “closeness of agreement between a test result or measurement result and true value”, whilst the same standard (3.2.5) defines ‘trueness’ as “closeness of agreement between the expectation of a test result or measurement result and true value”. What does the word ‘expectation’ mean here? It refers to the average of the test results, as reflected in the definition of ISO 15195:2018.

Hence, accuracy is a qualitative parameter whilst trueness can be quantitatively estimated through repeated analysis of a sample with certified or reference value.

References:

ISO 3534-2:2006   “Statistics – Vocabulary and symbols – Part 2: Applied statistics”

ISO 15195:2018   “Laboratory medicine – Requirements for the competence of calibration laboratories using reference measurement procedures”

In the next blog, we shall discuss how the uncertainty of bias is evaluated. It is an uncertainty component which cannot be overlooked in our measurement uncertainty evaluation, if present.

Why is measurement uncertainty important in analytical chemistry?

We conduct a laboratory analysis in order to make informed decisions about the samples drawn. The result of an analytical measurement can be deemed incomplete without a statement (or at least an implicit knowledge) of its uncertainty, because we cannot make a valid decision based on the result alone, and nearly all analysis is conducted to inform a decision.

We know that the uncertainty of a result is a parameter describing a range within which the value of the quantity being measured is expected to lie, taking into account all sources of error, with a stated level of confidence (usually 95%). It characterizes the extent to which the unknown value of the targeted analyte is known after measurement, taking account of the information given by the measurement.

With a knowledge of uncertainty in hand, we can make the following typical decisions based on analysis:

  • Does this particular laboratory have the capacity to perform analyses of legal and statutory significance?
  • Does this batch of pesticide formulation contain less than the maximum allowed concentration of an impurity?
  • Does this batch of animal feed contain at least the minimum required concentration of profat (protein + fat)?
  • How pure is this batch of precious metal?

The figure below shows a variety of instances affecting decisions about compliance with externally imposed limits or specifications.  The error bars can be taken as expanded uncertainties, effectively intervals containing the true value of the concentration of the analyte with 95% confidence.

We can make the following observations from the above illustration:

  1. Result A is clearly below the limit, as even the upper extremity of its uncertainty interval is below the limit.
  2. Result B is below the limit, but the upper end of its uncertainty interval is above the limit, so we are not sure that the true value is below the limit.
  3. Result C is above the limit, but the lower end of its uncertainty interval is below the limit, so we are not sure that the true value is above the limit.
  4. What conclusions can we draw from the equal results D and E? Both are above the limit but, while D is clearly above it, E is not, because its greater uncertainty interval extends below the limit.

In short, we have to decide how to act upon results B, C and E. What level of risk can we afford in assuming that the test result conforms to the stated specification or complies with the regulatory limit?

In setting such a decision rule, we must be serious about the evaluation of measurement uncertainty, making sure that the uncertainty obtained is reasonable. If not, any decision made on conformity or compliance will be meaningless.
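
A minimal sketch of such a decision logic, assuming an upper regulatory limit and a symmetric 95% expanded uncertainty U as described above, is shown below; the function name, results and limit are hypothetical.

```python
def assess_compliance(result, U, limit):
    """Classify a result against an upper limit using its expanded uncertainty U (95%)."""
    low, high = result - U, result + U
    if high < limit:
        return "clearly below the limit (case A)"
    if low > limit:
        return "clearly above the limit (case D)"
    if result <= limit:
        return "below the limit, but not conclusively so (case B)"
    return "above the limit, but not conclusively so (case C or E)"

# Hypothetical results (value, U) assessed against a limit of 10 units
for result, U in [(8.0, 1.0), (9.5, 1.0), (10.5, 1.0), (11.5, 1.0), (11.5, 2.0)]:
    print(result, U, "->", assess_compliance(result, U, 10.0))
```

How the inconclusive cases B, C and E are then reported is exactly the decision-rule question raised above.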

Initial data analysis (IDA)

Data analysis is a systematic process of examining datasets in order to draw valid conclusions from the information they contain, increasingly with the aid of specialized systems and software. It leads to the discovery of useful information that supports informed decisions to verify or disprove scientific or business models, theories or hypotheses.

As researchers or laboratory analysts, we must have the drive to obtain quality data in our work. A careful plan for database design and statistical analysis, with variable definitions, plausibility checks, data quality checks, and procedures for identifying likely errors and resolving data inconsistencies, has to be established before embarking on full data collection. More importantly, the plan should not be altered without the agreement of the project steering team, in order to reduce the extent of data dredging or hypothesis fishing leading to false-positive studies. Shortcomings in initial data analysis may result in adopting inappropriate statistical methods or drawing incorrect conclusions.

The first step of initial data analysis is to check the consistency and accuracy of the data, for example by looking for any outlying values. This can be visualized by plotting the data against the time of data collection or other independent parameters, and should be done before embarking on more complex analyses.

Having satisfied ourselves that the data are reasonably error-free, we should become familiar with the collected data and examine them for consistency of data formats, the number and pattern of missing values, the probability distributions of the continuous variables, etc. For more advanced initial analysis, decisions have to be made about the way variables are used in further analyses, with the aid of data analytics technologies or statistical techniques. These variables can be studied in their raw form, transformed to some standardized format, or categorized or stratified into groups for modeling.
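
As a rough illustration of these first checks, the Python sketch below assumes a hypothetical file results.csv with a timestamp column collected_on and a measured variable value; the file and column names are purely illustrative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data file with one record per measurement
df = pd.read_csv("results.csv", parse_dates=["collected_on"])

# Consistency and completeness checks
print(df.dtypes)               # are the variable formats as expected?
print(df.isna().sum())         # number and pattern of missing values
print(df["value"].describe())  # range and spread: any implausible or outlying values?

# Plot against the time of data collection to spot drift or outlying values
df.plot(x="collected_on", y="value", marker="o", linestyle="")
plt.show()
```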

Replication and successive dilution in constructing a calibration curve

An analytical instrument generally needs to be calibrated before measurements are made on prepared sample solutions, by constructing a linear regression between the analytical responses and the concentrations of the standard analyte solutions. A linear calibration is favored over a quadratic or exponential curve as it is simpler to fit and incurs minimal error.

Replication

Replication in standard calibration is useful if the replicates are genuinely independent. Calibration precision is improved by increasing the number of replicates, n, and replication provides additional checks on the preparation of the calibration solutions and on the precision at different concentrations.

The trend in precision can be read from the variances of the calibration points. One calibration curve might show roughly constant standard deviations at all plotted points, whilst another may show a proportional increase in standard deviation with increasing analyte concentration. The former behavior is known as “homoscedasticity” and the latter as “heteroscedasticity”.

It may be noted that increasing the number of independent concentration points brings little benefit beyond a certain extent. In fact, once there are about six calibration points, any further increase in the number of observations has a relatively modest effect on the standard error of prediction for a predicted x value, unless the number of points increases very substantially, say to 30, which of course is not practical.

Instead, independent replication at each calibration point can be recommended as a method of improving the uncertainties; it is a viable way of increasing n when the best performance is desired.
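
This diminishing return can be illustrated with the commonly quoted expression for the standard error of an x value predicted from an unweighted straight-line calibration, which is proportional to the square root of (1/m + 1/n + a leverage term), where m is the number of replicate measurements of the unknown and n the number of calibration observations. The sketch below, using hypothetical evenly spaced calibration levels, shows how slowly that multiplier shrinks as n increases.

```python
import numpy as np

def pred_se_factor(conc, m=1, rel_pos=0.25):
    """Multiplier of s_(y/x)/b in the standard error of a predicted x value
    (unweighted straight-line calibration, hypothetical design)."""
    x = np.asarray(conc, dtype=float)
    n = x.size
    sxx = np.sum((x - x.mean())**2)
    x0 = x.min() + rel_pos * (x.max() - x.min())   # position of the unknown in the range
    return np.sqrt(1/m + 1/n + (x0 - x.mean())**2 / sxx)

for n in (5, 6, 10, 30):
    levels = np.linspace(1, 16, n)                 # evenly spaced levels, e.g. mg/L
    print(n, round(pred_se_factor(levels), 3))
```

The 1/n term falls off only slowly, which is consistent with the observation above that going far beyond about six points brings modest benefit.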

However, replication suffers from an important drawback. Many analysts are inclined simply to inject a calibration standard solution twice, instead of preparing duplicate standard solutions separately for injection. When the same standard solution is injected twice into the analytical instrument, the plotted residuals will appear in close pairs and are clearly not independent. This is essentially useless for improving precision. Worse, it artificially increases the number of degrees of freedom for the simple linear regression, giving a misleadingly small prediction interval.

Therefore, ideally, replicated observations should be entirely independent, using different stock calibration solutions if at all possible. Otherwise it is best first to examine the replicated injections to check for outlying differences, and then to calculate the calibration based on the mean value of y for each distinct concentration.

There is one side effect of replication that may be useful. If means of replicates are taken, the distribution of errors in the mean tends toward the normal distribution as the number of replicates increases, regardless of the parent distribution. The distribution of the mean of as few as three replicates is very close to normal even with fairly extreme departures from normality. Averaging three or more replicates can therefore provide more reliable statistical inference in critical cases where non-normality is suspected.

Successive dilutions

A common calibration pattern that we usually practice is serial dilution, resulting in logarithmically decreasing concentrations (for example, 16, 8, 4, 2 and 1 mg/L). This is simple and has the advantage of providing a high upper calibrated level, which may be useful when analyzing routine samples that occasionally show high values.

However, this layout has several disadvantages. First, errors in dilution are multiplied at each step, increasing the volume uncertainties and, perhaps worse, increasing the risk of an undetected gross dilution error (especially if the analyst commits the cardinal sin of also using one of the calibration solutions as a QC sample!).

Second, the highest concentration point has high leverage, affecting both the gradient and the y-intercept of the fitted line; errors at the highest concentration can cause potentially large variations in results.

Third, departures from linearity are easier to detect with fairly evenly spaced points. In general, therefore, equally spaced calibration points across the range of interest are much to be preferred.

A few words on sampling

  1. What is sampling?

Sampling is a process of selecting a portion of material to represent, or provide information about, a larger body of material (statistically termed the ‘population’). It is essential to the whole testing and calibration process.

The old ISO/IEC 17025:2005 standard defines sampling as “a defined procedure whereby a part of a substance, material or product is taken to provide for testing or calibration of a representative sample of the whole.  Sampling may also be required by the appropriate specification for which the substance, material or product is to be tested or calibrated. In certain cases (e.g. forensic analysis), the sample may not be representative but is determined by availability.” 

In other words, sampling should in general be carried out in a random manner, but so-called judgement sampling is also allowed in specific cases. The judgement sampling approach involves using knowledge about the material to be sampled, and about the reason for sampling, to select specific samples for testing. For example, an insurance loss adjuster acting on behalf of a cargo insurance company to inspect a shipment of cargo damaged during transit will apply a judgement sampling procedure, selecting the worst damaged samples from the lot in order to determine the cause of the damage.

2. Types of samples to be differentiated

Field sample: Random sample(s) taken from the material in the field. Several random samples may be drawn and composited in the field before the composite sample is sent to the laboratory for analysis.

Laboratory sample: Sample(s) as prepared for sending to the laboratory, intended for inspection or testing.

Test sample: A sub-sample, i.e. a selected portion of the laboratory sample, taken for laboratory analysis.

3. Principles of sampling

Randomization

Generally speaking, random sampling is a method of selection whereby each possible member of a population has an equal chance of being selected, so that unintended bias is minimized. It provides an unbiased estimate of the population parameter of interest (e.g. the mean), normally in terms of analyte concentration.

Representative samples

“Representative” means something like “sufficiently like the population to allow inferences about the population”. Taking a single sample by any random process does not necessarily yield a composition representative of the bulk. It is entirely possible that the composition of a particular randomly selected sample is completely unlike the bulk composition, unless the population is very homogeneous in its composition (such as drinking water).

Remember the saying that a test result is no better than the sample it is based upon. The sample taken for analysis should be as representative of the sampling target as possible. Therefore, we must take the sampling variance into serious consideration: the larger the sampling variance, the more likely it is that individual samples will be very different from the bulk.

Hence, in practice, we must carry out representative sampling, which involves obtaining samples that are not only unbiased but also have a sufficiently small variance for the task in hand. In other words, in addition to choosing randomization procedures that give unbiased results, we need to decide on the number of random samples to be collected in the field so as to achieve a sufficiently small sampling variance. This is normally decided upon information such as the specification limits and the uncertainty expected.

Composite samples

Often it is useful to combine a collection of field samples into a single homogenized laboratory sample for analysis. The measured value for the composite laboratory sample is then taken as an estimate of the mean value for the bulk material.

The importance of a sound sub-sampling process in the laboratory cannot be over-emphasized. Hence, there must be an SOP to guide the laboratory analyst in drawing the test sample for measurement from the sample that arrives at the laboratory.

4. Sampling uncertainty

Today, sampling uncertainty is recognized as an important contributor to the measurement uncertainty associated with the reported results.

It is to be noted that sampling uncertainty cannot be estimated as a standalone entity; the analytical uncertainty has to be evaluated at the same time. For a fairly homogeneous population, a one-factor ANOVA (analysis of variance) will suffice to estimate the overall measurement uncertainty from the between-sample and within-sample variances. See https://consultglp.com/2018/02/19/a-worked-example-to-estimate-sampling-precision-measuremen-uncertainty/

However, for a heterogeneous population, such as the soil of a contaminated site, the variance between sampling locations has to be taken into account in addition to the sampling variance. The more complicated calculations involve the application of the two-way ANOVA technique. A Eurachem worked example can be found at: https://consultglp.com/2017/10/10/verifying-eurachems-example-a1-on-sampling-uncertainty/
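
For the simpler one-factor case, a minimal sketch of that calculation is given below; the duplicate results are hypothetical, and the usual relationships s²(analytical) = MS(within) and s²(sampling) = (MS(between) − MS(within))/n are assumed, with the sampling variance truncated at zero if the difference is negative.

```python
import numpy as np

# Hypothetical duplicate analyses on five field samples from the same sampling target
# rows: samples, columns: replicate analyses
data = np.array([
    [10.2, 10.4],
    [ 9.8, 10.1],
    [10.6, 10.5],
    [ 9.9, 10.0],
    [10.3, 10.6],
])
k, n_rep = data.shape
grand_mean = data.mean()

ms_between = n_rep * np.sum((data.mean(axis=1) - grand_mean)**2) / (k - 1)
ms_within = np.sum((data - data.mean(axis=1, keepdims=True))**2) / (k * (n_rep - 1))

s2_analytical = ms_within                                   # within-sample variance
s2_sampling = max((ms_between - ms_within) / n_rep, 0.0)    # between-sample variance

u_meas = np.sqrt(s2_analytical + s2_sampling)               # combined standard uncertainty
print(round(np.sqrt(s2_analytical), 3), round(np.sqrt(s2_sampling), 3), round(u_meas, 3))
```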

What is the P-value for in ANOVA?

In the analysis of variance (ANOVA), we study the between-group and within-group variations in terms of their respective mean squares (MS), which are calculated by dividing each sum of squares by its associated degrees of freedom. The result, although termed a mean square, is actually a measure of variance, i.e. the squared standard deviation.

The F-ratio is then obtained by dividing MS(between) by MS(within). Even if the population means are all equal to one another, you may get an F-ratio substantially larger than 1.0 simply because sampling error causes a large variation between the groups. Such an F-value may even exceed the F-critical value from the F probability distribution at the degrees of freedom associated with the two mean squares and the chosen Type I (alpha) level of error.

Indeed, by referring to the distribution of F-ratios with different degrees of freedom, you can determine the probability of observing an F-ratio as large as the one you calculate even if the populations have the same mean values.

So, the P-value is the probability of obtaining an F-ratio as large or larger than the one observed, assuming that the null hypothesis of no difference amongst group means is true.

However, under the ground rules that have been followed for many years in inferential statistics, this probability must be equal to or smaller than the significant alpha (Type I) error level that we established at the start of the experiment; this alpha level is normally set at 0.05 (or 5%) for test laboratories. Using this level of significance, there is, on average, a 1 in 20 chance that we shall reject the null hypothesis when it is in fact true.

Hence, if we were to analyze a set of data by ANOVA and the calculated P-value were 0.008, which is much smaller than the alpha value of 0.05, we could confidently reject the null hypothesis, accepting a risk of only 0.8% of rejecting a null hypothesis that is true. In other words, we are 99.2% confident in rejecting the hypothesis that there is no difference among the group means.
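
If you wish to check such a P-value yourself, it is simply the upper-tail probability of the F distribution at the observed F-ratio. In the SciPy sketch below the F-ratio of 5.9 and the degrees of freedom (3, 16) are hypothetical, chosen to mimic a four-group, five-replicate design.

```python
from scipy import stats

F = 5.9                                 # hypothetical observed F-ratio
df_between, df_within = 3, 16           # (k - 1) and k*(n - 1) degrees of freedom

p_value = stats.f.sf(F, df_between, df_within)       # upper-tail probability
F_crit = stats.f.ppf(0.95, df_between, df_within)    # critical value at alpha = 0.05
print(round(p_value, 4), round(F_crit, 3))
```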

Verifying Excel’s one-factor ANOVA results

In a one-factor ANOVA (analysis of variance), we examine the replicate results of each group under a single factor. For example, we may be evaluating the performance of four analysts who have each been asked to perform replicate determinations on identical samples drawn from a population. The factor is therefore “Analyst”, of which we have four so-called “groups”: Analyst A, Analyst B, Analyst C and Analyst D. The test data are laid out in a matrix with the four groups of the factor in the columns and the replicates (five in this example) in the rows, as shown in the following table:

We then use ANOVA’s principal inferential statistic, the F-test, to decide whether the differences between the group mean values are real in the population or simply due to random error in the sample analysis. The F-test examines the ratio of two estimates of variance; in this case, the variance between groups divided by the variance within groups.

The MS Excel installation includes an add-in, called the “Data Analysis” add-in (Analysis ToolPak), that performs several statistical analyses. If you do not find it in the Ribbon of your spreadsheet, you can make it available in Excel 2016 by clicking “File -> Options -> Add-ins -> Analysis ToolPak” and then confirming. You should then find Data Analysis in the Analysis group on Excel’s Data tab.

We can then click the “Data Analysis” button to look for “Anova: Single Factor” entry and start its dialog box accordingly.

For the above set of data from Analysts A to D, Excel’s Data Analysis gives the following outputs, which we shall verify through manual calculations from first principles:

We know that variances are defined and calculated using the squared deviations of a single variable. In Excel, we can use the formula “=DEVSQ( )” to calculate the sum of squared deviations within each group of results, and the “=AVERAGE( )” function to calculate the individual mean for each Analyst.

In this univariate ANOVA example, we square the deviation of each value from a mean, where “deviation” refers to the difference between each measurement result and the mean of the Analyst concerned.

The above figure shows that the manual calculations using the Excel formulae agree very well with the values calculated by Excel’s Data Analysis package. In words, we have:

  • The Sum of Squares Between (SSB) uses DEVSQ( ) to take the sum of the squared deviations of each group (Analyst) mean from the grand mean, multiplied by the number of replicates in each group;
  • The Sum of Squares Within (SSW) uses DEVSQ( ) on the replicates of each group (cells J13:M13) to get the sum of the squared deviations of each measured value from the mean of its group; the results from DEVSQ( ) are then totaled;
  • The Sum of Squares Total (SST) uses DEVSQ( ) to return the sum of the squared deviations of each measurement from the grand mean. We can also simply add SSB and SSW to give SST.

Subsequently, we can also verify Excel’s calculations of the F-value, the P-value and the F-critical value using the various formulae shown above.
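
For readers who prefer to cross-check outside Excel, the Python sketch below repeats the same first-principles calculation (SSB, SSW, SST, mean squares, F, P and F-critical) on hypothetical data for four analysts with five replicates each; the numbers are invented and are not those shown in the figure above.

```python
import numpy as np
from scipy import stats

# Hypothetical replicate results for four analysts (five replicates each)
groups = {
    "A": [10.2, 10.4, 10.1, 10.3, 10.5],
    "B": [10.0,  9.9, 10.1, 10.2, 10.0],
    "C": [10.6, 10.7, 10.5, 10.8, 10.6],
    "D": [10.1, 10.3, 10.2, 10.4, 10.2],
}
data = np.array(list(groups.values()))          # shape (k groups, n replicates)
k, n = data.shape
grand_mean = data.mean()

ssb = n * np.sum((data.mean(axis=1) - grand_mean)**2)         # between-group sum of squares
ssw = np.sum((data - data.mean(axis=1, keepdims=True))**2)    # within-group sum of squares
sst = np.sum((data - grand_mean)**2)                          # total; equals ssb + ssw

ms_between, ms_within = ssb / (k - 1), ssw / (k * (n - 1))
F = ms_between / ms_within
p_value = stats.f.sf(F, k - 1, k * (n - 1))
F_crit = stats.f.ppf(0.95, k - 1, k * (n - 1))

print(F, p_value, F_crit)
print(stats.f_oneway(*groups.values()))   # cross-check with SciPy's built-in one-way ANOVA
```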

We normally set the level of significance at P = 0.05 (or 5%), meaning that we are prepared to accept a 5% risk of wrongly rejecting the null hypothesis, which states that there is no difference amongst the mean values of the four Analysts. The calculated P-value of 0.008 indicates that our risk in rejecting the null hypothesis is only a low 0.8%.

Are your linear regression data homoscedastic or heteroscedastic?

In instrumental analysis, there must be a measurement model: an equation that relates the amounts to be measured to the instrument response, such as absorbance, transmittance, peak area, peak height, potential or current. From this model, we can then derive the calibration equation.

It is our usual practice to design the experiment in such a way that the standard concentration of the measurand and the instrument response are related by a simple linear relationship, i.e.,

y = a + bx                                                                   ………. [1]

where

y is the indication of the instrument (i.e., the instrument response),

x is the independent variable (i.e., mostly for our purpose, the concentration of the measurand)

and, 

a and b are the coefficients of the model, known as the intercept and slope (or gradient) of the curve, respectively.

Therefore, for a number of xi values, we will have the corresponding instrument responses, yi. We then fit the above model equation to the data.

As usual, any particular instrumental measurement of yi will be subject to measurement error (ei), that is,

yi = a + bxi + ei                                                                            …….. [2]

To fit this linear model, we have to find the line that best fits the data points obtained experimentally. We use the ordinary least squares (OLS) approach, which chooses the model parameters that minimize the residual sum of squares (RSS) of the predicted y values versus the actual (experimental) y values. The residual (sometimes called the error) is, in this case, the difference between the predicted yi value derived from the above equation and the experimental yi value.

So, when the line is fitted by ordinary least squares with an intercept, the sum of the residuals over all the points (x, y) on the plot is arithmetically equal to zero.

It must be stressed, however, that in treating the problem this way we make an important assumption: the uncertainty in the independent variable, xi, is very much less than that in the instrument response, so only the single error term ei in yi is considered, the uncertainty in x being small enough to be neglected. This assumption is indeed valid for our laboratory analytical purposes, and the estimation of measurement error is then very much simplified.
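
This zero-sum property of the residuals is easy to verify numerically; the short sketch below fits a straight line by ordinary least squares to a hypothetical calibration data set and prints the residuals and their sum.

```python
import numpy as np

# Hypothetical calibration data: concentrations (mg/L) and instrument responses
x = np.array([1, 2, 4, 8, 16], dtype=float)
y = np.array([0.12, 0.23, 0.44, 0.86, 1.70])

b, a = np.polyfit(x, y, 1)          # ordinary least squares: slope b, intercept a
residuals = y - (a + b * x)

print(residuals)
print(residuals.sum())              # essentially zero, apart from floating-point rounding
```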

What is another important assumption made in this OLS method? 

It is that the data are known to be homoscedastic, which means that the errors in y are assumed to be independent of the concentration.  In other words, the variance of y remains constant and does not change for each xi value or for a range of x values.   This also means that all the points have equal weight when the slope and intercept of the line are calculated. The following plot illustrates this important point.

However, in much of our chemical analysis this assumption is not likely to be valid. In fact, many data sets are heteroscedastic, i.e. the standard deviation of the y-values increases with the concentration of the analyte rather than remaining constant at all concentrations. In other words, the errors are approximately proportional to the analyte concentration, so that the relative standard deviations (the standard deviations divided by the mean values) are roughly constant. The following plot illustrates this particular scenario.

In this case, the weighted regression method should be applied. The regression line must be calculated so as to give additional weight to those points where the errors are smallest, i.e. it is more important for the calculated line to pass close to these points than to the points representing higher concentrations, which have the largest errors.

This is achieved by giving each point a weighting inversely proportional to the corresponding variance in the y-direction, si². Without going through the details of the calculations, which can be quite tedious and complex compared with the unweighted case, it suffices to say that, in instrumental calibration where the experimental points normally fit a straight line very well, the slope (b) and y-intercept (a) of the weighted line are remarkably similar to those of the unweighted line, and the two approaches give very similar values for the concentrations of samples within the linear range of the calibration.
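
For the record, the weighted slope and intercept can be computed in a few lines using weights wi = 1/si²; in the sketch below the concentrations, responses and standard deviations are hypothetical, and an unweighted fit is included for comparison.

```python
import numpy as np

# Hypothetical heteroscedastic calibration data: s_i grows roughly with concentration
x = np.array([1, 2, 4, 8, 16], dtype=float)         # concentrations, mg/L
y = np.array([0.11, 0.24, 0.45, 0.88, 1.72])         # mean instrument responses
s = np.array([0.005, 0.008, 0.015, 0.030, 0.060])    # y-direction standard deviations

w = 1.0 / s**2                                       # weights inversely proportional to variance

x_w = np.sum(w * x) / np.sum(w)                      # weighted means
y_w = np.sum(w * y) / np.sum(w)
b_w = np.sum(w * (x - x_w) * (y - y_w)) / np.sum(w * (x - x_w)**2)
a_w = y_w - b_w * x_w

b_u, a_u = np.polyfit(x, y, 1)                       # unweighted OLS for comparison
print(f"weighted:   slope = {b_w:.5f}, intercept = {a_w:.5f}")
print(f"unweighted: slope = {b_u:.5f}, intercept = {a_u:.5f}")
```

Consistent with the remarks above, the two lines are usually very similar; the real benefit of weighting appears in the confidence limits at low concentration.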

So, does it mean that, on the face of it, the weighted regression calculations have little value to us?

The answer is no.

In addition to providing results very similar to those obtained from the simpler unweighted regression method, the weighted approach gives more realistic estimates of the errors or confidence limits of the sample concentrations under study. It can be shown by calculation that weighted regression gives narrower confidence limits at low concentrations, with the confidence limits widening as the instrumental signal (such as absorbance) increases. A general form of the confidence limits for a concentration determined using a weighted regression line is shown in the sketch below:

These observations emphasize the particular importance of using weighted regression when the results of interest include those at low concentrations.  Similarly, detection limits may be more realistically assessed using the intercept and standard deviation obtained from a weighted regression graph.