Training and consultancy for testing laboratories.

Posts tagged ‘Linear regression’

Replication and successive dilution in constructing calibration curve

An analytical instrument generally needs to be calibrated before measurements are made on prepared sample solutions, by constructing a linear regression between the analytical responses and the concentrations of the standard analyte solutions. A linear function is preferred to a quadratic or exponential curve because it is simpler to fit and validate and, where the response is truly linear, gives the smallest prediction error.

Replication

Replication in standard calibration is useful if the replicates are genuinely independent. Increasing the number of replicates, n, improves the calibration precision and provides additional checks on the preparation of the calibration solutions and on the precision at different concentrations.

The trend in precision can be read from the variance of the calibration points. One calibration curve may show roughly constant standard deviations at all the plotted points, whilst another may show standard deviations that increase in proportion to the analyte concentration. The former behavior is known as “homoscedasticity” and the latter, “heteroscedasticity”.

It may be noted that increasing the number of independent concentration points brings little benefit beyond a certain extent. In fact, once there are six calibration points, it can be shown that any further increase in the number of observations has a relatively modest effect on the standard error of prediction for a predicted x value, unless the number of points increases very substantially, say to 30, which is of course impractical.
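To see why, recall that the standard error of an x value predicted from a calibration line is commonly estimated as sx0 = (sy/x / b) √(1/m + 1/n + (y0 − ȳ)2 / (b2 Sxx)), where m is the number of replicate measurements of the test sample and n the number of calibration points. A minimal Python sketch (with sy/x = b = 1 and a sample signal at the calibration centroid, purely illustrative assumptions) shows how weakly this quantity responds to n:

import numpy as np

def pred_std_error(x_cal, s_yx=1.0, b=1.0, m=1, y0_dev=0.0):
    # sx0 = (s_yx/b) * sqrt(1/m + 1/n + y0_dev**2 / (b**2 * Sxx))
    # s_yx, b and y0_dev are illustrative placeholders, not real data
    n = len(x_cal)
    sxx = np.sum((x_cal - np.mean(x_cal)) ** 2)
    return (s_yx / b) * np.sqrt(1/m + 1/n + y0_dev**2 / (b**2 * sxx))

for n in (6, 10, 30):
    x = np.linspace(1, 16, n)        # evenly spaced calibration levels
    print(n, round(pred_std_error(x), 3))
# 6 -> 1.08, 10 -> 1.049, 30 -> 1.017: five times as many points
# improves the standard error of prediction by only about 6 %.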

Instead, independent replication at each calibration point can be recommended as a method of reducing uncertainty; it is a viable way of increasing n when the best performance is desired.

However, replication suffers from an important drawback. Many analysts are inclined simply to inject a calibration standard solution twice, instead of separately preparing duplicate standard solutions for injection. When the same standard solution is injected twice into the analytical instrument, the plotted residuals appear in close pairs but are clearly not independent. This is essentially useless for improving precision. Worse, it artificially inflates the number of degrees of freedom for the simple linear regression, giving a misleadingly small prediction interval.

Therefore, ideally, replicated observations should be entirely independent, using different stock calibration solutions if at all possible. Otherwise it is best first to examine the replicated injections to check for outlying differences, and then to calculate the calibration from the mean value of y at each distinct concentration, as sketched below.
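As a minimal sketch of that fallback, assuming duplicate injections recorded as concentration–response pairs (the numbers below are made up), the responses can be averaged at each level before fitting:

import numpy as np

# Hypothetical duplicate injections: concentration (mg/L) and response
x = np.array([1, 1, 2, 2, 4, 4, 8, 8, 16, 16], dtype=float)
y = np.array([0.9, 1.1, 2.1, 1.9, 4.2, 3.8, 8.1, 7.9, 16.3, 15.7])

levels = np.unique(x)
y_mean = np.array([y[x == c].mean() for c in levels])   # mean response per level

b, a = np.polyfit(levels, y_mean, 1)    # slope and intercept fitted to the means
print(f"slope = {b:.3f}, intercept = {a:.3f}")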

There is one side effect of replication that may be useful. If means of replicates are taken, the distribution of errors in the mean tends towards the normal distribution as the number of replicates increases, regardless of the parent distribution. The distribution of the mean of as few as three replicates is very close to normal even when the parent distribution departs fairly markedly from normality. Averaging three or more replicates can therefore provide more reliable statistical inference in critical cases where non-normality is suspected.
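A quick simulation illustrates this; here an exponential parent distribution is assumed purely because it is strongly right-skewed:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
draws = rng.exponential(scale=1.0, size=(100_000, 3))   # skewed parent distribution

print(f"skewness of single values: {skew(draws[:, 0]):.2f}")          # about 2
print(f"skewness of means of three: {skew(draws.mean(axis=1)):.2f}")  # about 1.15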

Successive dilutions

A common calibration pattern in practice is serial dilution, which results in logarithmically decreasing concentrations (for example, 16, 8, 4, 2 and 1 mg/L). This is simple and has the advantage of providing a high upper calibrated level, which may be useful in analyzing routine samples that occasionally show high values.

However, this layout has several disadvantages. First, dilution errors are multiplied at each step, increasing the volume uncertainties and, perhaps worse, increasing the risk of an undetected gross dilution error (especially if the analyst commits the cardinal sin of also using one of the calibration solutions as a QC sample!).

Second, the highest concentration point has high leverage, affecting both the gradient and the y-intercept of the fitted line; errors at the highest concentration can therefore cause large variations in the results.

Third, departures from linearity are easier to detect with fairly evenly spaced points. In general, therefore, equally spaced calibration points across the range of interest are much to be preferred, as the leverage comparison below illustrates.
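The leverage of calibration point i in simple linear regression is hi = 1/n + (xi − x̄)2/Sxx. A short sketch comparing the serially diluted levels above with five evenly spaced levels over the same range:

import numpy as np

def leverages(x):
    # hi = 1/n + (xi - xbar)^2 / Sxx for simple linear regression
    x = np.asarray(x, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)
    return 1 / len(x) + (x - x.mean()) ** 2 / sxx

serial = [1, 2, 4, 8, 16]             # serial dilution
even = np.linspace(1, 16, 5)          # evenly spaced over the same range

print(leverages(serial).round(2))     # [0.38 0.32 0.23 0.22 0.85]
print(leverages(even).round(2))       # [0.6  0.3  0.2  0.3  0.6 ]
# The top standard of the serial dilution dominates the fit (h = 0.85),
# while the evenly spaced design shares the leverage more equally.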

Are your linear regression data homoscedastic or heteroscedastic?

In instrumental analysis there must be a measurement model, an equation that relates the amount to be measured to the instrument response, such as absorbance, transmittance, peak area, peak height, potential, current, etc. From this model we can then derive the calibration equation.

It is our usual practice to perform the experiment in such a way that the standard concentration of the measurand and the instrument response follow a simple linear relationship, i.e.,

y = a + bx          [1]

where

y is the indication of the instrument (i.e., the instrument response),

x is the independent variable (i.e., for our purposes, mostly the concentration of the measurand),

and, 

a and b are the coefficients of the model, known as the intercept and slope (or gradient) of the curve, respectively.

Therefore, for a number of xi values, we will have the corresponding instrument responses, yi. We then fit the above model equation to the data.

As usual, any particular instrumental measurement of yi will be subject to measurement error (ei), that is,

yi = a + bxi + ei          [2]

To obtain this linear model, we have to find the line that best fits the data points obtained experimentally. We use the ordinary least squares (OLS) approach, which chooses the model parameters that minimize the residual sum of squares (RSS) between the predicted y values and the actual (experimental) y values. The residual (sometimes called the error) is, in this case, the difference between the yi value predicted from the above equation and the experimental yi value.

So, if the linear model is correct and fitted with an intercept, the sum of the residuals over all the points (x, y) on the plot will be arithmetically equal to zero.
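A minimal numerical check of this property, using made-up calibration data:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])    # standard concentrations (made up)
y = np.array([1.1, 1.9, 4.2, 7.8, 16.1])    # instrument responses (made up)

b, a = np.polyfit(x, y, 1)                  # OLS slope and intercept
residuals = y - (a + b * x)

print(f"slope b = {b:.4f}, intercept a = {a:.4f}")
print(f"sum of residuals = {residuals.sum():.2e}")   # zero up to rounding error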

It must be stressed, however, that for the above statement to be true we make an important assumption: the uncertainty in the independent variable, xi, is very much smaller than that in the instrument response, so that only the one error term, ei, in yi is considered, the error in xi being sufficiently small to be neglected. This assumption is indeed valid for our laboratory analytical purposes, and it greatly simplifies the estimation of measurement error.

What is another important assumption made in this OLS method? 

It is that the data are homoscedastic, which means that the errors in y are assumed to be independent of the concentration. In other words, the variance of y remains constant and does not change with each xi value or over a range of x values. This also means that all the points carry equal weight when the slope and intercept of the line are calculated. The following plot illustrates this important point.

However, in much of our chemical analysis this assumption is unlikely to be valid. In fact, many data sets are heteroscedastic, i.e. the standard deviation of the y values increases with the concentration of the analyte rather than remaining constant at all concentrations. In other words, the errors are approximately proportional to the analyte concentration, and we find that the relative standard deviations (the standard deviations divided by the mean values) are roughly constant. The following plot illustrates this particular scenario.

In this case, the weighted regression method should be applied. The regression line must be calculated so as to give additional weight to the points where the errors are smallest, i.e. it is more important for the calculated line to pass close to those points than to the points representing higher concentrations with the largest errors.

This is achieved by giving each point a weighting inversely proportional to the corresponding variance in the y-direction, si2. Without going through the details of the calculations, which can be quite tedious and complex compared with the unweighted ones, suffice it to say that in our case of instrumental calibration, where the experimental points normally fit a straight line very well, the slope (b) and y-intercept (a) of the weighted line are remarkably similar to those of the unweighted line, and the two approaches give very similar values for the concentrations of samples within the linear range of the calibration.
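A minimal sketch of such a weighted fit, assuming the standard deviation si at each level is known from replicates (here simply modelled as rising with the signal, an illustrative assumption) and using weights proportional to 1/si2:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])    # concentrations (made up)
y = np.array([1.0, 2.1, 4.1, 7.8, 16.4])    # mean responses (made up)
s = 0.02 * y + 0.01                          # sd assumed to rise with the signal

w = 1 / s**2                                 # weights inversely proportional to si^2
xw = np.average(x, weights=w)                # weighted centroid
yw = np.average(y, weights=w)

b = np.sum(w * (x - xw) * (y - yw)) / np.sum(w * (x - xw) ** 2)
a = yw - b * xw
print(f"weighted slope = {b:.4f}, weighted intercept = {a:.4f}")
# np.polyfit(x, y, 1, w=1/s) gives the same line, since polyfit
# multiplies the residuals by w before squaring.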

So, does it mean that, on the face of it, the weighted regression calculations have little value to us?

The answer is no.

In addition to providing results very similar to those obtained from the simpler unweighted regression method, weighted regression gives more realistic estimates of the errors or confidence limits of the sample concentrations under study. It can be shown by calculation that the weighted regression yields narrower confidence limits at low concentrations, with the confidence limits widening as the instrumental signal, such as absorbance, increases. A general form of the confidence limits for a concentration determined using a weighted regression line is shown in the sketch below:

These observations emphasize the particular importance of using weighted regression when the results of interest include those at low concentrations.  Similarly, detection limits may be more realistically assessed using the intercept and standard deviation obtained from a weighted regression graph.

Standard additions in instrumental calibration


When there is a significant difference between the matrix of the working calibration standard solutions for instrumental analysis and the sample matrix, the matrix background interference cannot be overlooked if a calibration curve prepared from ‘pure’ standard solutions is used. The standard additions method of instrumental calibration is then a good option. Read on …..

Calibration by standard additions method

Improving uncertainty of linear calibration experiments

ANOVA and regression calculations

A linear regression approach to check bias between methods – Part II

A Worked Example

Suppose that we determined the uranium content of 14 stream water samples by a well-established laboratory method and by a newly developed hand-held rapid field method…..


A linear regression approach to check bias between methods – Part I

Linear regression is used to establish a relationship between two variables. In analytical chemistry, linear regression is commonly used in the construction of calibration curves for analytical instruments in, for example, gas and liquid chromatographic and many other spectrophotometric analyses….


Linear calibration curve – two common mistakes

Generally speaking, linear regression is used to establish or confirm a relationship between two variables. In analytical chemistry, it is commonly used in the construction of calibration functions required for techniques such as GC, HPLC, AAS, UV-Visible spectrometry, etc., where a linear relationship is expected between the instrument response (dependent variable) and the concentration of the analyte of interest.

The term ‘dependent variable’ is used for the instrument response because the value of the response depends on the value of the concentration. The dependent variable is conventionally plotted on the y-axis of the graph (scatter plot) and the known analyte concentration (independent variable) on the x-axis, to see whether a relationship exists between these two variables.

In chemical analysis, confirming such a relationship between these two variables is essential, and this can be established in the form of an equation. The other aspects of the calibration can then proceed.

The general equation which describes a fitted straight line can be written as:

y = a + bx

where b is the gradient of the line and a its intercept with the y-axis. The least-squares linear regression method is normally used to establish the values of a and b. The ‘best fit’ line obtained from least-squares linear regression is the line which minimizes the sum of the squared differences between the observed (or experimental) and line-fitted values of y.

The signed difference between an observed value (y) and the fitted value (ŷ) is known as a residual. The most common form of regression is that of y on x. This carries an important assumption, i.e. that the x values are known exactly, without uncertainty, and that the only error occurs in the measurement of y.

Two mistakes are so common in routine application of linear regression that it is worth describing them here so that they can be avoided:

  1. Incorrectly forcing the regression through zero

Some instrument software allows a regression to be forced through zero (for example, by specifying removal of the intercept or by ticking a ‘Constant is zero’ option).

This is valid only with good evidence to support its use, for example if it has previously been shown that the y-intercept is not statistically significant. Otherwise, interpolated values at the ends of the calibration range will be incorrect, and the error can be very serious near zero. One way of gathering such evidence is sketched below.
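As a sketch of the kind of evidence required, the significance of the intercept can be checked with a t-test, using the standard error of the intercept SE(a) = sy/x √(Σxi2 / (n Sxx)) and made-up data:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])    # made-up calibration data
y = np.array([1.15, 2.05, 4.10, 7.95, 16.20])

n = len(x)
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s_yx = np.sqrt(np.sum(resid**2) / (n - 2))          # residual standard deviation
sxx = np.sum((x - x.mean()) ** 2)
se_a = s_yx * np.sqrt(np.sum(x**2) / (n * sxx))     # standard error of the intercept

t = a / se_a
t_crit = stats.t.ppf(0.975, df=n - 2)               # two-tailed test at 95 % level
print(f"t = {t:.2f}, critical t = {t_crit:.2f}")
# Forcing the line through zero is defensible only if |t| < critical t.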

  2. Including the point (0,0) in the regression when it has not been measured

Sometimes it is argued that the point (x = 0, y = 0) should be included in the regression, usually on the grounds that y = 0 is the expected response at x = 0. This is bad practice and should not be allowed at all: it amounts to cooking up a data point that was never measured.

Adding an arbitrary point at (0,0) pulls the fitted line closer to (0,0), making the line fit the data more poorly near zero and also making it more likely that a real non-zero intercept will go undetected (because the calculated y-intercept will be smaller).

The only circumstance in which a point (0,0) can validly be added to the regression data set is when a standard at zero concentration has been included and the observed response is either zero or too small to detect, so that it can reasonably be interpreted as zero.
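A small sketch with made-up data shows how appending an unmeasured (0,0) point shrinks the apparent intercept and masks a real offset:

import numpy as np

x = np.array([2.0, 4.0, 8.0, 12.0, 16.0])   # made-up data with a real offset
y = np.array([2.6, 4.5, 8.4, 12.6, 16.5])

b1, a1 = np.polyfit(x, y, 1)
b0, a0 = np.polyfit(np.append(x, 0.0), np.append(y, 0.0), 1)  # fabricated (0,0)

print(f"without (0,0): intercept = {a1:.3f}")   # about 0.54
print(f"with    (0,0): intercept = {a0:.3f}")   # about 0.31 - real offset masked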


How to evaluate outliers in regression?


Common mistakes in application of linear regression
