Training and consultancy for testing laboratories.

Archive for February, 2019

Proficiency testing – 2

Proficiency testing – what, why and how (Part II)

Part I of this article series discussed the rationale for conducting proficiency testing (PT) programs and the main proficiency assessment tools, namely setting an assigned value and estimating its standard deviation to reflect inter-laboratory variation. Let’s see how PT results are scored.

  1. The z-score

The most common scoring system is the z-score. For a proficiency test result $x_i$, it is calculated as:

$$z_i = \frac{x_i - x_a}{\sigma_{pt}}$$

where $x_a$ is the assigned value and $\sigma_{pt}$ is the standard deviation for proficiency assessment.

Readers familiar with the normal probability distribution will recognize this z-score as the usual standardization: normally distributed data are transformed to the standard normal distribution N(0, 1), with mean 0 and variance $\sigma^2 = 1$.

The conventional interpretation of z-scores is as follows:

  • A result that gives | z | ≤ 2 is considered to be acceptable;
  • A result that gives 2 < | z | < 3 is considered to give a warning signal;
  • A result that gives | z | ≥ 3 is considered to be unacceptable or unsatisfactory performance (or action signal).
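
The scoring and interpretation rules above can be sketched in a few lines of Python. This is only an illustration: the function names and the example figures (an assigned value of 10.0 and a standard deviation for proficiency assessment of 0.5) are hypothetical and are not taken from any particular PT scheme.

```python
def z_score(x_i, x_a, sigma_pt):
    """z-score of result x_i against assigned value x_a and sigma_pt."""
    if sigma_pt <= 0:
        raise ValueError("sigma_pt must be positive")
    return (x_i - x_a) / sigma_pt


def classify(z):
    """Conventional interpretation: |z| <= 2, 2 < |z| < 3, |z| >= 3."""
    magnitude = abs(z)
    if magnitude <= 2:
        return "acceptable"
    if magnitude < 3:
        return "warning"
    return "unacceptable"


# Hypothetical round: assigned value 10.0, sigma_pt 0.5 (illustrative only).
for result in (10.4, 11.2, 8.4):
    z = z_score(result, x_a=10.0, sigma_pt=0.5)
    print(f"result={result}  z={z:+.2f}  {classify(z)}")
```

With these made-up figures the three results fall into the acceptable, warning and unacceptable bands respectively.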

Assuming all participants perform exactly in accordance with the performance requirements, the normal distribution implies that about 95% of values are expected to lie within two standard deviations of the mean. In other words, there is only about a 5% chance that a valid result would fall more than two standard deviations from the mean.

The probability of finding a valid result more than three standard deviations away from the mean is very low (approximately 0.3% for a normal distribution). Therefore, a score of | z | ≥ 3 is considered unsatisfactory performance. Also, participants should be advised to check their measurement procedures following warning signals in case they indicate an emerging or recurrent problem.
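
The two probabilities quoted above can be checked with the Python standard library. The sketch below uses the identity P(|Z| > k) = erfc(k/√2) for a standard normal variable Z; the function name is a hypothetical one chosen for illustration.

```python
import math


def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2.0))


print(f"P(|z| > 2) = {two_sided_tail(2):.4f}")  # roughly 0.0455, i.e. about 5%
print(f"P(|z| > 3) = {two_sided_tail(3):.4f}")  # roughly 0.0027, i.e. about 0.3%
```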

  2. The z’-score

When there is concern about the uncertainty of the assigned value, $u(x_a)$ (e.g. when $u(x_a) > 0.3\sigma_{pt}$), that uncertainty can be taken into account by expanding the denominator of the performance score.

This statistic, called a z’-score, is calculated as follows:

$$z'_i = \frac{x_i - x_a}{\sqrt{\sigma_{pt}^2 + u^2(x_a)}}$$

The criteria for assessing a laboratory’s performance with z’-scores are the same as those for z-scores.
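
As a rough illustration of the z’-score, the sketch below widens the denominator with the standard uncertainty of the assigned value and checks the 0.3 threshold mentioned above. The function names and the numbers are hypothetical.

```python
import math


def z_prime_score(x_i, x_a, sigma_pt, u_xa):
    """z'-score: the denominator combines sigma_pt with u(x_a)."""
    return (x_i - x_a) / math.sqrt(sigma_pt ** 2 + u_xa ** 2)


def uncertainty_is_a_concern(u_xa, sigma_pt):
    """True when u(x_a) exceeds 0.3 * sigma_pt, the example threshold in the text."""
    return u_xa > 0.3 * sigma_pt


# Hypothetical figures: sigma_pt = 0.3 and u(x_a) = 0.4, so u(x_a) is not negligible.
x_i, x_a, sigma_pt, u_xa = 11.0, 10.0, 0.3, 0.4
if uncertainty_is_a_concern(u_xa, sigma_pt):
    print(f"z' = {z_prime_score(x_i, x_a, sigma_pt, u_xa):.2f}")
else:
    print(f"z  = {(x_i - x_a) / sigma_pt:.2f}")
```

In this made-up case the plain z-score would be about 3.3, whereas the z’-score is about 2.0, which shows how a poorly known assigned value softens the score.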

  3. The Q-score

An alternative scoring system is the Q-score, where:

$$Q_i = \frac{x_i - x_a}{x_a}$$

This is essentially a relative measure of the laboratory’s bias and does not take the target standard deviation into account. Ideally, when there is no significant bias in the participants’ measurement results, the Q-scores will be distributed around zero.

Interpretation of the results is based on the acceptability criteria set by the PT program organizer, such as an acceptable percentage deviation from the target value.
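
A minimal sketch of the Q-score as a relative deviation from the assigned value follows. The 5% acceptance limit is purely hypothetical; in practice the limit is whatever the PT organizer specifies.

```python
def q_score(x_i, x_a):
    """Relative deviation of result x_i from the assigned value x_a."""
    if x_a == 0:
        raise ValueError("assigned value must be non-zero")
    return (x_i - x_a) / x_a


def within_limit(q, max_relative_deviation):
    """Compare |Q| with an organizer-defined limit (hypothetical here)."""
    return abs(q) <= max_relative_deviation


q = q_score(10.4, 10.0)
print(f"Q = {q:+.1%}, acceptable at a 5% limit: {within_limit(q, 0.05)}")
```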

Issues to be considered by a laboratory that receives an unsatisfactory score:

  1. Look at the overall performance of all participants in this round.  If a large number of them obtained unsatisfactory results, the problem might not lie within your laboratory.
  2. Did you use a test method with very different performance criteria from those used by the others?
  3. Also look at the test sample.  Did the test material sent by the PT organizer differ significantly from the scope of the laboratory’s normal operation? Were the sample storage conditions compromised in this round of testing?
  4. On the PT scheme itself, were there enough participants in this comparison exercise?  A small number of results may render the statistical conclusion unreliable.

If none of the above applies, the laboratory shall investigate the cause of the unsatisfactory result, then implement and document the appropriate corrective actions.

There are many possible causes of unsatisfactory performance. Some of these are listed below:

  • Incorrect calibration of the instrument;
  • Analytical instrument performance not optimized, such as an inappropriate choice of element wavelength in ICP analysis, leading to interference from other elements;
  • Analytical errors such as too much or too little dilution;
  • Errors in critical steps of sample pre-treatment, such as incomplete extraction of the analyte from the sample or improper handling of the clean-up process;
  • An improper choice of test method compared with the methods used by others;
  • Calculation errors or transcription errors;
  • Results reported in incorrect units.

In conclusion, participation in well-run PT schemes lets us learn how our measurement results compare with those of others, whether our own measurements improve or deteriorate with time, and how our laboratory’s performance compares with an external quality standard.  In short, the aim of such schemes is to evaluate the competence of analytical laboratories. A PT program indeed plays an important part in a laboratory’s QA/QC system.

Proficiency testing – 1

Proficiency testing – what, why and how (Part I)

We may claim to data users that our analytical results are reliable and accurate, but nothing supports that claim better than reports from the proficiency testing (PT) programs we have participated in, which testify to our good quality standing. So, what is proficiency testing?

ISO 13528:2015 defines proficiency testing as “evaluation of participant performance against pre-established criteria by means of interlaboratory comparisons”.

A proficiency testing program therefore typically involves the simultaneous distribution of sufficiently homogeneous and stable test samples to the participating laboratories. It is usually organized by an independent PT provider, an organization that takes responsibility for all tasks in the development and operation of the proficiency testing scheme.

The participants are to analyze the samples using either a method of their choice or a specified standard method, and submit their results to the scheme organizers, who will then carry out statistical analysis of all the data and prepare a final PT report showing the ‘scores’ of all participants to allow them to judge their own performance in that particular round. Ideally, all participants should conduct the testing with the same standard method for more meaningful comparison of results.

Each score reflects the difference between a participant’s result and the target or assigned value for that round, scaled by a quality target that is usually expressed as a standard deviation. This scaling is important because it makes allowance for measurement error. The scoring system should set acceptability criteria to allow participants to evaluate their performance.

The primary aim of PT is therefore to allow participating laboratories to monitor and optimize the quality of their routine analytical measurements. Note that PT is concerned with the assessment of participant performance and, as such, does not specifically address bias or precision.

There are several international guidelines and standards covering the organization of a PT round and the statistical analysis of inter-laboratory comparison data. The best known are ISO/IEC 17043:2010, Conformity assessment – General requirements for proficiency testing, which describes the general methods used in proficiency testing schemes, and ISO 13528:2015, Statistical methods for use in proficiency testing by interlaboratory comparison, which provides statistical support for its implementation.

How is a PT program organized?

There are two important steps in organizing a PT scheme: first, specifying the assigned value for the samples to be analyzed and, second, setting the standard deviation for proficiency assessment.

The organizer has to decide whether the assigned value and the criterion for assessing deviations should be independent of participant results or derived from the results submitted. In general, choosing assigned values and assessment criteria independently of participant results offers advantages.

These two choices directly affect the scores that participants receive and therefore how they will interpret their performance in the scheme.

  1. Setting the assigned value

The assigned value is the value attributed to a particular quantity being measured. It is accepted by the scheme organizer as having a suitably small uncertainty that is appropriate for a given purpose.

There are a number of approaches to obtain the assigned value:

  • Obtained by formulation, by adding a known amount or concentration of the target analyte to a base material containing none of that analyte (or only a trace, well-characterized amount).
  • Taken as the certified reference value, when the test material is a certified reference material (CRM).
  • Being a reference value determined by a single expert laboratory using a primary or classical method of analysis (e.g., gravimetry, titrimetry, isotope dilution mass spectrometry), or a fully validated test method which has been calibrated with CRMs.
  • Obtained from consensus of a number of expert laboratories after having analyzed the material using suitable methods.
  • Taken from consensus of the particular program participants. The consensus value in this case is usually based on a robust estimate of the mean of all the participating laboratories, in order to minimize the effect of extreme values in the data set.

  2. Estimating the standard deviation for proficiency assessment

The standard deviation for proficiency assessment is set by the scheme organizer. It is intended to represent the uncertainty regarded as fit for purpose for a particular type of analysis. Ideally, the basis for setting the standard deviation should remain the same over successive rounds of the PT program so that interpretation of performance scores is consistent over different rounds.

Also, due allowance for changes in performance at different analyte concentrations is usually made to make it easier for participants to monitor their performance over time.

There are a number of different approaches to setting the standard deviation for proficiency assessment.

a)   Using the repeatability and reproducibility standard deviations from a previous collaborative study of precision of a test method

This approach bases the standard deviation for proficiency assessment on the results of a previous collaborative study of the same analytical method, where such a study exists.  The reproducibility and repeatability estimates are taken from that study.

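The standard deviation for proficiency assessment, $\sigma_{pt}$, is given by

$$\sigma_{pt} = \sqrt{\sigma_R^2 - \sigma_r^2\left(1 - \frac{1}{m}\right)}$$

where $\sigma_R$ is the reproducibility standard deviation, $\sigma_r$ is the repeatability standard deviation, and $m$ is the number of replicate measurements each participant is to perform in the round.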
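
The calculation can be sketched as follows, assuming the form used in ISO 13528, in which the between-laboratory variance (reproducibility variance minus repeatability variance) is combined with the repeatability variance divided by the number of replicates m. The function name and the precision figures are hypothetical.

```python
import math


def sigma_pt_from_precision(sigma_R, sigma_r, m):
    """sigma_pt from reproducibility (sigma_R), repeatability (sigma_r)
    and the number m of replicate measurements per participant."""
    if m < 1:
        raise ValueError("m must be at least 1")
    if sigma_R < sigma_r:
        raise ValueError("sigma_R is expected to be at least sigma_r")
    between_lab_variance = sigma_R ** 2 - sigma_r ** 2
    return math.sqrt(between_lab_variance + sigma_r ** 2 / m)


# Hypothetical precision data from a collaborative study.
for m in (1, 2, 4):
    print(m, round(sigma_pt_from_precision(sigma_R=0.5, sigma_r=0.3, m=m), 4))
```

With a single measurement per participant (m = 1) the result is simply the reproducibility standard deviation; it shrinks as more replicates are averaged.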


b)   By experience from previous rounds of a proficiency testing scheme

This is determined by experience with previous rounds of proficiency testing for the same analyte with comparable concentrations, and where participants use compatible measurement procedures. Such evaluations will be based on reasonable performance expectations.

c)   By ‘perception’

The standard deviation is chosen to ensure that laboratories that obtain a satisfactory score are producing results that are fit for a particular purpose, such as being related to a legislative requirement.  It can also be set to reflect the perceived performance of laboratories or to reflect the performance that the PT organizer and participants would like to be able to achieve.

d)   From data obtained in the same round of a PT program

With this approach, the standard deviation for proficiency assessment, $\sigma_{pt}$, is calculated from the results of participants in the same round of the proficiency testing scheme.

This approach is relatively simple and has been conventionally accepted due to successful use in many situations. The data from the PT program are assessed using the robust mean of participant results as the assigned value.

When this approach is used it is usually most convenient to use a performance score such as the z score.
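
As a rough sketch of this approach, the example below derives a robust assigned value and a robust standard deviation from the participants’ own results and then z-scores each result. The article does not say which robust estimator is meant; the median and the scaled median absolute deviation (MADe = 1.483 × MAD) are used here as one simple option, and ISO 13528 also describes an iterative alternative (Algorithm A). The function name and the data are hypothetical.

```python
import statistics


def robust_location_and_scale(results):
    """Median and scaled median absolute deviation (MADe) of the results."""
    if len(results) < 2:
        raise ValueError("need at least two results")
    median = statistics.median(results)
    mad = statistics.median([abs(x - median) for x in results])
    return median, 1.483 * mad


# Hypothetical round with one extreme value (14.5).
results = [9.8, 10.1, 10.0, 9.9, 10.2, 14.5]
x_a, sigma_pt = robust_location_and_scale(results)
print(f"assigned value = {x_a:.3f}, sigma_pt = {sigma_pt:.3f}")
for x in results:
    print(f"{x:5.1f}  z = {(x - x_a) / sigma_pt:+.2f}")
```

Because the median and MADe are barely affected by the extreme value, the outlying result receives a very large z-score while the other participants are scored sensibly.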

e)   From a general model, the Horwitz function

The Horwitz function is an empirical relationship based on statistics from a very large number of collaborative studies of chemical applications over an extended period of time.  It describes how the reproducibility standard deviation, $\sigma_R$, varies with the analyte concentration level:

$$\sigma_R = 0.02\,c^{0.8495}$$

where $c$ is the concentration of the chemical species to be determined, expressed as a mass fraction, 0 ≤ c ≤ 1 (e.g. 1 mg/kg = $10^{-6}$).

This approach, however, does not reflect the true reproducibility of certain test materials, but it is useful when there are too few participants for any other, more meaningful statistical comparison.
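
For illustration, the sketch below evaluates the Horwitz function in its commonly quoted form, with the reproducibility standard deviation equal to 0.02 times c to the power 0.8495 and c expressed as a mass fraction. It assumes that form applies across the range shown; modified versions exist for very low and very high concentrations and are not covered here. The function name is hypothetical.

```python
def horwitz_sigma_R(c):
    """Predicted reproducibility standard deviation (as a mass fraction)."""
    if not 0 <= c <= 1:
        raise ValueError("c must be a mass fraction between 0 and 1")
    return 0.02 * c ** 0.8495


# Relative standard deviation predicted at a few concentration levels.
for label, c in (("1 mg/kg", 1e-6), ("0.1 %", 1e-3), ("100 %", 1.0)):
    sigma_R = horwitz_sigma_R(c)
    print(f"{label:>8}: sigma_R = {sigma_R:.3e}, RSD = {sigma_R / c:.1%}")
```

At 1 mg/kg the predicted relative standard deviation is about 16%, falling to 2% for a pure substance, which illustrates how relative reproducibility worsens as the concentration decreases.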

The next article will discuss the assessment scoring systems commonly adopted in proficiency testing programs.

What are the types of precision estimates?

Accuracy, Precision & Trueness


When we evaluate the validity of a test result, we are mostly concerned with whether the test method used is precise and reproducible enough to be fit for a particular purpose or to meet the customer’s requirements.  In some cases, that concern also includes whether the method detection limit is low enough to meet the required regulatory or specification limits.