Posts tagged ‘Type I and II errors’
All testing and calibration laboratories accredited under ISO/IEC 17025:2017 are required to prepare and implement a set of decision rules when the customer requests a statement of conformity in the test or calibration report issued.
As the word “conformity” is defined as “compliance with standards, rules and laws”, a statement of conformity is an expression that clearly describes the state of compliance or non-compliance to a specification, standard, regulatory limits or requirements, after calibration or testing.
Like any decision, this one carries a certain amount of risk, because you might decide wrongly. So, how much risk can you comfortably accept when you issue a statement of conformity in your test or calibration report?
Generally, decision rules give a prescription for the acceptance or rejection of a product based on:
- its uncertainty due to inherent errors (random and/or systematic)
- the specification (or regulatory) limit or limits, and,
- the acceptable risk level based on the probability of making a wrong decision
Certainly, you want to minimize your risk of issuing a statement of conformity that is later proven wrong by others. But what type of risk are you addressing when making such a decision rule? In short, it is either
- the supplier’s (laboratory’s) risk (statistically speaking, false positive or Type I error, alpha) or
- the consumer’s (customer’s) risk (false negative or Type II error, beta).
From the laboratory service point of view, you should be interested in the Type I (alpha) error to protect your own interest.
Before going further in the discussion, let’s take note of an important assumption: the uncertainty of measurement is represented by a normal (Gaussian) probability distribution function, which is consistent with typical measurement results (assuming that the Central Limit Theorem applies).
After calibration or testing an item with its measurement uncertainty known, our subsequent statement of conformance with a specification or regulatory limits can lead us to 2 possible outcomes:
- We are right
- We are wrong
The decision rule made is related to statistical hypothesis testing, where we propose a null hypothesis Ho for a situation and an alternative hypothesis H1 to be adopted should Ho be rejected on the basis of a test statistic. In this case, we can make either a Type I error (false POSITIVE or false ALARM, i.e. rejecting the null hypothesis Ho when in fact Ho is true) or a Type II error (false NEGATIVE, i.e. not rejecting Ho when in fact Ho is false).
It follows that the probabilities of making the correct decisions are (1 – alpha) and (1 – beta), respectively. Generally we would take a 5% Type I risk, so alpha = 0.05, and we would claim 95% confidence in making this statement of conformity.
In layman’s language:
- Type I: Deciding that something is NOT OK when it actually is OK, with probability (risk) alpha
- Type II: Deciding that something is OK when it actually is NOT OK, with probability (risk) beta
Figure 1 shows the matrix of such decision making and potential errors involved:
The statistical basis of the decision rules is to determine where the “Acceptance zone” and the “Rejection zone” are, such that if the measurement result lies in the acceptance zone, the product is declared compliant, and, if in the rejection zone, it is declared non-compliant. Graphically, it can be shown as in Figure 2 below:
Figure 2: Display of results with measurement uncertainties around specification limits
We should not have any issue in deciding the conformity in Case 1 and non-conformity in Case 4 due to a clear cut situation as shown in Figure 2 above, but we need to assess if Cases 2 and 3 are in conformity or not, as illustrated in Figure 3 below for an upper specification limit:
For the situations in Cases 2 and 3, we may include the following thoughts in the decision rule making before considering the amount of risk to be taken in deciding conformity:
- Making a request for additional measurement(s)
- Re-evaluating measurement uncertainty to narrow the range, if possible
- Comparing a manufactured (and tested) product with an alternative specification to decide on its possible sale at a discounted price, as rejected goods
Part B of this article will discuss both simple and more complicated decision rules that can be made when issuing a statement of conformance after testing or calibration. Before that, we shall study a practical worked example.
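As a minimal sketch of how the four cases might be sorted for an upper specification limit, the function below compares the interval (result ± expanded uncertainty) with the limit. The case numbering is an assumption here, since the figures are not reproduced: Case 1 is taken as the whole interval lying within the limit, Case 4 as the whole interval lying above it, and Cases 2 and 3 as intervals that straddle the limit. The function name and the example numbers are hypothetical.

```python
def classify_upper_limit(result, expanded_u, upper_limit):
    """Sort a result into Case 1-4 against an upper specification limit.

    Assumed numbering (not confirmed by the source figures):
    Case 1 = interval entirely within the limit, Case 4 = entirely above it,
    Cases 2 and 3 = interval straddles the limit.
    """
    low = result - expanded_u
    high = result + expanded_u
    if high <= upper_limit:
        return "Case 1: conforms"
    if low > upper_limit:
        return "Case 4: does not conform"
    if result <= upper_limit:
        return "Case 2: result within limit, but interval crosses it"
    return "Case 3: result above limit, but interval crosses it"


# Hypothetical results, each with expanded uncertainty 0.5 and upper limit 10.0
for value in (9.0, 9.8, 10.2, 11.0):
    print(value, "->", classify_upper_limit(value, 0.5, 10.0))
```

Only Cases 2 and 3 call for the extra steps listed above or for an explicit risk calculation.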
I would like to share some of the ideas picked up at the 2-day Eurachem / Pancyprian Union of Chemists (PUC) joint training workshop on 20-21 February 2020, titled “Accreditation of analytical, microbiological and medical laboratories – ISO/IEC 17025:2017 and ISO 15189:2012”, after flying all the way from Singapore to Nicosia, Cyprus, with a stopover in Istanbul.
Today, let’s see whether there is a requirement for an expression of uncertainty in qualitative analysis. In other words, are there quantitative reports of uncertainties in qualitative test results?
Qualitative chemical and microbiological testing usually fall under the following binary classifications with two outcomes only:
- Pass/Fail for a targeted measurand
- “Above” or “Below” a limit
- Red or yellow colour
- Classification into ranges (<2; 2 – 5; 5 – 10; >10)
- Authentic or non-authentic
Many learned professional organizations have set up working groups to study the expression of uncertainty for such types of qualitative analysis over many years, but have yet to officially publish guidance in this respect.
The current thinking refers to the following common approaches:
1. Using false positive and negative response rates
In a binary test, a correct result is either a true positive (TP) or a true negative (TN). Two kinds of error are associated with such testing, giving rise to a false positive (FP) or a false negative (FN).
A false positive error occurs in data reporting when a test result improperly indicates presence of a condition, such as a harmful pathogen in food, when in reality it is not present (being a Type I error, statistically speaking), while a false negative is an error in which a test result improperly indicates absence of a condition when in reality it is present (i.e. a Type II error).
Consequently, the false positive response rate, which corresponds to the significance level (Greek letter alpha, α) in statistical hypothesis testing, is the ratio of those negatives that still yield positive test outcomes to the total number of actual negatives, i.e. FP/(TN+FP). The specificity of the test is then equal to 1−α.
Complementarily, the false negative rate is the proportion of actual positives that yield negative test outcomes with the test, i.e. FN/(TP+FN). In statistical hypothesis testing, we denote this fraction by the Greek letter beta, β (for a Type II error), and the “power” or the “sensitivity” of the test is equal to 1−β.
See table below:
2. Alternative performance indicators (single laboratory)
The alternative performance indicators are actually reliability measures involving several formulae, as summarized below:
|False positive rate|FP/(TN+FP)|
|False negative rate|FN/(TP+FN)|
|Positive predictive value|TP/(TP+FP)|
|Youden Index|Sensitivity + Specificity – 1|
|Likelihood ratio|(1 – False negative rate)/False positive rate|
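The formulae in the table can be computed directly from the four counts of a validation study. The sketch below is only an illustration: the function name and the counts (90 TP, 10 FN, 95 TN, 5 FP) are hypothetical and not taken from any real data set.

```python
def binary_test_metrics(tp, fp, tn, fn):
    """Compute the reliability measures in the table from TP, FP, TN, FN counts."""
    fpr = fp / (tn + fp)            # false positive rate (alpha)
    fnr = fn / (tp + fn)            # false negative rate (beta)
    sensitivity = 1 - fnr           # power of the test
    specificity = 1 - fpr
    return {
        "false_positive_rate": fpr,
        "false_negative_rate": fnr,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "positive_predictive_value": tp / (tp + fp),
        "youden_index": sensitivity + specificity - 1,
        # Undefined when no false positives were observed; reported as infinity
        "likelihood_ratio": (1 - fnr) / fpr if fpr > 0 else float("inf"),
    }


# Hypothetical validation study: 100 known positives and 100 known negatives
metrics = binary_test_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the false positive rate is 0.05 and the false negative rate is 0.10, so the Youden Index is 0.85 and the likelihood ratio is 18.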
There are many challenges in evaluating qualitative “uncertainty”. Although the idea of estimating uncertainty for such binary results is sound, the most problematic issue is how to collect the hundreds of experimental data needed to make reasonable statistical estimates of low false response rates. Another challenge is how to estimate the population probabilities confidently and without bias. A sensible suggestion is to ask laboratories to follow published codes of best practice in qualitative testing where they are available, and to ensure that the conditions of testing are under adequate control.
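To give a feel for why hundreds of results are needed, here is a rough sketch based on a simple binomial argument: if the true false response rate is p, the chance of seeing no false responses at all in n independent trials is (1 − p)^n. Requiring this chance to be at most 5% gives the smallest n that demonstrates, with 95% confidence, that the rate is below p. Independence of the trials and the function names are assumptions of this illustration.

```python
import math


def min_trials_for_zero_failures(max_rate, confidence=0.95):
    """Smallest n such that n trials with zero false responses show,
    at the given confidence, that the false response rate is below max_rate."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - max_rate))


def upper_bound_after_zero_failures(n_trials, confidence=0.95):
    """One-sided upper confidence bound on the rate after n clean trials."""
    return 1 - (1 - confidence) ** (1 / n_trials)


for rate in (0.05, 0.01):
    n = min_trials_for_zero_failures(rate)
    print(f"rate < {rate:.0%}: need {n} trials with no false response")
```

Under these assumptions, showing a false response rate below 1% takes 299 consecutive correct results, and a single false response pushes the requirement higher still.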
At this moment, quantitative (i.e. numerical) reports of uncertainties in qualitative test results, involving strict metrological and statistical calculations, are not generally expected by the accreditation bodies.
In my training workshops on decision rules for making a statement of conformity after laboratory analysis of a product, some participants have found the subject of hypothesis testing rather abstract. But in my opinion, an understanding of the significance of Type I and Type II errors in hypothesis testing does help in formulating a decision rule based on the acceptable risk to be taken by the laboratory in declaring whether a tested product conforms with the specification.
As we know well, a hypothesis is a statement that might, or might not, be true until we put it to some statistical tests. As an analogy, a graduate studying for a Ph.D. degree always carries out research work on a certain hypothesis given by his or her supervisor. Such a hypothesis may or may not be proven true at the conclusion. Of course, a breakthrough in the research in hand means that the original hypothesis, called the null hypothesis, is not rejected.
In statistics, we set up the hypothesis in such a way that it is possible to calculate the probability (p) of the data, or of the test statistic (such as Student’s t) calculated from the data, given the hypothesis, and then to make a decision about whether this hypothesis is to be accepted (high p) or rejected (low p).
In conformity testing, we treat the specification or regulatory limit given as the ‘true’ or certified value and our measurement value obtained is the data for us to decide whether it conforms with the specification. Hence, our null hypothesis Ho can be put forward as that there is no real difference between the measurement and the specification. Any observed difference arises from random effects only.
To make a decision rule on conformance in significance testing, we must choose the value of the probability below which the null hypothesis is rejected and a significant difference is concluded. This is the probability of making an error of judgement in the decision.
If the probability that the data are consistent with the null hypothesis Ho falls below a pre-determined low value (say, alpha = 0.05 or 0.01), then the hypothesis is rejected at that probability. Therefore, a p<0.05 would mean that we reject Ho with 95% level of confidence (or 5% error) if the probability of the test statistic, given the truth of Ho, falls below 0.05. In other words, if Ho were indeed correct, less than 1 in 20 repeated experiments would fall outside the limits. Hence, when we reject Ho, we conclude that there was a significant difference between the measurement and the specification limit.
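The "less than 1 in 20" statement can be checked with a small simulation. The sketch below assumes that, when Ho is true, the standardized test statistic follows a standard normal distribution, and it uses a two-sided test at alpha = 0.05; the function name, seed and number of repeats are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist


def false_alarm_rate(n_experiments=20000, alpha=0.05, seed=1):
    """Fraction of simulated experiments that reject Ho although Ho is true."""
    rng = random.Random(seed)
    # Two-sided critical value, about 1.96 for alpha = 0.05
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_experiments):
        z = rng.gauss(0.0, 1.0)  # test statistic drawn under a true Ho
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_experiments


print(f"Observed Type I error rate: {false_alarm_rate():.3f}")
```

The observed fraction should sit close to the chosen alpha, which is exactly the Type I risk the laboratory has agreed to carry.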
Gone are the days when we provide a conformance statement when the measurement result is exactly on the specification value. By doing so, we are exposed to a 50% risk of being found wrong. This is because we either have assumed zero uncertainty in our measurement (which cannot be true) or the specification value itself has encompassed its own uncertainty which again is not likely true.
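The 50% figure follows directly from the normal-distribution assumption. As a hedged sketch, the function below treats the true value as normally distributed around the measured result with the standard uncertainty as its standard deviation, and returns the probability that the true value lies above an upper limit. The function name and the numbers (limit 10.0, standard uncertainty 0.2) are hypothetical.

```python
from statistics import NormalDist


def risk_of_exceeding(result, std_uncertainty, upper_limit):
    """Probability that the true value exceeds the upper limit, assuming it is
    normally distributed around the result with the given standard uncertainty."""
    belief = NormalDist(mu=result, sigma=std_uncertainty)
    return 1.0 - belief.cdf(upper_limit)


# Hypothetical upper limit of 10.0 and standard uncertainty of 0.2
for result in (9.6, 9.8, 10.0):
    risk = risk_of_exceeding(result, 0.2, 10.0)
    print(f"result {result}: risk of non-conformity = {risk:.3f}")
```

A result sitting exactly on the limit gives a risk of 0.5, while a result two standard uncertainties below the limit brings it down to about 0.023.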
Now, in our routine testing, we would have established the measurement uncertainty (MU) of test parameters such as the contents of oil, moisture, protein, etc. Our MU, as an expanded uncertainty, has been evaluated by multiplying the estimated combined standard uncertainty by a coverage factor (normally k = 2) to give about 95% confidence. Assuming the MU is constant over the range of values tested, we can easily determine the critical value that is not significantly different from the specification value or regulatory limit by the use of Student’s t-test. This is Case B in Fig 1 below.
So, if the specification has an upper or maximum limit, any test value smaller than this critical value, which the Student’s t-test places below the specification limit, can be ‘safely’ claimed to be within specification (Case A). On the other hand, any test value larger than this critical value reduces our confidence in claiming that it is within specification (Case C). Do you want to claim that the test value does not meet the specification limit although numerically it is smaller than the specification limit? This is the dilemma that we are facing today.
The ILAC Guide G8:2009 suggests stating “not possible to state compliance” in such a situation. Certainly, the client is not going to be pleased about it, as he is used to receiving your positive compliance comments even when the measurement result is exactly on the dot of the upper limit.
That is why the ISO/IEC 17025:2017 standard requires accredited laboratory personnel to discuss the decision rule with their clients and to get their written consent to the manner of reporting.
To minimize this awkward situation, one remedy is to reduce your measurement uncertainty range as much as possible, pushing the critical value nearer to the specification value. However, there is always a limit to how far this can go because uncertainty of measurement always exists. In the above example, the critical reporting value will always be numerically smaller than the upper limit.
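A minimal sketch of the critical value for an upper limit is given below. Note the simplification: it uses the one-sided normal quantile (about 1.645 at alpha = 0.05) in place of the Student's t value, which is reasonable only when the degrees of freedom are large. The function name and the example numbers (limit 10.0, expanded uncertainty 0.4 with k = 2) are hypothetical.

```python
from statistics import NormalDist


def critical_value_upper(upper_limit, expanded_u, k=2.0, alpha=0.05):
    """Largest result that can be declared within an upper limit at risk alpha.

    Uses the normal quantile instead of Student's t (large degrees of freedom
    assumed), so this is an approximation of the approach described in the text.
    """
    std_uncertainty = expanded_u / k           # back to standard uncertainty
    z = NormalDist().inv_cdf(1 - alpha)        # one-sided quantile, ~1.645
    return upper_limit - z * std_uncertainty


limit, expanded_u = 10.0, 0.4                  # hypothetical specification and MU
critical = critical_value_upper(limit, expanded_u)
print(f"Critical value: {critical:.3f}")

for result in (9.5, 9.8):
    verdict = "within specification" if result <= critical else "cannot claim conformity"
    print(f"result {result}: {verdict}")
```

Halving the expanded uncertainty halves the gap between the critical value and the limit, which is the effect of reducing the measurement uncertainty on the reportable range.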
Alternatively, you can discuss with the client and let him provide you his acceptance limits. In this case, your laboratory’s risk is minimized greatly as long as your reported value with its associated measurement uncertainty is well within the documented acceptance limit because your client has taken over the risk of errors in the product specification (i.e. customer risk).
Thirdly, you may want to take a calculated commercial risk by allowing the upper uncertainty limit to extend into the fail zone above the upper specification limit, for commercial reasons such as keeping a good relationship with an important customer. You may even choose to report a measurement value that is exactly on the specification limit as conforming. However, by doing so, you are taking a 50% risk of being found in error in the issued statement of conformance. Is it worth taking such a risk? Always remember the actual meaning of measurement uncertainty (MU), which is to provide a range of values around the reported test result that covers the true value of the test parameter with 95% confidence.
Why do we perform hypothesis tests?
A course participant commented the other day that descriptive statistical subjects were much easier to understand and could be appreciated, but not the analytical or inferential statistics which call for logical reasoning and inferential implications of the data collected.
I think the core issue lies in the abstract nature of inferential statistics. Hypothesis testing is a good example. Here, we need to determine the probability of finding the data given the truth of a stated hypothesis.
A hypothesis is a statement made that might, or might not, be true.
Usually the hypothesis is set up in such a way that it is possible for us to calculate the probability (P) of the data (or the test statistic calculated from the data) given the hypothesis, and then to make a decision about whether the hypothesis is to be accepted (high P) or rejected (low P).
A particular case of a hypothesis test is one that determines whether or not the difference between two values is significant – a significance test.
For this case, we actually put forward the hypothesis that there is no real difference and the observed difference arises from random effects. We assign this as the null hypothesis (Ho).
If the probability that the data are consistent with the null hypothesis (Ho) falls below a predetermined low value (say, 0.05 or 0.01), then Ho is rejected at that probability.
Therefore, p<0.05 means that if the null hypothesis were true, we would find the observed data (or more accurately, the value of the test statistic, or greater, calculated from the data) in less than 5% of repeated experiments.
To use this in significance testing, a decision must be made about the value of the probability below which the null hypothesis is rejected and a significant difference is concluded.
In laboratory analysis, we tend to reject the null hypothesis “at the 95% level of confidence” if the probability of the test statistic, given the truth of Ho, falls below 0.05. In other words, if Ho is indeed correct, fewer than 5% (i.e. 1 in 20) of the averages of repeated experiments would fall outside the limits. In this case, it is concluded that there is a significant difference.
However, it must be stressed that the figure of 95% is a somewhat arbitrary one, arising from the fact that (mean ± 2 standard deviations) covers about 95% of a normally distributed population.
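The coverage of an interval of k standard deviations either side of the mean can be checked with the standard normal distribution. This short sketch assumes a normal population; the function name is only illustrative.

```python
from statistics import NormalDist


def normal_coverage(k):
    """Fraction of a normal population lying within k standard deviations of the mean."""
    standard_normal = NormalDist()
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1.96, 2.0, 3.0):
    print(f"mean ± {k} standard deviations covers {normal_coverage(k):.4f}")
```

A factor of 2 covers about 95.45% of the population, while exactly 95% corresponds to a factor of about 1.96, which is why the round figure of 95% is a convention and not a natural constant.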
With modern computers and spreadsheets, it is possible to calculate the probability of the statistic given a hypothesis, leaving the analyst to decide whether to accept or reject it.
In deciding what a reasonable level to accept or reject a hypothesis is, i.e. how significant is “significant”, two scenarios, in which the wrong conclusion is arrived at, need to be considered. Therefore, there is a “risk” in making a wrong decision at a specified probability.
A so-called Type I error is the case where we reject a hypothesis when it is actually true. It may also be known as “a false positive”.
The second scenario is the opposite of this, when the significance test leads to the analyst wrongly accepting the null hypothesis although in reality Ho is false (a Type II error or a false negative).
We have discussed these two types of error in the short articles: https://consultglp.com/2017/12/28/type-i-and-type-ii-errors-in-significance-tests/ , and, https://consultglp.com/2017/03/01/sharing-a-story-of-type-i-error/