Training and consultancy for testing laboratories.

Archive for the ‘Basic statistics’ Category

Revisiting Hypothesis Testing

A few course participants have expressed the opinion that the subject of hypothesis testing is quite abstract and that they found its concept and application hard to grasp. I thought otherwise. Perhaps let's go through its basics again.

We know the study of statistics can be broadly divided into descriptive statistics and inferential (analytical) statistics. Descriptive statistical techniques (such as frequency distributions, mean, standard deviation, variance and other measures of central tendency) are useful for summarizing data obtained from samples, but they also provide the tools for more advanced analysis of the broader population from which the samples are drawn, through the application of probability theory in sampling distributions and confidence intervals. We analyse the variation in the sample data collected to infer what the corresponding population parameter is likely to be.

A hypothesis is an educated guess about something around us, as long as we can put it to the test either by experiment or by observation. Hypothesis testing is thus a statistical method for making decisions from experimental data. It is basically an assumption that we make about a population parameter. In a nutshell, we want to:

  • make a statement about something
  • collect sample data relating to the statement
  • conclude, if the sample outcome is unlikely given that the statement is true, that the statement is probably not true.

In short, we have to make a decision about the hypothesis: whether to accept the null hypothesis or to reject it at a certain level of significance. Every hypothesis test therefore produces a significance value (p-value) for that particular test. If this significance value is greater than the predetermined significance level, we accept the null hypothesis; if it is less than the predetermined level, we reject the null hypothesis.

Let us have a simple illustration.

Assume we want to know whether a particular coin is fair. We can state the null hypothesis, H0, that it is a fair coin. The alternative hypothesis, H1 or Ha, is of course that the coin is not a fair coin.

Suppose we tossed the coin 30 times and got heads 25 times. Since this would be a very unlikely outcome if the coin were fair, we can reject the null hypothesis that it is a fair coin.
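As a quick sketch of how this example could be checked in R, the built-in binom.test() function runs an exact binomial test on the figures above (the 5% significance level is an assumed choice for illustration):

```r
# Exact binomial test of H0: the coin is fair (probability of heads = 0.5)
# against H1: the coin is not fair (two-sided alternative)
result <- binom.test(x = 25, n = 30, p = 0.5, alternative = "two.sided")
print(result)

# Compare the p-value with a predetermined significance level, e.g. 0.05
alpha <- 0.05
if (result$p.value < alpha) {
  cat("Reject H0: the coin does not appear to be fair.\n")
} else {
  cat("Do not reject H0: no evidence that the coin is unfair.\n")
}
```

Here the p-value is far below 0.05, so the null hypothesis of a fair coin is rejected.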

In the next article, we shall discuss the steps to be taken in carrying out such hypothesis testing with a set of laboratory data.


R in testing sample variances


Before carrying out a statistical test to compare two sample means in a comparative study, we first have to test whether the sample variances are significantly different or not. The inferential test used is the F ratio-test, devised by the famous statistician Sir Ronald Fisher. It is widely used as a test of statistical significance.

R and F-test
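As a minimal sketch of how this check might look in R (the two data vectors are made-up replicate results, not from the article), the built-in var.test() function carries out the F ratio-test on two samples:

```r
# Hypothetical replicate results from two analysts (illustrative values only)
analyst_a <- c(10.2, 10.4, 10.1, 10.3, 10.5, 10.2)
analyst_b <- c(10.0, 10.6, 10.3, 10.8, 10.1, 10.4)

# F-test of H0: the two population variances are equal
f_result <- var.test(analyst_a, analyst_b)
print(f_result)

# If the p-value exceeds the chosen significance level (e.g. 0.05),
# the variances are not significantly different and a pooled-variance
# t-test may then be used to compare the two sample means.
```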

 

R and Student’s t-distribution


R and Student t distribution

 

Application of R in standardizing normal distribution


In the last blog, we discussed how to use R to plot a normal distribution with actual data in hand. There are, of course, infinitely many possible normal distributions, since the mean can take any value at all, and so can the standard deviation. It is therefore useful to find a way to standardize the normal distribution so that several normal distributions can be compared on the same basis…

Using R to standardize normal distribution.docx
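As a minimal sketch of the standardization idea in R (the mean, standard deviation and observed value below are assumed figures for illustration), any normally distributed value x can be converted to a z-score, z = (x - mean)/sd, so that different normal distributions can be compared on the same standard scale:

```r
# Assumed population parameters, for illustration only
mu    <- 50    # mean
sigma <- 2     # standard deviation
x     <- 53    # an observed value

# Standardize: z is the number of standard deviations x lies from the mean
z <- (x - mu) / sigma

# The probability below x is the same whether we use the original
# distribution or the standardized one
pnorm(x, mean = mu, sd = sigma)   # original scale
pnorm(z)                          # standard normal scale, same result
```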

 

Using R in Normal Distribution study


Using R to study the Normal Probability Distribution

 

Degrees of freedom


Degrees of Freedom simply explained

Degrees of freedom, v, is a very important concept in statistics. As we know, estimates of statistical parameters such as the mean, standard deviation, measurement uncertainty, F-test, Student's t-test, etc., are based upon different amounts of information or data available. The number of independent pieces of information that go into the estimation of a parameter is called the degrees of freedom.

This explanation may sound rather abstract. We can explain this concept easily by the following illustrations.

Suppose we had a sample of six numbers and their average (mean) value was 4.  The sum of these six numbers must have been 24, otherwise the mean would not have been 4.

So, now let us think about each of the six numbers in turn and put them in each of the six boxes as shown below.

If we allowed that the numbers could be positive or negative real numbers, how many values could the first number take? Of course, any value for the first number that we could think of would do the job.  Suppose it was a 4.

[ 4 ] [   ] [   ] [   ] [   ] [   ]

How many values could the next number take?  It could be again anything. Say, it was a 5.

[ 4 ] [ 5 ] [   ] [   ] [   ] [   ]

And the third number?  Anything too. Suppose it was a 3.

[ 4 ] [ 5 ] [ 3 ] [   ] [   ] [   ]

The fourth and fifth numbers could also be anything. Say they were 6 and 4:

[ 4 ] [ 5 ] [ 3 ] [ 6 ] [ 4 ] [   ]

Now, we see that the very last number had to be just 2 and nothing else because the numbers had to add up to 24 to have the mean of the six numbers as 4.

[ 4 ] [ 5 ] [ 3 ] [ 6 ] [ 4 ] [ 2 ]

So, we had total freedom in selecting the first number, and the same was true for the second, third, fourth and fifth numbers. But we had no choice at all in selecting the sixth number. That means we had 5 degrees of freedom when considering six numbers whose mean value (a statistical parameter) was fixed.

Generally speaking, we work with n-1 degrees of freedom once the sample mean has been estimated from a sample of size n. We use this in our estimation of the sample standard deviation and other statistics.

So, we define it as follows: the degrees of freedom, v, equal the sample size, n, minus the number of parameters, p, estimated from the data (v = n - p).

In the case of linear regression, where we fit the linear equation y = a + bx, we have two statistical parameters to estimate, i.e. the y-intercept a and the slope (gradient) b. Hence, if we have 7 data points on a linear calibration curve, the degrees of freedom are 7 - 2, or 5. In general, a linear regression study has n-2 degrees of freedom.
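A small R sketch illustrating both cases (the calibration data are made-up values; the six numbers are those from the boxes above): the sample standard deviation uses n-1 degrees of freedom, while a fitted straight line y = a + bx leaves n-2 residual degrees of freedom.

```r
# Sample standard deviation: R's sd() divides by (n - 1) degrees of freedom
x <- c(4, 5, 3, 6, 4, 2)                 # the six numbers with mean 4
n <- length(x)                           # 6 observations
sd(x)                                    # uses n - 1 = 5 degrees of freedom
sqrt(sum((x - mean(x))^2) / (n - 1))     # same value, computed explicitly

# Linear regression y = a + bx with 7 calibration points:
# two parameters (intercept a and slope b) are estimated,
# leaving 7 - 2 = 5 residual degrees of freedom
conc   <- c(0, 1, 2, 3, 4, 5, 6)                        # hypothetical standards
signal <- c(0.02, 0.21, 0.39, 0.62, 0.80, 1.01, 1.19)   # hypothetical responses
fit <- lm(signal ~ conc)
df.residual(fit)                         # returns 5
```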

A short note on Excel Random function RAND()


The MS Excel® "=RAND()" function is one of two functions commonly used to generate random numbers in an Excel cell: it returns a random decimal number between zero and one. The other is "=RANDBETWEEN()", which returns a random integer within a specified range. Read on … A note on Excel Random function
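For readers who prefer to work in R, a rough analogue of these two Excel functions (this side-by-side comparison is my own addition, not part of the linked note):

```r
# Analogue of Excel's =RAND(): a random decimal number between 0 and 1
runif(1)

# Analogue of Excel's =RANDBETWEEN(1, 100): a random integer between 1 and 100
sample(1:100, 1)

# Several values at once, e.g. five of each
runif(5)
sample(1:100, 5, replace = TRUE)
```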

 

Outlier test statistics


It is quite difficult to identify suspected values as outliers just by visual inspection of a set of collected data. Very often, an outlier test statistic has to be calculated before further action, such as deleting the data point or carrying out further testing, because we do not wish to discard data without sound statistical justification. Read on … The outlier test statistic
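As one possible illustration (the linked article may describe a different statistic), here is a minimal base-R sketch of Grubbs' test for a single suspect value; the data vector is made up for the example:

```r
# Hypothetical replicate results; the last value looks suspect
results <- c(5.31, 5.28, 5.33, 5.30, 5.29, 5.32, 5.71)

n <- length(results)
G <- max(abs(results - mean(results))) / sd(results)   # Grubbs' statistic

# Two-sided critical value at significance level alpha
alpha  <- 0.05
t_crit <- qt(alpha / (2 * n), df = n - 2, lower.tail = FALSE)
G_crit <- ((n - 1) / sqrt(n)) * sqrt(t_crit^2 / (n - 2 + t_crit^2))

if (G > G_crit) {
  cat("The suspect value is a statistical outlier at the", alpha, "level.\n")
} else {
  cat("No statistical justification to reject the suspect value.\n")
}
```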

What is an outlier?


In repeated chemical analysis, which assumes a normal probability distribution, we may find that an extreme value (i.e. the largest or smallest result) is a suspect which seems quite different from the rest of the data set. In other words, this result does not seem to belong to the same distribution as the rest of the data. Such a suspect value is called an outlier. Read more … What is an outlier?

 

Assumptions of using ANOVA



Analysis of variance (ANOVA) is useful in laboratory data analysis for significance testing. It does, however, have certain assumptions that must be met for the technique to be applied appropriately. These assumptions are somewhat similar to those of regression, because linear regression and ANOVA are really just two ways of analysing data using the general linear model. Departures from these assumptions can seriously affect inferences made from the analysis of variance.

The assumptions are:

  1. Appropriateness of data

The outcome variable should be continuous, measured at the interval or ratio level, and unbounded or valid over a wide range. The factor (grouping variable) should be categorical (i.e. a label such as Analyst, Laboratory, Temperature, etc.);

  2. Randomness and independence

Each value of the outcome variable should be independent of every other value, so as to avoid bias. No observation should influence any other observation. That means the samples of the groups under comparison must be randomly and independently drawn from their populations.

  3. Distribution

The continuous variable should be approximately normally distributed within each group. This can be checked by creating a histogram and by a statistical test for normality such as the Anderson-Darling or the Kolmogorov-Smirnov test. However, the one-way ANOVA F-test is fairly robust against departures from normality. As long as the distributions are not extremely different from a normal distribution, the level of significance of the ANOVA F-test is usually not greatly affected by lack of normality, particularly for large samples.

  4. Homogeneity of variance

The variance of each of the groups should be approximately equal. This assumption is needed in order to combine, or pool, the variances within the groups into a single within-group source of variation, SSW. The Levene test can be used to check variance homogeneity. The null hypothesis is that the variances are homogeneous, so if the Levene statistic is not statistically significant (normally at alpha = 0.05), the variances are assumed to be sufficiently homogeneous to proceed with the data analysis.
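A minimal R sketch of how the normality and variance-homogeneity checks might be run before an ANOVA (the data frame, factor levels and values are assumed for illustration; the Shapiro-Wilk test is used here as a simple base-R stand-in for the normality tests named above, and the Levene test comes from the add-on car package):

```r
# Hypothetical data: replicate results reported by three analysts
dat <- data.frame(
  result  = c(10.1, 10.3, 10.2, 10.4,    # Analyst A
              10.6, 10.5, 10.7, 10.4,    # Analyst B
              10.2, 10.1, 10.3, 10.2),   # Analyst C
  analyst = factor(rep(c("A", "B", "C"), each = 4))
)

# Fit the one-way ANOVA model and check normality of its residuals
model <- aov(result ~ analyst, data = dat)
shapiro.test(residuals(model))        # H0: residuals are normally distributed

# Homogeneity of variance: Levene's test (requires the 'car' package)
# install.packages("car")
car::leveneTest(result ~ analyst, data = dat)

# If neither test is significant at alpha = 0.05, proceed with the ANOVA
summary(model)
```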