### Posts tagged ‘Probability’

### Excel functions in Hypothesis Testing

Data analysis allows us to answer questions about the data or about the population that the sample data describes.

When we ask questions like “is the alcohol level in the suspect’s blood sample significantly greater than 50 mg/100 ml?” or “does my newly developed TEST method give the same results as the standard method?”, we need to determine the probability of finding the test data given the truth of a stated hypothesis (e.g. no significant difference) – hence “hypothesis testing” or also known as “significance testing”.

A hypothesis, therefore, is an assumptive statement which might, or might not, be true. We test the truth of a hypothesis, which is known as a null hypothesis, H_{o}, with parameter estimation (such as mean, µ or standard deviation, s) and a calculated probability for making a decision about whether the hypothesis is to be accepted (high *p* -value) or rejected (lower *p* -value) based on a pre-set confidence level, such as *p* = 0.05 for 95% confidence.

Whilst making a null hypothesis, we must also be prepared for an alternative hypothesis, H_{1}, to fall back in case the H_{o} is rejected after a statistic test, such as *F*-test or Student’s *t*-test. The H_{1} hypothesis can be one of the following statements:

H_{1}: s_{a} ≠ s_{b} (2-sided or 2-tailed)

H_{1}: s_{a} > s_{b} (1- right sided or 1- right tailed)

H_{1}: s_{a} < s_{b} (1- left sided or 1- left tailed)

Generally a simple hypothesis test is one that determines whether or not the difference between two values is significant. These values can be means, standard deviations, or variances. So, for this case, we actually put forward the null hypothesis H_{o }that there is no real difference between the two s’s, and the observed difference arises from random effects only. If the probability that the data are consistent with the null hypothesis falling below a pre-determined low value (e.g. *p* = 0.05 or 0.01), then the hypothesis is rejected at that probability.

For an illustration, let’s say we have obtained a *t* observed value after the Student’s *t*-statistic testing. If the *p*-value calculated is small, then the observed *t*-value is higher than the *t*-critical value at the pre-determined *p*-value. So, we do not believe in the null hypothesis and reject it. If, on the other hand, the *p* -value is large, then the observed value of t is quite likely acceptable, being below the critical* t*-value based on the degrees of freedom at a set confidence level, so we cannot reject the null hypothesis.

We can use the MS Excel built-in functions to find the critical values of *F*– and *t*-tests at prescribed probability level, instead of checking them from their respective tables.

In the *F*-test for *p*=0.05 and degrees of freedom *v* = 7 and 6, the following critical one-tail inverse values are found to be the same (4.207) under all the old and new versions of the MS Excel spreadsheet since 2010:

“=FINV(0.05,7,6)”

“=F.INV(0.95,7,6)”

“=F.INV.RT(0.05,7,6)”

But, for the *t*-test, the old Excel function “=TINV” for the one-tail significance testing has been found to be a bit awkward, because this function giving the *t*-value has assumed that it is a two-tail probability in its algorithm.

To get a one-tail inverse value, we need to double the probability value, in the form of “=TINV(0.05*2, *v*)”. This make explanation to someone with lesser knowledge of statistics difficult to apprehend.

For example, if we want to find a *t*-value at *p*=0.05 with *v* = 5 degrees of freedom, we can have the following options:

=TINV(0.05,5) | 2.5705 |

=TINV(0.05*2,5) | 2.0150 |

=T.INV(0.05,5) | -2.0150 |

=T.INV(0.95,5) | 2.0150 |

=T.INV.2T(0.05*2,5) | 2.0150 |

So, it looks like better to use the new function “=T.INV(0.95,5)” or *absolute* value of “=T.INV(0.05,5)” for the one-tail test at 95% confidence.

The following thus summarizes the use of T.INV for one- or two-tail hypothesis testing:

- To find the
*t*-value for a right-sided or greater than H_{1}test, use =T.INV(0.95,*v*) - To find the
*t*-value for a left-sided or less than H_{1}test, use =T.INV(0.05,*v*) - To find the
*t*-value for a two-sided H_{1}test, use =T.INV.2T(0.05,*v*)

### The Birthday Paradox

I bet you are not going to believe the following statements.

*There is a 50% chance for at least two persons in a group of 23 persons at a party to have the same birthday! *

* **When there is a group of 50 people, such chance is even very much higher, at 97%!*

Is this true? Do you want to check and confirm the trueness of these statements during your next partying?

Let’s reason out this phenomenon through the basic rules of probability.

Remember that if you flip a perfect coin, the probability of getting a head is 1/2 since one out of two possible outcomes is a head. If you flip a coin twice, the probability of getting both heads is (1/2)(1/2) or 1/4 or 0.25.

On the other hand, if a million people buy a ticket in the lottery draw and you buy one, then, the probability of you winning the first prize is 1/10^{6} or 0.000001 or 0.0001%. That chance sounds miserable, of course.

Now, there is a rule of probability known as the complement rule, which says the probability of something happening is one minus the probability it does not happen. So, the complement rule says the probability of not winning the lotteries is (1 – 0.000001) or 0.999999. That means the chance of winning is very slim indeed.

Let’s assume that we have 365 days in a year and so the probability that two persons have different birthdays is (365/365)(364/365) or 0.997. This calculation is understandable: the first person can have his or her birthday any day of the year and so the first factor is 365/365 or 1, whereas the second person must have his or her birthday on one of the remaining 364 days out of 365 days in order to be distinct from the first guy, giving rise to the second factor 364/365.

Hence, we can conclude that the probability that both persons have matching birthdays on any of the days in a calendar year is (1 – 0.997) or 0.003 or 0.3% chance. When we have 3 persons for consideration, the probability of having distinct birthdays is (365/365)(364/365)(363/365) or 0.992. As the number of people increases, the probability of having different birthdays decreases as shown in the table below:

No. of persons | Probability of no matching birthdays | Probability of at least one matching birthday |

2 | 0.997 | 0.003 |

3 | 0.992 | 0.008 |

4 | 0.984 | 0.016 |

5 | 0.973 | 0.027 |

6 | 0.960 | 0.040 |

7 | 0.944 | 0.056 |

8 | 0.926 | 0.074 |

9 | 0.905 | 0.095 |

10 | 0.883 | 0.117 |

11 | 0.86 | 0.14 |

12 | 0.83 | 0.17 |

13 | 0.81 | 0.19 |

14 | 0.78 | 0.22 |

15 | 0.75 | 0.25 |

16 | 0.72 | 0.28 |

17 | 0.68 | 0.32 |

18 | 0.65 | 0.35 |

19 | 0.62 | 0.38 |

20 | 0.59 | 0.41 |

21 | 0.56 | 0.44 |

22 | 0.52 | 0.48 |

23 | 0.49 | 0.51 |

30 | 0.29 | 0.71 |

35 | 0.19 | 0.81 |

50 | 0.03 | 0.97 |

100 | 0.0000004 | 0.9999996 |

From the above table, we have shown that in a group of 23 people, the probability of matching birthdays is 0.51 or 51%, which is almost the same as flipping a coin! And, with 50 people, there is a 97% chance at least two persons have the same birthday and for 100 people, it is almost 100% sure of birthday matching!

In fact, by this reasoning, we see that when there are 366 people or more under this consideration for 365 days (no leap year, please), certainly the probability is perfectly 1 or 100% chance. It is just like putting 366 pigeons in 365 holes. There are bound to have one box with more than one pigeon inside. This is called the *Pigeonhole principle*.

The above illustration is called the Birthday Paradox. Why do we call it a paradox?

Wikipedia says that a **paradox** is a statement that, despite apparently sound reasoning from true premises, leads to a self-contradictory or a logically unacceptable conclusion. A paradox involves contradictory yet interrelated elements that exist simultaneously and persist over time. In other words, a paradox refers to trying to solve a problem when certain aspects of the problems are overlooked.

The reason that the probability of matching birthday is greater than what our intuition tells us is due to the fact that we have overlooked all the scenario that birthdays can match. When we say “matching birthdays” here, we only mean matching birthdays of *any two* persons, *any day* of the year, and nothing more. Definitely, there are more of these probabilities than what we have considered.

## Recent Comments