Pearson's Chi-Square Tests

In this lesson you will learn about two tests, a goodness-of-fit test and a test for independence. In both tests we will use a test statistic, denoted by \(X^2 ,\) that was suggested by the statistician Karl Pearson in 1900. The test statistic turns out to have a distribution that can be approximated by the so-called chi-square distribution.

Goodness-of-fit Test

A goodness-of-fit test is used to check whether or not an observed distribution differs from a theoretical distribution. The test will be explained by an example.

A factory produces marbles in the sizes small, medium, and large. A third of the marbles are supposed to be small, half of them medium, and a sixth are supposed to be large.

The appropriate test for evaluating this claim is \[\begin{align*} & H_0: \quad p_1 = \frac{1}{3},\quad p_2 = \frac{1}{2},\quad p_3 = \frac{1}{6}\\ & H_A: \quad \mbox{not all equalities hold in } H_0 \end{align*}\] where \(p_1\) denotes the proportion of small marbles, \(p_2\) the proportion of medium marbles, and \(p_3\) the proportion of large marbles produced by the factory.

To perform the test you are presented with a simple random sample of, say, 120 marbles from the factory.

If the null hypothesis is true then how many of the 120 marbles in the sample do you expect to be small, how many to be medium, and how many large?

Denote the expected value for the small size by \(E_1,\) the expected value for medium by \(E_2,\) and the expected value for large by \(E_3.\)

If the null hypothesis is true, then a third of the marbles should be small, and therefore we expect a third of 120, that is 40 marbles, to be small. In general, to find the expected value we just multiply the sample size by the hypothesized probability and obtain \[E_1 = 40,\quad E_2 = 60, \quad \mbox{ and } E_3 = 20.\]
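
If you want to verify these numbers in R, you can multiply the sample size by the vector of hypothesized probabilities. The variable name p0 below is only an illustrative choice.

p0 = c(1/3, 1/2, 1/6)   # hypothesized proportions of small, medium, and large
120*p0                  # expected counts: 40, 60, and 20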

We sort the simple random sample of marbles from the factory by size and observe that there are \(O_1 = 25\) small, \(O_2 = 72\) medium, and \(O_3 = 23\) large marbles.

How close are the observed values to the expected values?

Karl Pearson proposed measuring the difference between the observed and expected values by the quantity \[X^2 = \sum{\frac{(O_i - E_i)^2}{E_i}.}\]

In our example \[\begin{align*}X^2 & = \sum{\frac{(O_i - E_i)^2}{E_i}}\\ & = \frac{(25-40)^2}{40}+\frac{(72-60)^2}{60}+\frac{(23-20)^2}{20}\\ & = 8.475.\end{align*}\]
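
The same computation can be carried out in R; the variable names observed and expected are again only illustrative.

observed = c(25, 72, 23)              # observed counts
expected = c(40, 60, 20)              # expected counts under the null hypothesis
sum((observed-expected)^2/expected)   # Pearson's chi-square statistic, 8.475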

A large value of \(X^2\) indicates a poor fit of the data to the expected values, which are calculated under the assumption that the null hypothesis is true. A large value of \(X^2\) is therefore considered evidence that the null hypothesis is not true. Is 8.475 a large value?

To answer this question we need to know how likely it is to obtain a value of \(X^2\) that is 8.475 or larger when the null hypothesis is in fact true.

We can estimate this probability by simulation. Imagine a factory that produces marbles according to the null hypothesis. You take a simple random sample of size 120 and calculate the value of \(X^2 .\) Now repeat this many times, say 10,000 times, so that we have 10,000 values of \(X^2 .\) How many of these values are 8.475 or larger? Suppose there are 135. Then the probability of obtaining a value of \(X^2\) of 8.475 or larger by chance can be estimated by \(135/10000 = 0.0135.\) Simulation is a good method if the distribution of \(X^2\) is unknown.
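
Here is a minimal sketch of such a simulation in R. The choice of 10,000 replications and the variable names are arbitrary, and the number 135 above is only a hypothetical outcome of such a run.

p0 = c(1/3, 1/2, 1/6)
expected = 120*p0
# draw 10,000 samples of size 120 from a factory that follows the null hypothesis
sim = rmultinom(10000, size=120, prob=p0)
# compute the chi-square statistic for each simulated sample (one sample per column)
stats = apply(sim, 2, function(o) sum((o-expected)^2/expected))
# estimated probability of observing a value of 8.475 or larger
mean(stats >= 8.475)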

The reason Karl Pearson chose the particular formula we used above for measuring the distance between observed and expected counts is that the resulting statistic has a known distribution, at least approximately. It has the so-called chi-squared distribution, provided that the sample size is sufficiently large. The chi-squared distribution has a parameter called the degrees of freedom and is often denoted by \(df.\)

In the particular example, the distribution of \(X^2\) is approximately chi-squared with 2 degrees of freedom. The p-value of this test, which is the probability of observing an \(X^2\) of 8.475 or larger, can be (approximately) calculated with the function pchisq().

1-pchisq(8.475, df=2)

Since the obtained p-value of 0.0144 is smaller than 0.05, we conclude that there is evidence at the significance level \(\alpha = 0.05\) that the proportions of small, medium, and large marbles are not as stipulated in the null hypothesis.

The type of test we just performed can be used whenever there are two or more categories. Denote the number of categories by \(k.\) Then the null hypothesis can be stated as follows: \[ H_0: p_i = p_{i,0} \quad \mbox{ for } \quad i=1, \ldots, k.\] The alternative hypothesis is the negation of the null hypothesis, that is, it states that at least one of the equalities in the null hypothesis does not hold. In this case the test statistic is given by \[ X^2 = \sum_{i=1}^k {\frac{(O_i-E_i)^2}{E_i}},\] where \(O_i\) denotes the observed frequency and \(E_i\) the expected frequency of the i-th category. The expected frequencies are calculated by the formula \(E_i=n p_{i,0}\), where \(n\) denotes the sample size. If the sample size is sufficiently large, and the null hypothesis is true, then the resulting test statistic \(X^2\) has approximately the chi-squared distribution with \(df = k-1\) degrees of freedom. As a rule of thumb, many statisticians consider the chi-square approximation to be valid if all expected values are at least 5.
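
As a sketch, this recipe can be wrapped into a small R function. The name gof_test and its return format are hypothetical choices for illustration only; the built-in function chisq.test(), discussed at the end of this lesson, performs the same computation.

gof_test = function(observed, p0) {
  n = sum(observed)                         # sample size
  expected = n*p0                           # expected counts under the null hypothesis
  X2 = sum((observed-expected)^2/expected)  # Pearson's chi-square statistic
  df = length(observed) - 1                 # degrees of freedom
  p_value = 1 - pchisq(X2, df=df)           # approximate p-value
  list(statistic=X2, df=df, p.value=p_value)
}
gof_test(c(25, 72, 23), c(1/3, 1/2, 1/6))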

Here is another example of this type of test. In one of his pea breeding experiments, Mendel observed the following frequencies of seeds:

seed type    round and yellow   round and green   wrinkled and yellow   wrinkled and green
frequency    315                108               101                   32

Does this data fit a 9:3:3:1 ratio?

Label the category round and yellow as 1, round and green as 2, wrinkled and yellow as 3, and wrinkled and green as 4. Denote by \(p_i\) the probability that an observation is of category \(i\) for \(i=1, 2, 3, 4.\)

The data fit a 9:3:3:1 ratio if and only if

category      1      2      3      4
probability   9/16   3/16   3/16   1/16

Therefore the appropriate test is: \[\begin{align*} & H_0: p_1 = \frac{9}{16},\quad p_2 = \frac{3}{16},\quad p_3 = \frac{3}{16}, \quad p_4 = \frac{1}{16}\\ & H_A: \mbox{not all equalities hold in } H_0 \end{align*}\]

In this example there are four categories and we arbitrarily label these four categories with the numbers 1, 2, 3, and 4.

Let's calculate \(X^2\) for our example. The observed frequencies are \[O_1 = 315,\quad O_2 = 108,\quad O_3 = 101,\quad O_4 = 32\] and since \(n=315 + 108 + 101 +32 = 556\), the expected frequencies are \[E_1 = n\cdot p_{1,0} = 556\cdot 9/16=312.75,\] \[E_2 = n\cdot p_{2,0} = 556\cdot 3/16=104.25,\] \[E_3 = n\cdot p_{3,0} = 556\cdot 3/16=104.25,\] and \[E_4 = n\cdot p_{4,0} = 556\cdot 1/16=34.75.\] Therefore \[\begin{align*} X^2 & = \frac{(O_1-E_1)^2}{E_1}+\frac{(O_2-E_2)^2}{E_2}\\ & +\frac{(O_3-E_3)^2}{E_3}+\frac{(O_4-E_4)^2}{E_4}\\ & = \frac{(315-312.75)^2}{312.75}+\frac{(108-104.25)^2}{104.25}\\ & +\frac{(101-104.25)^2}{104.25}+\frac{(32-34.75)^2}{34.75}\\ & \approx 0.47 \end{align*}\] We would like to calculate the p-value for this test statistic. The p-value is the probability \(P(X^2\geq 0.47)\) calculated under the assumption that the null hypothesis is true. Since the distribution of \(X^2\) is approximately chi-squared with 3 degrees of freedom (the degrees of freedom is the number of categories minus 1), we can calculate the p-value as follows.
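
In R the same arithmetic looks like this; the variable names are again only illustrative.

observed = c(315, 108, 101, 32)       # Mendel's observed frequencies
p0 = c(9, 3, 3, 1)/16                 # hypothesized probabilities
expected = sum(observed)*p0           # expected frequencies
sum((observed-expected)^2/expected)   # chi-square statistic, approximately 0.47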

1-pchisq(0.47, df=3)

We obtain a p-value of 0.925431 and therefore the data is in good agreement with the null hypothesis.

You've learned that the hypothesis that you are trying to establish has to be the alternative hypothesis. If you want to show that a certain die rolls a six with a probability \(p\) that is different from \(1/6,\) then the appropriate hypothesis test is \[H_0: p=\frac{1}{6}\quad \mbox{ versus } \quad H_A: p\neq \frac{1}{6}.\]

What if you want to establish that the die rolls a six with probability \(p =1/6\)? Should the alternative hypothesis then be \(H_A: p = 1/6\)?

No! This is not going to work, because there is no way that we can discredit the hypothesis \(p\neq \frac{1}{6}.\) The smaller the difference between \(p\) and \(1/6 ,\) the harder it is to detect such a difference with an experiment. If \(p\) is just a tiny bit different from \(1/6 ,\) then the hypothesis \(p\neq \frac{1}{6}\) is still true.

We cannot provide evidence for the hypothesis \(p=1/6\) in the same way in which we can provide evidence for the hypothesis \(p\neq 1/6\). The best we can do is perform the test \[H_0: p=\frac{1}{6}\quad \mbox{ versus } \quad H_A: p\neq \frac{1}{6}\] and if the test fails to reject the null hypothesis, which we expect to be the case if the null hypothesis is true, then we conclude that the result is consistent with the null hypothesis.

Failing to reject the null hypothesis does not provide convincing evidence that the null hypothesis is true, but it is the best we can do.

Goodness-of-fit Test for Poisson Distribution

A researcher wants to confirm that the Poisson distribution can be used to describe the distribution of litter sizes for a certain species. For that purpose she records the sizes of 100 litters and collects the data into the following table:

size     0    1    2    3    4    5    6    7    8
count    7   15   24   20   13   10    6    3    2

A discrete random variable \(X\) is said to be Poisson distributed if its probability mass function is given by the following formula.

\[P(X=k) = \frac{\lambda^k}{k!}e^{-\lambda},\quad k=0, 1, 2, \ldots \]

Since the parameter \(\lambda\) is unknown we need to estimate it from the data. It turns out that \(E[X]=\lambda\) and therefore \(\lambda\) can be estimated by the average litter size in the sample.

size = c(0, 1, 2, 3, 4, 5, 6, 7, 8)
count = c(7, 15, 24, 20, 13, 10, 6, 3, 2)
# estimate lambda by the average litter size in the sample
lambda = sum(size*count)/sum(count)

The expected number of litters of size \(k\) in 100 litters is \[100P(X=k) = 100\frac{\lambda^k}{k!}e^{-\lambda}.\] Here are these values for \(k=0, 1, \ldots, 8.\)

100*dpois(0:8, lambda=lambda)

The last three of these expected values are less than 5, so the chi-squared distribution may not be a good approximation to the distribution of the test statistic we want to use. A commonly used remedy is to combine some of the categories into a single category; here we combine all litters of size 6 or more:

size     0    1    2    3    4    5   6+
count    7   15   24   20   13   10   11

The corresponding probabilities and expected values can be calculated as follows.

# probabilities for litter sizes 0 to 5 and for the combined category "6 or more"
p <- c(dpois(0:5, lambda), 1-sum(dpois(0:5, lambda)))
expected <- 100*p

The test statistic is given by the same formula as before.

observed <- c(7, 15, 24, 20, 13, 10, 11)
sum((observed-expected)^2/expected)

It turns out that the test statistic is approximately chi-squared distributed with 5 degrees of freedom. Usually the degrees of freedom equals the number of categories minus 1. However, in this case we lose another degree of freedom because we had to estimate the parameter \(\lambda\). The general rule is that the degrees of freedom equals the number of categories minus 1 minus the number of estimated parameters. The p-value is then calculated as follows.

1-pchisq(2.857072, df=5)

The p-value is approximately 0.72 which is consistent with the assumption that the data follows the Poisson distribution.
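
A word of caution if you use the built-in function chisq.test() (introduced in the R Code section below) for this example: it does not know that \(\lambda\) was estimated from the data and therefore reports 6 degrees of freedom. One way around this, sketched here with the vectors p and observed defined above, is to take the statistic it computes and recompute the p-value with the correct degrees of freedom.

fit <- chisq.test(observed, p=p)   # reports df = 6, ignoring the estimated parameter
1 - pchisq(fit$statistic, df=5)    # p-value with the correct 5 degrees of freedom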

Test for Independence

A test for independence is used to check whether or not two categorical variables are independent. For example, a health researcher might wonder whether or not there is a relationship between tattoos and hepatitis C.

We consider a simple example. Three factories, say factory A, factory B, and factory C, produce marbles in three sizes (small, medium, and large) and two colors (orange and blue).

Here is the distribution of colors and size for factory A:

                 size
               small   medium   large
color  orange   2/24    3/24     1/24
       blue     6/24    9/24     3/24

The table indicates that \(2/24 = 1/12\) of all marbles are orange and small, \(3/24 = 1/8\) of all marbles are orange and medium, and so on.

Suppose you are given a randomly selected marble from factory A. What are the chances that this marble is orange?

The marble could be orange and small, orange and medium, or orange and large. Therefore the probability that the marble is orange is \[\frac{2}{24}+\frac{3}{24}+\frac{1}{24}= \frac{6}{24} = \frac{1}{4}.\]

Suppose you are told the size of that marble. Does that change the probability that the marble is orange?

To answer this question, recall the concept of conditional probability. Alternatively, you could imagine that factory A produces only 24 marbles and use the resulting frequencies of marbles to find the answer.
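
For example, if you are told that the marble is small, then the conditional probability that it is orange is \[P(\mbox{orange}\mid\mbox{small}) = \frac{P(\mbox{orange and small})}{P(\mbox{small})} = \frac{2/24}{2/24 + 6/24} = \frac{2}{8} = \frac{1}{4},\] and the same value is obtained when conditioning on medium or on large.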

Either way, you find that the marble is orange with probability 1/4, regardless of whether or not you know its size. In fact, for factory A, the probability of a certain color does not depend on the size, and the probability of a certain size does not depend on the color. We describe such a situation by saying that the variables colors and size are independent.

In general, two variables, such as colors and size, are said to be independent if their joint distribution is obtained by multiplying the marginal distributions. For example, the probability that the marble is orange and medium is \(3/24 = 1/8,\) which is the product of the probability that the marble is orange \((1/4)\) and the probability that the marble is medium \((1/2).\)
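
You can check this in R for factory A by forming all products of the marginal probabilities with outer(); the vector names p_color and p_size are only illustrative.

p_color = c(orange=1/4, blue=3/4)              # marginal distribution of color
p_size = c(small=1/3, medium=1/2, large=1/6)   # marginal distribution of size
outer(p_color, p_size)                         # reproduces the joint distribution of factory A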

Here is the distribution of colors and size for factory B:

                 size
               small   medium   large
color  orange   2/24    9/24     1/24
       blue     6/24    3/24     3/24

Can you tell the difference between the distributions of factory A and factory B?

For factory B, the variables colors and size are not independent since the probability of a medium orange marble is not the product of the probability that a marble is orange and the probability that a marble is medium.
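
Concretely, for factory B the probability that a marble is orange is \(\frac{2}{24}+\frac{9}{24}+\frac{1}{24}=\frac{12}{24}\) and the probability that it is medium is \(\frac{9}{24}+\frac{3}{24}=\frac{12}{24},\) so under independence the probability of an orange medium marble would be \(\frac{12}{24}\cdot\frac{12}{24}=\frac{6}{24},\) whereas the table shows \(\frac{9}{24}.\)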

The distribution of colors and size for factory C is not known.

Are the variables colors and size independent for factory C? To answer this question you are given a simple random sample of, say, 120 marbles from factory C:

                 size
               small   medium   large
color  orange      6       16       4
       blue       41       36      17

The appropriate test for this question is:

\(H_0\): the variables colors and size are independent, versus
\(H_A\): the variables colors and size are not independent.

We can estimate the marginal distribution of the two variables from the data:

color         orange    blue
probability   26/120    94/120

and

size          small     medium    large
probability   47/120    52/120    21/120

If you don't know how the marginal distributions are obtained, you should review the lesson on bivariate categorical data.

Assuming that the null hypothesis is true, we can estimate the joint distribution of the two variables as follows:

                           size
                       small 47/120       medium 52/120      large 21/120
color  orange 26/120   26/120 · 47/120    26/120 · 52/120    26/120 · 21/120
       blue   94/120   94/120 · 47/120    94/120 · 52/120    94/120 · 21/120

Multiplying each probability by the sample size, which is 120 in our example, we obtain the expected values of the counts:

                 size
               small        medium       large
color  orange  26·47/120    26·52/120    26·21/120
       blue    94·47/120    94·52/120    94·21/120

We can use the same measure of distance between observed and expected counts as we did previously as a test statistic. For each cell we calculate the squared difference between the observed and expected count, divide this number by the expected count, and add the result over all the cells.
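
The following R snippet carries out these steps for the sample from factory C; the variable names are again only illustrative.

observed = rbind(c(6, 16, 4), c(41, 36, 17))   # observed counts from factory C
# expected counts under independence: (row total * column total) / grand total
expected = outer(rowSums(observed), colSums(observed))/sum(observed)
expected
sum((observed-expected)^2/expected)            # chi-square statistic, approximately 4.82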

It turns out that the resulting statistic, which is again denoted by \(X^2,\) is approximately chi-squared distributed with \(df\) degrees of freedom, where \(df = (r-1)\cdot(c-1),\) \(r\) denotes the number of rows, and \(c\) denotes the number of columns. The approximation is only valid if the sample size is sufficiently large. As a rule of thumb the sample size is sufficiently large if the expected value of each cell is at least 5.

In our example there are 2 rows (2 colors) and 3 columns (3 sizes) and therefore \(df = (2-1)\cdot(3-1) = 2.\) The value of the test statistic is 4.8173 and the probability of obtaining such a large \(X^2\) statistic is approximately 0.09.

1-pchisq(4.8173, df=2)

The expected value for large orange marbles is 4.55 and therefore the chi-squared approximation may not be appropriate.

R Code

The easier and recommended way to perform a goodness-of-fit test or a test for independence is to use the R function chisq.test().

The hypothesized probabilities in Mendel's pea breeding experiment were 9/16, 3/16, 3/16, and 1/16. We assign these probabilities to a variable, say, prob.

prob = c(9, 3, 3, 1)/16

The observed counts were 315, 108, 101, and 32. We assign these counts to a variable, say, obs.

obs = c(315, 108, 101, 32)

All calculations of the goodness-of-fit test are then accomplished with the R function chisq.test().

chisq.test(obs, p=prob)

For a test for independence we have to arrange the observed frequencies in a matrix. We can do this for the marble example by combining the two rows of the table with the function rbind().

row1 = c(6, 16, 4)
row2 = c(41, 36, 17)
obs=rbind(row1, row2)

Alternatively, we can combine the three columns into a matrix using the function cbind(), or create the matrix directly with the function matrix(). The chi-square test for independence is then executed as follows.

chisq.test(obs)

Note that the results are exactly the same as those calculated in the more tedious way, above. There is, however, a warning message that the chi-square approximation may not be correct. The reason for this warning is that not all the expected values are large enough to justify the chi-square approximation. To avoid this problem we can estimate the p-value using a simulation.

chisq.test(obs, simulate.p.value=TRUE)

By default, the number of replications used is 2000. The estimate of the p-value becomes more precise if you increase this number. Let's use 100,000 replications.

chisq.test(obs, simulate.p.value=TRUE, B=100000)