When comparing several means, one is often interested in determining the highest or lowest mean. More generally, one might wish to determine the order of these means.
As an example, consider the scores obtained by students on a uniform final exam. The students are from five different sections of the same statistics course. The test was a machine-graded multiple-choice exam. Here is the data.
| final.scores <- read.csv("final.scores.csv") |
| boxplot(score ~ section, data=final.scores) |
From this boxplot I cannot tell which of the two sections, 2 or 5, has the higher mean. Here is a way to calculate these means.
| aggregate(score ~ section, data=final.scores, mean) |
The ordering in the sample means suggests the following order in the population means. \[\mu_3 \lt \mu_1 \lt \mu_5 \lt \mu_2 \lt \mu_4.\]
Is there evidence that \(\mu_3 \lt \mu_1\)? If not, is there evidence that \(\mu_3 \lt \mu_5\)? With five sections there are potentially 10 pairwise comparisons before we even do the experiment. Suppose that the chance of an error is 5% in each comparison. What would be the chance of making at least one error in these 10 comparisons? If we published research containing many hypothesis tests, each with a 5% error rate, then there is a good chance that at least some of our results would be wrong. This is the problem we want to address in this lesson.
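For a rough sense of the numbers, here is a small calculation assuming, for simplicity, that the 10 comparisons are independent:
| choose(5, 2)   # number of pairwise comparisons among 5 sections |
| 1 - 0.95^10    # chance of at least one error across 10 independent 5% tests, about 40% |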
We would like all the results we obtain from an experiment to be very likely correct.
The familywise error rate of a procedure consisting of multiple hypothesis tests is the probability of at least one type I error in these tests when the procedure is applied to appropriately collected data.
The familywise error rate is also known as the experimentwise or the simultaneous error rate.
The familywise confidence level of a procedure calculating multiple confidence intervals is the probability that each interval will contain the value it is estimating when the procedure is applied to appropriately collected data.
The familywise confidence level is also known as the experimentwise or the simultaneous confidence level.
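To illustrate the idea, the following small simulation (all numbers are hypothetical and chosen purely for illustration) computes \(k = 10\) independent 95% confidence intervals for a known mean many times and records how often all of them cover it; the resulting familywise confidence level is roughly \(0.95^{10} \approx 0.60\), well below 95%.
| # Hypothetical simulation: familywise coverage of k = 10 independent 95% intervals |
| set.seed(1) |
| k <- 10 |
| all.cover <- replicate(10000, { |
|   covers <- replicate(k, { |
|     x <- rnorm(30, mean = 0)         # sample from a population with true mean 0 |
|     ci <- t.test(x)$conf.int         # individual 95% confidence interval |
|     ci[1] <= 0 && 0 <= ci[2]         # does this interval cover the true mean? |
|   }) |
|   all(covers)                        # do all k intervals cover it simultaneously? |
| }) |
| mean(all.cover)                      # estimated familywise confidence level, about 0.60 |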
A popular, though typically not the best, method for controlling the familywise error rate or the familywise confidence level is the so-called Bonferroni correction.
Suppose that \(A_1\) and \(A_2\) are two events from a sample space. Recall that \[P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2)\] and therefore \[P(A_1 \cup A_2) \leq P(A_1) + P(A_2).\]
More generally, one can show that \[P(A_1 \cup A_2 \cup \ldots \cup A_k) \leq P(A_1) + P(A_2) + \ldots + P(A_k).\] This inequality is known as Boole's inequality. In words, the probability that any of the events \(A_i\) is going to happen is less than or equal to the sum of the probabilities \(P(A_i)\).
Now let \(A_i\) denote the event \(\theta_i\notin I_i\). Then Boole's inequality implies that the probability that at least one interval \(I_i\) will not contain the parameter \(\theta_i\) is at most the sum of the probabilities \(P(\theta_i\notin I_i)\).
Now suppose that you calculate each interval \(I_i\) with a method for which \(P(\theta_i\notin I_i)=\alpha\). Then the probability that at least one interval \(I_i\) will not contain the parameter \(\theta_i\) is at most \(k\alpha.\) It follows that the familywise confidence level is at least \(1-k\alpha.\) A similar argument can be made for the familywise error rate.
Bonferroni: In order to get a familywise error rate of \(\alpha\) for a family of \(k\) hypothesis tests, perform each test at significance level \(\alpha/k\). In order to get a familywise confidence level of \(1-\alpha\) for a family of \(k\) confidence intervals, calculate each interval at confidence level \(1-\alpha/k\).
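In R, the same adjustment can be applied to a vector of p-values with the base function p.adjust(); the p-values below are made up purely for illustration.
| p.raw <- c(0.004, 0.012, 0.030, 0.200)   # hypothetical unadjusted p-values from k = 4 tests |
| p.adjust(p.raw, method = "bonferroni")   # each p-value is multiplied by k (capped at 1) |
Comparing each adjusted p-value with \(\alpha\) is equivalent to comparing the raw p-values with \(\alpha/k\).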
Although the Bonferroni method is popular, it should never be used since there are better methods available (e.g. the Holm method). See Adjusting for Multiple Testing When Reporting Research Results: The Bonferroni vs Holm Methods, American Journal of Public Health, 1996; 86: 726-728.
| with(data=final.scores, pairwise.t.test(score, section)) # uses Holm by default |
For making pairwise comparisons of group means, the Tukey-Kramer method is recommended. This method calculates p-values that have been adjusted for multiple testing. This method makes the same assumptions as those for ANOVA.
| fm.aov <- aov(score ~ section, data=final.scores) |
| TukeyHSD(fm.aov, ordered=TRUE) |
It can be a bit tedious to work out from the output which sections are significantly different. Below is R code for a letter-based display that makes it easier. It requires the multcomp library, and it also requires that the variable section be of class factor. To determine the class of a variable, use the function class().
| class(final.scores$section) |
One can change the class of the variable section from "character" to "factor" by overwriting it as follows.
| final.scores$section <- factor(final.scores$section) |
An alternative solution is to import the file again, this time checking the box for "Strings as factors" in the menu that appears after clicking the "Import Dataset" button.
| library(multcomp) |
| fm.aov <- aov(score ~ section, data=final.scores) |
| fm.tukey <- glht(fm.aov, linfct = mcp(section = "Tukey")) |
| cld(fm.tukey) # compact letter display |
Sections sharing the same letter are not significantly different at the chosen significance level (default is 5%).
According to the output, the scores in section 3 are significantly lower than the scores in sections 2 and 4.
Confidence intervals for the pairwise differences can be found using the function confint().
| confint(fm.tukey) |
Suppose you have several treatments and one control group. In that case you may only be interested in comparing each treatment group with the control group. In such a case Dunnett's method is recommended.
| library(multcomp) |
| fm.aov <- aov(score ~ section, data=final.scores) |
| fm.dunnett <- glht(fm.aov, linfct = mcp(section = "Dunnett")) |
| summary(fm.dunnett) |
By default, all comparisons are made with the first level of the factor section (section 1 in our example). To change the reference level to section 3, you can proceed as follows.
| final.scores$section <- relevel(final.scores$section, ref="s3") |
| fm.aov <- aov(score ~ section, data=final.scores) |
| fm.dunnett <- glht(fm.aov, linfct = mcp(section = "Dunnett")) |
| summary(fm.dunnett) |
Confidence intervals for the differences between the groups and the control can be found as follows.
| confint(fm.dunnett) |
The Tukey-Kramer method and Dunnett's method require that the data are normally distributed and that the variances do not differ between the groups. One needs to check these assumptions by investigating the residuals (e.g. using qqnorm() and a plot of the residuals against the group means). If these assumptions are not met, then we need to analyze the data with a different method.
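For example, the residual checks mentioned above could be carried out along the following lines (a minimal sketch using the aov fit from earlier):
| res <- residuals(fm.aov) |
| qqnorm(res); qqline(res)       # normal quantile plot of the residuals |
| plot(fitted(fm.aov), res)      # residuals against the group means (fitted values) |
| abline(h = 0, lty = 2) |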
One option is to use the Pairwise Multiple Comparison of Mean Rank plus (PMCMRplus) package. This package contains implementations of various tests that can be used for multiple comparisons.
The following robust method is designed for pairwise comparisons of all medians, not means. First, one needs to perform the Kruskal-Wallis test to check whether there is evidence that not all the medians are the same. If this test indicates significance, then one can use one of the many multiple comparison methods available in the PMCMRplus library. Below is a quick way to run the Kruskal-Wallis test, followed by an example of Conover's all-pairs comparison test.
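The Kruskal-Wallis test itself is available in base R; a minimal sketch using the same formula as above:
| kruskal.test(score ~ section, data=final.scores) |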
| library(PMCMRplus) |
| kwAllPairsConoverTest(score ~ section, data=final.scores) |
This package can also perform the standard multiple comparison tests discussed above and many more.
| summaryGroup(tukeyTest(fm.aov)) |
| dunnettTest(fm.aov) |
A one-way ANOVA was conducted to compare final exam scores across five sections of an undergraduate statistics course. There was a statistically significant difference in performance among the sections, F(4, 119) = 5.43, p = .049, ω² = .05. Post hoc comparisons using Tukey’s HSD test indicated that Section 3 (M = 68.79, SD = 13.24) scored significantly lower than Section 2 (M = 80.19, SD = 12.16) and Section 4 (M = 80.56, SD = 14.33), p = .045 for both comparisons. No other pairwise comparisons were significant.
Note that the report includes the recommended measure ω² of the effect size of the one-way ANOVA. Omega squared (ω²) tells us how much variability in the outcome is explained by group differences. In our case, 5% of the variance in exam scores was due to section membership, indicating a small but potentially important effect.
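One way to obtain \(\omega^2\) is to compute it by hand from the ANOVA table using the standard formula \[\omega^2 = \frac{SS_{\text{between}} - df_{\text{between}}\, MS_{\text{within}}}{SS_{\text{total}} + MS_{\text{within}}}.\] The sketch below does this for our aov fit (the variable names are my own):
| tab <- anova(fm.aov)                        # ANOVA table of the fitted model |
| SSb <- tab["section", "Sum Sq"];  dfb <- tab["section", "Df"] |
| SSw <- tab["Residuals", "Sum Sq"]; MSw <- tab["Residuals", "Mean Sq"] |
| (SSb - dfb * MSw) / (SSb + SSw + MSw)       # omega squared |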