Tukey vs. Student

R
tidyverse
inference
Tukey
Author

Paolo Bosetti

Published

April 10, 2025

Abstract

To compare two samples, or groups, we can use a T-test. But if we want to compare more than two groups, we need to use Tukey’s test. In this post we investigate why Tukey’s test is more appropriate and robust than a set of pairwise T-tests over all possible combinations of groups. This is also an excuse to illustrate the power of the purrr, dplyr, and tidyr packages, specifically the map/reduce, left_join, and pivot_longer/pivot_wider functions.

Packages that we need

In this example we use the packages tidyverse and adas.utils version 1.1.4 (see https://github.com/pbosetti/adas.utils).

library(tidyverse)
library(adas.utils)

Repeated T-test vs. Tukey’s test

The dataset

Let us compare the results of Tukey’s test with repeated Student’s T-tests on all pairwise combinations. We consider the cotton dataset, which is included in the adas.utils package from version 1.1.4 onward. The dataset contains the tensile strength of mixed cotton-synthetic yarns with different cotton content:

cotton %>% 
  ggplot(aes(x=Cotton, y=Strength, group=Cotton)) +
  geom_boxplot()
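
Before testing anything, it is also worth checking how many replicates we have per group and their spread. A minimal dplyr sketch, using the same Cotton and Strength columns as above:

cotton %>% 
  group_by(Cotton) %>% 
  summarise(n=n(), mean=mean(Strength), sd=sd(Strength))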

Inference on Strength

Now we want to compare all the possible combinations of treatments with a set of pairwise T-tests.

First, we create the list of pairwise combinations, sorting each pair in descending order, as it is done by the TukeyHSD function:

lvl <- levels(cotton$Cotton) %>% 
  combn(2, FUN=sort, decreasing=T) %>% 
  as_tibble(.name_repair="minimal") %>% 
  as.list() %>% glimpse()
List of 10
 $ : chr [1:2] "20" "15"
 $ : chr [1:2] "25" "15"
 $ : chr [1:2] "30" "15"
 $ : chr [1:2] "35" "15"
 $ : chr [1:2] "25" "20"
 $ : chr [1:2] "30" "20"
 $ : chr [1:2] "35" "20"
 $ : chr [1:2] "30" "25"
 $ : chr [1:2] "35" "25"
 $ : chr [1:2] "35" "30"
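
Incidentally, the same list can be built more directly by telling combn not to simplify its output into a matrix; this sketch should be equivalent to the pipeline above:

lvl <- combn(levels(cotton$Cotton), 2, 
             FUN=\(p) sort(p, decreasing=TRUE), 
             simplify=FALSE)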

Now, for each pair we run a T-test on the corresponding subset of the cotton data frame, accumulating the values of interest into a new tibble. We get the df table, which is analogous to the TukeyHSD output:

df <- lvl %>% reduce(\(acc, pair) {
  # T-test on the subset of cotton containing only the current pair
  tt <- cotton %>% 
    filter(Cotton %in% pair) %>% 
    t.test(Strength~Cotton, data=., var.equal=TRUE)
  # t.test() estimates mean(lower level) - mean(higher level); we flip the
  # signs so that diff matches TukeyHSD's higher-minus-lower convention.
  # The midpoint of the confidence interval is the estimated difference
  bind_rows(acc, list(
    pair = paste0(pair[1], "-", pair[2]),
    diff = -median(tt$conf.int),
    lwr = -tt$conf.int[2],
    upr = -tt$conf.int[1],
    p.value = tt$p.value
  ))
}, .init=tibble()) 

df %>% knitr::kable()
|pair  |  diff|         lwr|         upr|   p.value|
|:-----|-----:|-----------:|-----------:|---------:|
|20-15 |   5.6|   0.8740978|  10.3259022| 0.0257453|
|25-15 |   7.8|   3.7398608|  11.8601392| 0.0021967|
|30-15 |  11.8|   7.4246648|  16.1753352| 0.0002541|
|35-15 |   1.0|  -3.5423014|   5.5423014| 0.6253800|
|25-20 |   2.2|  -1.6724395|   6.0724395| 0.2265324|
|30-20 |   6.2|   1.9982605|  10.4017395| 0.0093233|
|35-20 |  -4.6|  -8.9753352|  -0.2246648| 0.0415629|
|30-25 |   4.0|   0.5641312|   7.4358688| 0.0277266|
|35-25 |  -6.8| -10.4461127|  -3.1538873| 0.0026133|
|35-30 | -10.8| -14.7941163|  -6.8058837| 0.0002496|
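
As a side note, base R can reproduce the same unadjusted p-values in a single call with pairwise.t.test; a sketch, where pool.sd=FALSE makes it run one test per pair and var.equal=TRUE is forwarded to t.test, as in the loop above:

with(cotton, pairwise.t.test(Strength, Cotton, 
  p.adjust.method="none", pool.sd=FALSE, var.equal=TRUE))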

To be compared with Tukey’s values:

ttdf <- TukeyHSD(aov(lm(Strength~Cotton, data=cotton)))$Cotton %>% 
  as.data.frame() %>% 
  rownames_to_column(var="pair") %>% 
  rename(p.value=`p adj`)
ttdf %>% knitr::kable()
|pair  |  diff|         lwr|        upr|   p.value|
|:-----|-----:|-----------:|----------:|---------:|
|20-15 |   5.6|   0.2270417| 10.9729583| 0.0385024|
|25-15 |   7.8|   2.4270417| 13.1729583| 0.0025948|
|30-15 |  11.8|   6.4270417| 17.1729583| 0.0000190|
|35-15 |   1.0|  -4.3729583|  6.3729583| 0.9797709|
|25-20 |   2.2|  -3.1729583|  7.5729583| 0.7372438|
|30-20 |   6.2|   0.8270417| 11.5729583| 0.0188936|
|35-20 |  -4.6|  -9.9729583|  0.7729583| 0.1162970|
|30-25 |   4.0|  -1.3729583|  9.3729583| 0.2101089|
|35-25 |  -6.8| -12.1729583| -1.4270417| 0.0090646|
|35-30 | -10.8| -16.1729583| -5.4270417| 0.0000624|

Now let’s join both tables and make a common plot:

compared <- df %>% 
  left_join(ttdf, by=join_by(pair), suffix=c(".student", ".tukey")) %>% 
  pivot_longer(-pair, names_to = c("stat", "test"), names_pattern = "(.*)\\.(student|tukey)$") %>% 
  pivot_wider(names_from = stat)

compared %>% knitr::kable()
|pair  |test    |  diff|         lwr|         upr|   p.value|
|:-----|:-------|-----:|-----------:|-----------:|---------:|
|20-15 |student |   5.6|   0.8740978|  10.3259022| 0.0257453|
|20-15 |tukey   |   5.6|   0.2270417|  10.9729583| 0.0385024|
|25-15 |student |   7.8|   3.7398608|  11.8601392| 0.0021967|
|25-15 |tukey   |   7.8|   2.4270417|  13.1729583| 0.0025948|
|30-15 |student |  11.8|   7.4246648|  16.1753352| 0.0002541|
|30-15 |tukey   |  11.8|   6.4270417|  17.1729583| 0.0000190|
|35-15 |student |   1.0|  -3.5423014|   5.5423014| 0.6253800|
|35-15 |tukey   |   1.0|  -4.3729583|   6.3729583| 0.9797709|
|25-20 |student |   2.2|  -1.6724395|   6.0724395| 0.2265324|
|25-20 |tukey   |   2.2|  -3.1729583|   7.5729583| 0.7372438|
|30-20 |student |   6.2|   1.9982605|  10.4017395| 0.0093233|
|30-20 |tukey   |   6.2|   0.8270417|  11.5729583| 0.0188936|
|35-20 |student |  -4.6|  -8.9753352|  -0.2246648| 0.0415629|
|35-20 |tukey   |  -4.6|  -9.9729583|   0.7729583| 0.1162970|
|30-25 |student |   4.0|   0.5641312|   7.4358688| 0.0277266|
|30-25 |tukey   |   4.0|  -1.3729583|   9.3729583| 0.2101089|
|35-25 |student |  -6.8| -10.4461127|  -3.1538873| 0.0026133|
|35-25 |tukey   |  -6.8| -12.1729583|  -1.4270417| 0.0090646|
|35-30 |student | -10.8| -14.7941163|  -6.8058837| 0.0002496|
|35-30 |tukey   | -10.8| -16.1729583|  -5.4270417| 0.0000624|

compared %>%
  ggplot(aes(x=diff, y=pair, color=test)) + 
  geom_point() + 
  geom_errorbar(aes(xmin=lwr, xmax=upr), width=0.5, position=position_dodge()) + 
  geom_vline(xintercept=0, color="red") +
  labs(x="Difference", y="Pair", title="95% pairwise confidence level")

The Family-Wise Error Rate

As expected, Tukey’s test in the last plot shows larger confidence intervals; that is, it has a reduced chance of a false positive (Type I error). More specifically, Tukey’s test controls the family-wise error rate (FWER): the probability of making at least one false positive in the full set of comparisons.

Let’s see why. If we set a confidence level of 0.95, it means that the probability of not making a Type I error on a single T-test is 0.95.

For 3 independent tests, the probability of no Type I error at all (in any of the tests) is:

\[ 0.95^3 \approx 0.857 \]

So the chance of making at least one Type I error is:

\[ 1 - 0.95^3 \approx 0.143 \quad \text{(14.3\%)} \]

That’s almost triple the risk you thought you were accepting! Furthermore, this risk keeps growing with the number of comparisons, which itself grows quadratically with the number of groups. Given \(n\) elements, the number of possible combinations of \(k\) elements is given by the binomial coefficient:

\[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]

In R, the latter is provided by the choose(n, k) function:

choose(5, 2)
[1] 10
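
For the five cotton levels of our dataset, that means 10 pairwise comparisons, so the chance of committing at least one Type I error, assuming independent tests, is already about 40%:

1 - 0.95^choose(5, 2)
[1] 0.4012631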

So, with an increasing number of classes to be compared, this is what happens to the probability of committing at least one Type I error:

2:15 %>% reduce(\(acc, k) {
    nt <- choose(k, 2)
    bind_rows(acc,
      list(n=k, p=1-0.95^nt)
    )
  }, .init=tibble()) %>% 
  ggplot(aes(x=n, y=p)) +
  geom_point() +
  geom_line() +
  ylim(0, 1) +
  labs(
    x="number of classes", 
    y="probability of Type-I Error")

And what happens if we change the confidence level? Let’s see, by creating a parametric plot similar to the last one, but with different confidence levels. First, we refactor the last reduce operation into a function, FWER, that takes the confidence level as an argument. The function returns a tibble with the number of classes and the corresponding probability of a Type I error.

FWER <- function(levels, conf.int=0.95) {
  reduce(levels, \(acc, k) {
    nt <- choose(k, 2)
    bind_rows(acc,
      list(n=k, p=1-conf.int^nt)
    )
  }, .init=tibble()) 
}
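
As a quick sanity check, the function reproduces the three-class figure computed above:

FWER(3)$p
[1] 0.142625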

Then we apply the FWER function to a set of confidence levels, join the results into a single tibble via the usual reduce, and finally plot the results:

cl <- c(0.9, 0.95, 0.99, 0.995, 0.999) 
N <- 2:15
cl %>% 
  reduce(\(acc, ci) {
    fwer <- FWER(N, ci) %>% rename(!!paste0("cl-", ci):=p)
    left_join(acc, fwer, by=join_by(n))
  }, .init=tibble(n=N)) %>% 
  pivot_longer(-n, names_to = c(NA, "cl"), names_pattern="(cl-)(.*)") %>% 
  ggplot(aes(x=n, y=value, color=cl)) + 
  geom_point() + 
  geom_line() + 
  ylim(0, 1) + 
  labs(
    x="number of classes", 
    y="probability of Type-I Error", 
    color="Conf. level")

That’s all, folks!