Tukey vs. Student

R
tidyverse
inference
Tukey
Author

Paolo Bosetti

Published

April 10, 2025

Abstract

To compare two samples, or groups, we can use a T-test. But if we want to compare more than two groups, we need Tukey’s test. In this post we investigate why Tukey’s test is more appropriate and robust than a set of pairwise T-tests over all possible combinations of groups. This is also an excuse to illustrate the power of the purrr, dplyr, and tidyr packages, specifically the map/reduce, left_join, and pivot_longer/pivot_wider functions.

Packages that we need

In this example, we use the packages tidyverse and adas.utils version 1.1.4 (see https://github.com/pbosetti/adas.utils).
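
If adas.utils is not yet installed, it can be fetched from the GitHub repository linked above, for instance with the remotes package:

# one-time setup, if needed
remotes::install_github("pbosetti/adas.utils")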

library(tidyverse)
library(adas.utils)

Repeated T-test vs. Tukey’s test

The dataset

Let us compare the results of Tukey’s test with repeated Student’s T-tests on all pairwise combinations. We consider the cotton dataset, which is included in the adas.utils package from version 1.1.4 onward. The dataset contains the tensile strength of mixed cotton-synthetic yarns with different cotton content:

cotton %>% 
  ggplot(aes(x=Cotton, y=Strength, group=Cotton)) +
  geom_boxplot()
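
If you want to peek at the raw data first, a glimpse shows the two columns used throughout this post, the Cotton factor and the numeric Strength response:

cotton %>% glimpse()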

Inference on Strength

Now we want to compare all the possible combinations of treatments with a set of pairwise T-tests.

First, we create the list of pairwise combinations, sorting each pair in descending order, as done by the TukeyHSD function:

lvl <- levels(cotton$Cotton) %>% 
  combn(2, FUN=sort, decreasing=TRUE) %>%  # all pairs, each sorted in descending order
  as_tibble(.name_repair="minimal") %>% 
  as.list() %>%                            # one list element per pair
  glimpse()
List of 10
 $ : chr [1:2] "20" "15"
 $ : chr [1:2] "25" "15"
 $ : chr [1:2] "30" "15"
 $ : chr [1:2] "35" "15"
 $ : chr [1:2] "25" "20"
 $ : chr [1:2] "30" "20"
 $ : chr [1:2] "35" "20"
 $ : chr [1:2] "30" "25"
 $ : chr [1:2] "35" "25"
 $ : chr [1:2] "35" "30"

Now, for each pair we run a T-test on the corresponding subset of the cotton data frame, and accumulate the values of interest into a new tibble. We get a df table analogous to the TukeyHSD output:

df <- lvl %>% reduce(\(acc, pair) {
  tt <- cotton %>% 
    filter(Cotton %in% pair) %>% 
    t.test(Strength~Cotton, data=., var.equal=TRUE)
  # t.test() orders the two groups by ascending factor level, while the
  # pairs are labeled in descending order: hence the sign flips below
  bind_rows(acc, list(
    pair = paste0(pair[1], "-", pair[2]),
    diff = -median(tt$conf.int),  # the CI midpoint is the estimated difference
    lwr = -tt$conf.int[2],
    upr = -tt$conf.int[1],
    p.value = tt$p.value
  ))
}, .init=tibble()) 

df %>% knitr::kable()
|pair  |  diff|         lwr|         upr|   p.value|
|:-----|-----:|-----------:|-----------:|---------:|
|20-15 |   5.6|   0.8740978|  10.3259022| 0.0257453|
|25-15 |   7.8|   3.7398608|  11.8601392| 0.0021967|
|30-15 |  11.8|   7.4246648|  16.1753352| 0.0002541|
|35-15 |   1.0|  -3.5423014|   5.5423014| 0.6253800|
|25-20 |   2.2|  -1.6724395|   6.0724395| 0.2265324|
|30-20 |   6.2|   1.9982605|  10.4017395| 0.0093233|
|35-20 |  -4.6|  -8.9753352|  -0.2246648| 0.0415629|
|30-25 |   4.0|   0.5641312|   7.4358688| 0.0277266|
|35-25 |  -6.8| -10.4461127|  -3.1538873| 0.0026133|
|35-30 | -10.8| -14.7941163|  -6.8058837| 0.0002496|
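
As a side check, base R’s pairwise.t.test() should reproduce the unadjusted p-values above in a single call; this is a minimal sketch, where pool.sd=FALSE makes it run a separate t.test() per pair and var.equal=TRUE is forwarded to each test:

with(cotton, pairwise.t.test(Strength, Cotton, p.adjust.method="none",
                             pool.sd=FALSE, var.equal=TRUE))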

To be compared with Tukey’s values:

ttdf <- TukeyHSD(aov(lm(Strength~Cotton, data=cotton)))$Cotton %>% 
  as.data.frame() %>% 
  rownames_to_column(var="pair") %>% 
  rename(p.value=`p adj`)
ttdf %>% knitr::kable()
|pair  |  diff|         lwr|         upr|   p.value|
|:-----|-----:|-----------:|-----------:|---------:|
|20-15 |   5.6|   0.2270417|  10.9729583| 0.0385024|
|25-15 |   7.8|   2.4270417|  13.1729583| 0.0025948|
|30-15 |  11.8|   6.4270417|  17.1729583| 0.0000190|
|35-15 |   1.0|  -4.3729583|   6.3729583| 0.9797709|
|25-20 |   2.2|  -3.1729583|   7.5729583| 0.7372438|
|30-20 |   6.2|   0.8270417|  11.5729583| 0.0188936|
|35-20 |  -4.6|  -9.9729583|   0.7729583| 0.1162970|
|30-25 |   4.0|  -1.3729583|   9.3729583| 0.2101089|
|35-25 |  -6.8| -12.1729583|  -1.4270417| 0.0090646|
|35-30 | -10.8| -16.1729583|  -5.4270417| 0.0000624|

Now let’s join both tables and make a common plot:

compared <- df %>% 
  left_join(ttdf, by=join_by(pair), suffix=c(".student", ".tukey")) %>% 
  pivot_longer(-pair, names_to = c("stat", "test"), names_pattern = "(.*)\\.(student|tukey)$") %>% 
  pivot_wider(names_from = stat)

compared %>% knitr::kable()
|pair  |test    |  diff|         lwr|         upr|   p.value|
|:-----|:-------|-----:|-----------:|-----------:|---------:|
|20-15 |student |   5.6|   0.8740978|  10.3259022| 0.0257453|
|20-15 |tukey   |   5.6|   0.2270417|  10.9729583| 0.0385024|
|25-15 |student |   7.8|   3.7398608|  11.8601392| 0.0021967|
|25-15 |tukey   |   7.8|   2.4270417|  13.1729583| 0.0025948|
|30-15 |student |  11.8|   7.4246648|  16.1753352| 0.0002541|
|30-15 |tukey   |  11.8|   6.4270417|  17.1729583| 0.0000190|
|35-15 |student |   1.0|  -3.5423014|   5.5423014| 0.6253800|
|35-15 |tukey   |   1.0|  -4.3729583|   6.3729583| 0.9797709|
|25-20 |student |   2.2|  -1.6724395|   6.0724395| 0.2265324|
|25-20 |tukey   |   2.2|  -3.1729583|   7.5729583| 0.7372438|
|30-20 |student |   6.2|   1.9982605|  10.4017395| 0.0093233|
|30-20 |tukey   |   6.2|   0.8270417|  11.5729583| 0.0188936|
|35-20 |student |  -4.6|  -8.9753352|  -0.2246648| 0.0415629|
|35-20 |tukey   |  -4.6|  -9.9729583|   0.7729583| 0.1162970|
|30-25 |student |   4.0|   0.5641312|   7.4358688| 0.0277266|
|30-25 |tukey   |   4.0|  -1.3729583|   9.3729583| 0.2101089|
|35-25 |student |  -6.8| -10.4461127|  -3.1538873| 0.0026133|
|35-25 |tukey   |  -6.8| -12.1729583|  -1.4270417| 0.0090646|
|35-30 |student | -10.8| -14.7941163|  -6.8058837| 0.0002496|
|35-30 |tukey   | -10.8| -16.1729583|  -5.4270417| 0.0000624|

compared %>%
  ggplot(aes(x=diff, y=pair, color=test)) + 
  geom_point() + 
  geom_errorbar(aes(xmin=lwr, xmax=upr), width=0.5, position=position_dodge()) + 
  geom_vline(xintercept=0, color="red") +
  labs(x="Difference", y="Pair", title="95% pairwise confidence level")

The Family-Wise Error Rate

As expected, Tukey’s test in the last plot shows larger confidence intervals; that is, it has a reduced chance of a false positive (Type I error). More specifically, Tukey’s test controls the family-wise error rate (FWER), the probability of making at least one false positive across the full set of comparisons.

Let’s see why. If we set a confidence level of 0.95, it means that the probability of not making a Type I error on a single T-test is 0.95.
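
In general, for $m$ independent tests, each carried out at confidence level $1-\alpha$, the probability of making at least one Type I error is:

$$\mathrm{FWER} = 1 - (1-\alpha)^m$$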

For 3 independent tests, the probability of no Type I error at all (in any of the tests) is:

$$0.95^3 \approx 0.857$$

So the chance of making at least one Type I error is:

$$1 - 0.95^3 \approx 0.143 \quad (14.3\%)$$

That’s almost triple the risk you thought you were accepting! Furthermore, this risk grows quickly with the number of comparisons. Given $n$ elements, the number of possible combinations of $k$ elements is given by the binomial coefficient:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

In R, the latter is provided by the choose(n, k) function:

choose(5, 2)
[1] 10
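
Coming back to the three-test example, we can also verify the FWER arithmetic directly in R:

1 - 0.95^3
[1] 0.142625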

So, with an increasing number of classes to be compared, this is what happens to the probability of committing at least one Type I error:

2:15 %>% reduce(\(acc, k) {
    nt <- choose(k, 2)  # number of pairwise comparisons among k classes
    bind_rows(acc,
      list(n=k, p=1-0.95^nt)  # FWER, assuming independent tests
    )
  }, .init=tibble()) %>% 
  ggplot(aes(x=n, y=p)) +
  geom_point() +
  geom_line() +
  ylim(0, 1) +
  labs(
    x="number of classes", 
    y="probability of Type-I Error")

And what happens if we change the confidence level? Let’s see, by creating a parametric plot similar to the last one, but with different confidence levels. First, we refactor the last reduce operation into a function, FWER, that takes the confidence level as an argument. The function returns a tibble with the number of classes and the corresponding probability of a Type I error.

FWER <- function(levels, conf.int=0.95) {
  # for each number of classes in `levels`, compute the probability of at
  # least one Type I error over all pairwise comparisons
  reduce(levels, \(acc, k) {
    nt <- choose(k, 2)
    bind_rows(acc,
      list(n=k, p=1-conf.int^nt)
    )
  }, .init=tibble()) 
}
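
As a quick sanity check of the function just defined: with 5 classes there are choose(5, 2) = 10 pairwise tests, so the FWER at the default 0.95 confidence level is 1 - 0.95^10, about 0.40.

# probability of at least one Type I error for 2 to 5 classes
FWER(2:5)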

Then we apply the FWER function to a set of confidence levels, join the results into a single tibble via the usual reduce, and finally plot the results:

cl <- c(0.9, 0.95, 0.99, 0.995, 0.999) 
N <- 2:15
cl %>% 
  reduce(\(acc, ci) {
    fwer <- FWER(N, ci) %>% rename(!!paste0("cl-", ci):=p)
    left_join(acc, fwer, by=join_by(n))
  }, .init=tibble(n=N)) %>% 
  pivot_longer(-n, names_to = c(NA, "cl"), names_pattern="(cl-)(.*)") %>% 
  ggplot(aes(x=n, y=value, color=cl)) + 
  geom_point() + 
  geom_line() + 
  ylim(0, 1) + 
  labs(
    x="number of classes", 
    y="probability of Type-I Error", 
    color="Conf. level")

That’s all, folks!