library(tidyverse)
library(adas.utils)
Tukey vs. Student
To compare two samples, or groups, we can use a T-test. But if we want to compare more than two groups, we should rather use Tukey’s test. In this post we investigate why Tukey’s test is more appropriate and robust than a set of pairwise T-tests over all possible combinations of groups. This is also an excuse to illustrate the power of the purrr and dplyr packages, specifically the map/reduce, left_join, and pivot_longer/pivot_wider functions.
In this example we are using the packages tidyverse and adas.utils, version 1.1.4 (see https://github.com/pbosetti/adas.utils).
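Since we will lean on purrr’s reduce throughout, here is a minimal, self-contained sketch of the fold pattern (an illustration of ours, not part of the original analysis): starting from an empty tibble, each step appends one row.

1:3 %>% reduce(\(acc, i) bind_rows(acc, tibble(x=i, x2=i^2)), .init=tibble())
# builds a 3-row tibble with columns x and x2, one row per element of 1:3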
Repeated T-test vs. Tukey’s test
The dataset
Let us compare the result of Tukey’s test with repeated Student’s T-tests on all combinations. We consider the cotton dataset, which is included in the adas.utils package from version 1.1.4 onward. The dataset contains the tensile strength of mixed cotton-synthetic yarns with different cotton content.
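Before plotting, let’s take a quick peek at the data structure (a sanity check of ours, not in the original post):

cotton %>% glimpse()
# the columns of interest are Cotton, a factor of cotton percentages
# (15, 20, 25, 30, 35), and Strength, the measured tensile strength

A box plot of Strength for each cotton content gives a first impression of the group differences: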
cotton %>%
  ggplot(aes(x=Cotton, y=Strength, group=Cotton)) +
  geom_boxplot()
Inference on Strength
Now we want to compare all the possible combinations of treatments with a set of pairwise T-tests.
First, we create the list of pairwise combinations, sorting each pair in descending order, as is done by the TukeyHSD function:
lvl <- levels(cotton$Cotton) %>%
  combn(2, FUN=sort, decreasing=T) %>%
  as_tibble(.name_repair="minimal") %>%
  as.list() %>% glimpse()
List of 10
$ : chr [1:2] "20" "15"
$ : chr [1:2] "25" "15"
$ : chr [1:2] "30" "15"
$ : chr [1:2] "35" "15"
$ : chr [1:2] "25" "20"
$ : chr [1:2] "30" "20"
$ : chr [1:2] "35" "20"
$ : chr [1:2] "30" "25"
$ : chr [1:2] "35" "25"
$ : chr [1:2] "35" "30"
Now, for each pair we do a T-test on the corresponding subset of the cotton data frame, and accumulate the values of interest into a new tibble. We get the df table, which is analogous to the TukeyHSD output:
df <- lvl %>% reduce(\(acc, pair) {
  tt <- cotton %>%
    filter(Cotton %in% pair) %>%
    t.test(Strength~Cotton, data=., var.equal=TRUE)
  bind_rows(acc, list(
    pair = paste0(pair[1], "-", pair[2]),
    diff = -median(tt$conf.int),
    lwr = -tt$conf.int[2],
    upr = -tt$conf.int[1],
    p.value = tt$p.value
  ))
}, .init=tibble())
df %>% knitr::kable()
pair | diff | lwr | upr | p.value |
---|---|---|---|---|
20-15 | 5.6 | 0.8740978 | 10.3259022 | 0.0257453 |
25-15 | 7.8 | 3.7398608 | 11.8601392 | 0.0021967 |
30-15 | 11.8 | 7.4246648 | 16.1753352 | 0.0002541 |
35-15 | 1.0 | -3.5423014 | 5.5423014 | 0.6253800 |
25-20 | 2.2 | -1.6724395 | 6.0724395 | 0.2265324 |
30-20 | 6.2 | 1.9982605 | 10.4017395 | 0.0093233 |
35-20 | -4.6 | -8.9753352 | -0.2246648 | 0.0415629 |
30-25 | 4.0 | 0.5641312 | 7.4358688 | 0.0277266 |
35-25 | -6.8 | -10.4461127 | -3.1538873 | 0.0026133 |
35-30 | -10.8 | -14.7941163 | -6.8058837 | 0.0002496 |
To be compared with Tukey’s values:
ttdf <- TukeyHSD(aov(lm(Strength~Cotton, data=cotton)))$Cotton %>%
  as.data.frame() %>%
  rownames_to_column(var="pair") %>%
  rename(p.value=`p adj`)
ttdf %>% knitr::kable()
pair | diff | lwr | upr | p.value |
---|---|---|---|---|
20-15 | 5.6 | 0.2270417 | 10.9729583 | 0.0385024 |
25-15 | 7.8 | 2.4270417 | 13.1729583 | 0.0025948 |
30-15 | 11.8 | 6.4270417 | 17.1729583 | 0.0000190 |
35-15 | 1.0 | -4.3729583 | 6.3729583 | 0.9797709 |
25-20 | 2.2 | -3.1729583 | 7.5729583 | 0.7372438 |
30-20 | 6.2 | 0.8270417 | 11.5729583 | 0.0188936 |
35-20 | -4.6 | -9.9729583 | 0.7729583 | 0.1162970 |
30-25 | 4.0 | -1.3729583 | 9.3729583 | 0.2101089 |
35-25 | -6.8 | -12.1729583 | -1.4270417 | 0.0090646 |
35-30 | -10.8 | -16.1729583 | -5.4270417 | 0.0000624 |
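As an aside, the same per-pair table could also be built with map() plus list_rbind() instead of reduce() (a sketch of ours under the same assumptions; list_rbind() requires purrr 1.0 or later):

df2 <- lvl %>%
  map(\(pair) {
    tt <- cotton %>%
      filter(Cotton %in% pair) %>%
      t.test(Strength~Cotton, data=., var.equal=TRUE)
    # one single-row tibble per pair, then row-bind them all
    tibble(pair = paste0(pair[1], "-", pair[2]),
           diff = -median(tt$conf.int),
           lwr = -tt$conf.int[2],
           upr = -tt$conf.int[1],
           p.value = tt$p.value)
  }) %>%
  list_rbind()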
Now let’s join both tables and make a common plot:
compared <- df %>%
  left_join(ttdf, by=join_by(pair), suffix=c(".student", ".tukey")) %>%
  pivot_longer(-pair, names_to = c("stat", "test"), names_pattern = "(.*)\\.(student|tukey)$") %>%
  pivot_wider(names_from = stat)
compared %>% knitr::kable()
pair | test | diff | lwr | upr | p.value |
---|---|---|---|---|---|
20-15 | student | 5.6 | 0.8740978 | 10.3259022 | 0.0257453 |
20-15 | tukey | 5.6 | 0.2270417 | 10.9729583 | 0.0385024 |
25-15 | student | 7.8 | 3.7398608 | 11.8601392 | 0.0021967 |
25-15 | tukey | 7.8 | 2.4270417 | 13.1729583 | 0.0025948 |
30-15 | student | 11.8 | 7.4246648 | 16.1753352 | 0.0002541 |
30-15 | tukey | 11.8 | 6.4270417 | 17.1729583 | 0.0000190 |
35-15 | student | 1.0 | -3.5423014 | 5.5423014 | 0.6253800 |
35-15 | tukey | 1.0 | -4.3729583 | 6.3729583 | 0.9797709 |
25-20 | student | 2.2 | -1.6724395 | 6.0724395 | 0.2265324 |
25-20 | tukey | 2.2 | -3.1729583 | 7.5729583 | 0.7372438 |
30-20 | student | 6.2 | 1.9982605 | 10.4017395 | 0.0093233 |
30-20 | tukey | 6.2 | 0.8270417 | 11.5729583 | 0.0188936 |
35-20 | student | -4.6 | -8.9753352 | -0.2246648 | 0.0415629 |
35-20 | tukey | -4.6 | -9.9729583 | 0.7729583 | 0.1162970 |
30-25 | student | 4.0 | 0.5641312 | 7.4358688 | 0.0277266 |
30-25 | tukey | 4.0 | -1.3729583 | 9.3729583 | 0.2101089 |
35-25 | student | -6.8 | -10.4461127 | -3.1538873 | 0.0026133 |
35-25 | tukey | -6.8 | -12.1729583 | -1.4270417 | 0.0090646 |
35-30 | student | -10.8 | -14.7941163 | -6.8058837 | 0.0002496 |
35-30 | tukey | -10.8 | -16.1729583 | -5.4270417 | 0.0000624 |
compared %>%
  ggplot(aes(x=diff, y=pair, color=test)) +
  geom_point() +
  geom_errorbar(aes(xmin=lwr, xmax=upr), width=0.5, position=position_dodge()) +
  geom_vline(xintercept=0, color="red") +
  labs(x="Difference", y="Pair", title="95% pairwise confidence level")
The Family-Wise Error Rate
As expected, Tukey’s test in the last plot shows larger confidence intervals, that is, it has a reduced chance of a false positive (Type I error). More specifically, Tukey’s test controls the family-wise error rate (FWER), the probability of making any false positive in the full set of comparisons.
Let’s see why. If we set a confidence level of 0.95, the probability of not making a Type I error on a single T-test is 0.95. For 3 independent tests, the probability of making no Type I error at all (in any of the tests) is 0.95^3 ≈ 0.857, so the FWER is already 1 - 0.95^3 ≈ 0.143. In general, comparing k classes takes one test per pair of classes, and the number of pairs is the binomial coefficient, which we can compute with the choose(n, k) function:
choose(5, 2)
[1] 10
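For example, with 5 classes and thus 10 pairwise tests, the chance of at least one false positive (still assuming independent tests at a 0.95 per-test confidence level) already grows to about 40%:

1 - 0.95^choose(5, 2)

[1] 0.4012631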
So, with an increasing number of classes to be compared, this is what happens to the probability of committing at least one Type-I error:
2:15 %>% reduce(\(acc, k) {
  nt <- choose(k, 2)
  bind_rows(acc,
            list(n=k, p=1-0.95^nt))
}, .init=tibble()) %>%
  ggplot(aes(x=n, y=p)) +
  geom_point() +
  geom_line() +
  ylim(0, 1) +
  labs(
    x="number of classes",
    y="probability of Type-I Error")
And what happens if we change the confidence level? Let’s see, by creating a parametric plot similar to the last one, but with different confidence levels. First we factor the last reduce operation into a function, FWER, that takes the confidence level as an argument. The function returns a tibble with the number of classes and the corresponding probability of a Type I error.
FWER <- function(levels, conf.int=0.95) {
  reduce(levels, \(acc, k) {
    nt <- choose(k, 2)
    bind_rows(acc,
              list(n=k, p=1-conf.int^nt))
  }, .init=tibble())
}
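For instance, calling it with the default confidence level (a quick check of ours, not in the original post):

FWER(2:5)
# a 4-row tibble: n = 2, 3, 4, 5 and p ≈ 0.050, 0.143, 0.265, 0.401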
Then we apply the FWER function to a set of confidence levels, join the results into a single tibble via the usual reduce, and finally plot the results:
cl <- c(0.9, 0.95, 0.99, 0.995, 0.999)
N <- 2:15
cl %>%
  reduce(\(acc, ci) {
    fwer <- FWER(N, ci) %>% rename(!!paste0("cl-", ci):=p)
    left_join(acc, fwer, by=join_by(n))
  }, .init=tibble(n=N)) %>%
  pivot_longer(-n, names_to = c(NA, "cl"), names_pattern="(cl-)(.*)") %>%
  ggplot(aes(x=n, y=value, color=cl)) +
  geom_point() +
  geom_line() +
  ylim(0, 1) +
  labs(
    x="number of classes",
    y="probability of Type-I Error",
    color="Conf. level")
That’s all, folks!