
I would like to perform pairwise comparisons (using t tests) between each species in the iris dataset to see which species differ significantly in which variables. That is, each pairwise comparison would compare all measurement values of one species in a given variable against all measurement values of another species in the same variable. Listed below are all possible pairwise comparisons with the iris dataset.

data(iris)
setosa.only <- iris[iris$Species == "setosa", ]
versicolor.only <- iris[iris$Species == "versicolor", ]
virginica.only <- iris[iris$Species == "virginica", ]

# setosa vs versicolor
t.test(setosa.only$Sepal.Length, versicolor.only$Sepal.Length)
t.test(setosa.only$Sepal.Width, versicolor.only$Sepal.Width)
t.test(setosa.only$Petal.Length, versicolor.only$Petal.Length)
t.test(setosa.only$Petal.Width, versicolor.only$Petal.Width)

# setosa vs virginica
t.test(setosa.only$Sepal.Length, virginica.only$Sepal.Length)
t.test(setosa.only$Sepal.Width, virginica.only$Sepal.Width)
t.test(setosa.only$Petal.Length, virginica.only$Petal.Length)
t.test(setosa.only$Petal.Width, virginica.only$Petal.Width)

# versicolor vs virginica
t.test(versicolor.only$Sepal.Length, virginica.only$Sepal.Length)
t.test(versicolor.only$Sepal.Width, virginica.only$Sepal.Width)
t.test(versicolor.only$Petal.Length, virginica.only$Petal.Length)
t.test(versicolor.only$Petal.Width, virginica.only$Petal.Width)

Such pairwise comparisons are easy to perform one by one with a small dataset such as iris (which has only 12 possible comparisons), but I would like to apply this to larger datasets with dozens of species and variables (and thus hundreds of possible comparisons). How could I do the above comparisons with a single or a few commands to apply them to larger datasets? With limited knowledge of the R language, I have not been able to figure out how to do this and would be grateful if anyone has suggestions.

In addition, I would like to get an output summarizing all pairwise comparisons. It could be a matrix of TRUE/FALSE values (or something equivalent like 1/0 or Y/N) indicating which species differ significantly in which variables (i.e., TRUE for species pairs whose t test gives p < 0.05). Such a matrix may be difficult to interpret if it contains all species and all variables simultaneously, so it could be one matrix per variable. For example, the desired output matrix resulting from the comparisons of Sepal.Length would be something like:

            setosa   versicolor   virginica
setosa      NA       YES          YES   
versicolor  YES      NA           YES   
virginica   YES      YES          NA   

Alternatively, the output could be an array like the one which returns when calling the code below:

tapply(X = iris$Sepal.Length, INDEX = iris$Species, FUN = summary)
goshawk

1 Answer


We can use combn to generate every pair of unique values in the Species column and then, for each pair, apply t.test to every measurement column of the dataset.

res <- combn(unique(iris$Species), 2, function(x) {
  data1 <- subset(iris, Species == x[1], -Species)
  data2 <- subset(iris, Species == x[2], -Species)
  out <- data.frame(species1 = x[1], species2 = x[2], column = names(data1))
  out$t.test <- Map(t.test, data1, data2)
  out
}, simplify = FALSE)

res is a list of data frames; combine it into one data frame:

res <- dplyr::bind_rows(res)

and then you can extract the p.value from each t.test output.

unname(sapply(res$t.test, `[[`, "p.value"))
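To get from p-values to the per-variable TRUE/FALSE matrix described in the question, something like the following might work. It is only a sketch in base R: the helper name sig_matrix and the alpha threshold are made up for illustration, and it recomputes the default (Welch) t tests directly rather than reusing res.

```r
# Sketch: one logical matrix per variable, TRUE where the t test gives p < alpha.
# `alpha` and `sig_matrix` are illustrative names, not part of any package.
data(iris)
alpha <- 0.05
species <- levels(iris$Species)
variables <- setdiff(names(iris), "Species")

sig_matrix <- function(variable) {
  # Start from an all-NA matrix so the diagonal stays NA
  m <- matrix(NA, nrow = length(species), ncol = length(species),
              dimnames = list(species, species))
  pairs <- combn(species, 2)  # one column per species pair
  for (k in seq_len(ncol(pairs))) {
    a <- pairs[1, k]
    b <- pairs[2, k]
    p <- t.test(iris[iris$Species == a, variable],
                iris[iris$Species == b, variable])$p.value
    m[a, b] <- m[b, a] <- p < alpha  # fill both triangles
  }
  m
}

# Named list with one matrix per measurement column
sig <- lapply(setNames(variables, variables), sig_matrix)
sig$Sepal.Length
```

With hundreds of comparisons it may also be worth adjusting the p-values for multiple testing (for example with p.adjust) before applying the threshold.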
Ronak Shah