
I am interested in performing multiple tests for a single variable with an associated factor that splits the values into multiple groups. It is related to this question, and I would like a solution of that kind, but it is not exactly the same problem.

In my case, I have a single variable and multiple groups (possibly many). Expanding on this example:

library(reshape)   # note: the reshape() call below is stats::reshape from base R

# Create a dataset (no seed is set, so the simulated values will differ between runs)
mu=34
stdv=5
Location=rep(c("Area_A","Area_B","Area_C"),5)
distro=rnorm(length(Location),mu,stdv)
id=seq_along(Location)   # equivalent to seq(1:length(Location))
sample_long=data.frame(id,Location,distro)
sample_long
   id Location   distro
1   1   Area_A 34.95737
2   2   Area_B 31.30298
3   3   Area_C 35.86569
4   4   Area_A 40.45378
5   5   Area_B 36.12060
6   6   Area_C 28.29649
7   7   Area_A 30.64495
8   8   Area_B 29.70668
9   9   Area_C 33.22874
10 10   Area_A 25.29148
11 11   Area_B 32.35511
12 12   Area_C 34.69159
13 13   Area_A 26.89791
14 14   Area_B 35.30717
15 15   Area_C 40.64628

I would like to perform all-against-all tests among Areas, i.e. test(Area_A, Area_B), test(Area_A, Area_C) and test(Area_B, Area_C) (in the general case, all possible i < j pairs).

A simple way to go is to transform the data into wide format:

# Reshape to wide format
sample_wide=reshape(sample_long,direction="wide",idvar="id",timevar="Location")
sample_wide
   id distro.Area_A distro.Area_B distro.Area_C
1   1      34.95737            NA            NA
2   2            NA      31.30298            NA
3   3            NA            NA      35.86569
4   4      40.45378            NA            NA
5   5            NA      36.12060            NA
6   6            NA            NA      28.29649
7   7      30.64495            NA            NA
8   8            NA      29.70668            NA
9   9            NA            NA      33.22874
10 10      25.29148            NA            NA
11 11            NA      32.35511            NA
12 12            NA            NA      34.69159
13 13      26.89791            NA            NA
14 14            NA      35.30717            NA
15 15            NA            NA      40.64628

and then loop over all pairs of columns. I've seen several approaches more R-like than the following one, in which I use for loops:

# Now compute the test for every pair of columns (all i < j), skipping the id column
test.out=list()
k=0
for(i in 2:(ncol(sample_wide)-1)){   # all-against-all variable groups
  for(j in (i+1):ncol(sample_wide)){
    k=k+1
    test.out[[k]]=t.test(sample_wide[,i],
                         sample_wide[,j]) # store results in a list
  }
}
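
Once the loop has run, the p-values, for instance, can be pulled out of the list:

# one p-value per pairwise comparison, in the order the loop created them
sapply(test.out, function(tt) tt$p.value)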

But my question is not about the best solution given the wide format; it is whether the problem can be solved working directly from the original long format, in line with the solutions in the question linked above that use dplyr, broom, etc.

Alf Pascu
  • I guess you can `gather` into 'long' format and then do `pairwise.t.test` – akrun Aug 07 '19 at 13:59
  • @akrun I think they're trying to start with the long format they already have, so no need to gather – camille Aug 07 '19 at 14:07
  • if your test is a `t.test`, what you probably want to do is a Tukey post-hoc test: ```r TukeyHSD(aov(distro ~ Location, data=sample_long)) ``` Then, you don't need to loop nor anything else, just call the function. – ssayols Aug 07 '19 at 15:18
  • Good point @ssayols, but my aim here is more to learn which programming styles are more appropriate for which problems in R; should I perhaps edit the title to remove the emphasis on the test? It was just the way in which I ended up thinking about this problem. – Alf Pascu Aug 07 '19 at 16:56
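
For reference, the pairwise.t.test suggested above runs directly on the long data (a minimal sketch; note that it pools the standard deviation across groups by default, so its results differ slightly from separate t.test calls):

# all pairwise comparisons of distro between Locations, with Holm-adjusted p-values
pairwise.t.test(sample_long$distro, sample_long$Location)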

2 Answers


This is a little trickier and less straightforward than I hoped. You can first figure out the combinations of locations and, to make it a little simpler, save that in a lookup table. I turned that into a long shape with an ID for each pair, which I'll use as a grouping variable on the data.

library(dplyr)
library(tidyr)
library(purrr)

set.seed(111)
# same data creation code

grps <- as.data.frame(t(combn(levels(sample_long$Location), 2))) %>%
  mutate(pair = row_number()) %>%
  gather(key, value = loc, -pair) %>%
  select(-key)

grps
#>   pair    loc
#> 1    1 Area_A
#> 2    2 Area_A
#> 3    3 Area_B
#> 4    1 Area_B
#> 5    2 Area_C
#> 6    3 Area_C

Joining the lookup to the data frame doubles the rows—that will differ depending on how many levels you're combining. Note also I dropped your ID column since it didn't seem necessary right now. Nest, do the t-test, and tidy the results.

sample_long %>%
  select(-id) %>%
  inner_join(grps, by = c("Location" = "loc")) %>%
  group_by(pair) %>%
  nest() %>%
  mutate(t_test = map(data, ~t.test(distro ~ Location, data = .)),
         tidied = map(t_test, broom::tidy)) %>%
  unnest(tidied)
#> # A tibble: 3 x 13
#>    pair data  t_test estimate estimate1 estimate2 statistic p.value
#>   <int> <lis> <list>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>
#> 1     1 <tib… <htes…   -0.921      31.8      32.7    -0.245   0.816
#> 2     2 <tib… <htes…   -1.48       31.8      33.3    -0.383   0.716
#> 3     3 <tib… <htes…   -0.563      32.7      33.3    -0.305   0.769
#> # … with 5 more variables: parameter <dbl>, conf.low <dbl>,
#> #   conf.high <dbl>, method <chr>, alternative <chr>

If you needed to, you could do something to show which locations are in each pair—joining with the lookup table would be one way to do this.
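
For example, saving the unnested result above first (the name `results` and the "vs" label are just illustrative choices):

# attach a readable label to each pair by collapsing its two Locations
pair_labels <- grps %>%
  group_by(pair) %>%
  summarise(comparison = paste(loc, collapse = " vs "))

results %>%                     # `results` = the unnested tibble from the pipeline above
  left_join(pair_labels, by = "pair")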

I also realize you mentioned wanting to use broom functions afterwards, but you didn't say you specifically need a broom::tidy call. If you don't, just drop the last 2 lines.

camille
  • `tidyverse` verbal diarrhea makes *some* problems more approachable/readable/maintainable/etc, but certainly not all. Here's a good example. – ssayols Aug 07 '19 at 15:13
  • @ssayols if you think it's bad, you can downvote. But your point above is a good one that if all they want is to compare t-tests, they're better off with a post-hoc test. – camille Aug 07 '19 at 15:54
  • don't get me wrong, I just meant the code above is a good example of the Law of instrument ("if all you have is a hammer, everything looks like a nail"). – ssayols Aug 08 '19 at 07:50

A little bit of base R will do the trick:

combn(x=unique(sample_long$Location), m=2, simplify=FALSE,
      FUN=function(l) { 
        t.test(distro ~ Location, data=subset(sample_long, Location %in% l))
      })

`combn` will generate all combinations of the elements of `x` taken `m` at a time (sic). Combined with `subset`, you apply your test to subsets of your data.frame.
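
If you also want to see which pair each result belongs to, one option is to name the list and extract the p-values afterwards (a small sketch; `res` and the "vs" labels are just illustrative choices):

# run the pairwise tests and keep them in a named list
res <- combn(x=unique(sample_long$Location), m=2, simplify=FALSE,
             FUN=function(l) {
               t.test(distro ~ Location, data=subset(sample_long, Location %in% l))
             })
names(res) <- combn(unique(sample_long$Location), m=2, FUN=paste, collapse=" vs ")
sapply(res, function(tt) tt$p.value)   # named vector of p-values, one per pair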

ssayols