
Is there a way in dplyr to compare groups with each other? Here is a concrete example: I would like to apply a t-test to each of the following combinations: a vs b, a vs c, and b vs c.

library(tidyverse)

set.seed(1)
df <- tibble(value = c(rnorm(1000, 1, 1), rnorm(1000, 5, 1), rnorm(1000, 10, 1)),
             group = rep(c("a", "b", "c"), each = 1000)) %>%
  nest(data = value)  # tidyr >= 1.0 syntax; older versions used nest(value)
df

# A tibble: 3 x 2
  group data                
  <chr> <list>              
1 a     <tibble [1,000 × 1]>
2 b     <tibble [1,000 × 1]>
3 c     <tibble [1,000 × 1]>

If dplyr provides no solution, I would also be happy with other approaches, maybe data.table?

zx8754
MrNetherlands
  • `dplyr` supports looking at one group at a time; it is possible to make another structure (a DOE, of sorts) to support comparing different groups in data, but it is only using `dplyr` tangentially. My usual "go-to" for doing group-wise comparisons is to break it up with something like `spl <- split(mtcars, mtcars$cyl)`, and then externally managing comparisons based on the indices (levels of the variable), such as `t.test(spl[["4"]]$disp, spl[["8"]]$disp)`. – r2evans Mar 19 '18 at 16:03
  • Possible duplicate of https://stackoverflow.com/questions/33856920/apply-a-function-over-all-combinations-of-a-list-of-vectors-r – zx8754 Mar 19 '18 at 16:08
  • @r2evans Could you explain in more detail what you mean by "DOE of sorts"? – MrNetherlands Mar 19 '18 at 19:15
  • `expand.grid(a=1:3, b=1:3)` gives a full factorial expansion of comparing three models; since this includes self-comparison, we can filter out where `a==b`, but it's still a factorial design. For each row of this frame, compare `model[[a]]` with `model[[b]]`. DOEs ([designs of experiments](https://en.wikipedia.org/wiki/Design_of_experiments)) often try to reduce the total number of experiments necessary (`n`) in order to isolate relationships between the `k` different factors, but reduction can only be done with numeric changes; categorical factors (such as each model) must be categorical. – r2evans Mar 19 '18 at 19:55
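
The split-and-index idea from the comments above can be sketched roughly as follows, here applied to the question's simulated data rather than `mtcars`. The names `long`, `spl`, `pairs`, and `tests` are illustrative and do not come from the thread. The filter uses `a < b` rather than `a != b`, which is an assumption on my part: it drops mirrored pairs as well as self-comparisons, so each combination is tested once.

```r
# Simulated data as in the question, kept in long form
# (assumption: the comments use mtcars, not this data)
set.seed(1)
long <- data.frame(
  value = c(rnorm(1000, 1, 1), rnorm(1000, 5, 1), rnorm(1000, 10, 1)),
  group = rep(c("a", "b", "c"), each = 1000),
  stringsAsFactors = FALSE
)

# One numeric vector per group
spl <- split(long$value, long$group)

# Full factorial of group names, then drop self-comparisons and mirrored pairs
pairs <- expand.grid(a = names(spl), b = names(spl), stringsAsFactors = FALSE)
pairs <- pairs[pairs$a < pairs$b, ]

# One Welch t-test per remaining row of the design
tests <- Map(function(a, b) t.test(spl[[a]], spl[[b]]), pairs$a, pairs$b)
names(tests) <- paste(pairs$a, "vs", pairs$b)

# p-values, one per comparison
sapply(tests, function(tt) tt$p.value)
```

Indexing `spl` by name keeps the comparison logic independent of dplyr, which is the point made in the first comment.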

1 Answer


Here's a base-R / tidyverse approach (which is somewhat manual, but imho ok for this task):

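library(magrittr)  # provides the %$% exposition pipe

# Run a Welch t-test for every pair of groups: a vs b, a vs c, b vs c
combn(df$group, 2, FUN = function(g) 
  t.test(filter(df, group == g[1]) %>% unnest %$% value, 
         filter(df, group == g[2]) %>% unnest %$% value), 
  simplify = FALSE)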

# [[1]]
# 
# Welch Two Sample t-test
# 
# data:  filter(df, group == g[1]) %>% unnest %$% value and filter(df, group == g[2]) %>% unnest %$% value
# t = -86.114, df = 1998, p-value < 2.2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#   -4.086376 -3.904396
# sample estimates:
#   mean of x mean of y 
# 0.9883519 4.9837381 
# 
# 
# [[2]]
# 
# Welch Two Sample t-test
# 
# data:  filter(df, group == g[1]) %>% unnest %$% value and filter(df, group == g[2]) %>% unnest %$% value
# t = -195.4, df = 1998, p-value < 2.2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#   -9.117558 -8.936356
# sample estimates:
#   mean of x  mean of y 
# 0.9883519 10.0153090 
# 
# 
# [[3]]
# 
# Welch Two Sample t-test
# 
# data:  filter(df, group == g[1]) %>% unnest %$% value and filter(df, group == g[2]) %>% unnest %$% value
# t = -108.65, df = 1997.9, p-value < 2.2e-16
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
#   -5.122395 -4.940747
# sample estimates:
#   mean of x mean of y 
# 4.983738 10.015309 
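
If one row per comparison is easier to work with than a list of printed tests, the same `combn()` idea can return a small data frame per pair. This is only a sketch in base R, assuming the data are kept in long (un-nested) form; `long`, `results`, and the column names `group1`, `group2`, and `dof` are made up for illustration and are not part of the answer above.

```r
# Simulated data as in the question, in long form (assumption: not nested)
set.seed(1)
long <- data.frame(
  value = c(rnorm(1000, 1, 1), rnorm(1000, 5, 1), rnorm(1000, 10, 1)),
  group = rep(c("a", "b", "c"), each = 1000),
  stringsAsFactors = FALSE
)

# For each pair of group labels, run a Welch t-test and keep the key numbers
results <- do.call(rbind, combn(unique(long$group), 2, FUN = function(g) {
  tt <- t.test(long$value[long$group == g[1]],
               long$value[long$group == g[2]])
  data.frame(group1 = g[1], group2 = g[2],
             statistic = unname(tt$statistic),
             dof = unname(tt$parameter),
             p.value = tt$p.value,
             stringsAsFactors = FALSE)
}, simplify = FALSE))
results

# Base R also has a built-in for all pairwise tests; pool.sd = FALSE gives
# Welch-type tests, and p-values are adjusted for multiple comparisons
pairwise.t.test(long$value, long$group, pool.sd = FALSE,
                p.adjust.method = "holm")
```

`pairwise.t.test()` reports only a matrix of adjusted p-values, so the `combn()` version is the one to use when the test statistics or degrees of freedom are needed as well.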
talat