dplyr summarise multiple columns using t.test

Question

Is it possible somehow to do a t.test over multiple variables against the same categorical variable without going through a reshaping of the dataset as follows?

data(mtcars)
library(dplyr)
library(tidyr)
j <- mtcars %>% gather(var, val, disp:qsec)
t <- j %>% group_by(var) %>% do(te = t.test(val ~ vs, data = .))

t %>% summarise(p = te$p.value)

I've tried using

mtcars %>% summarise_each_(funs = (t.test(. ~ vs))$p.value, vars = disp:qsec)

but it throws an error.

Bonus: How can t %>% summarise(p = te$p.value) also include the name of the grouping variable?

This may be a partial solution (void of summarise portion) by data.table : (step1) library(data.table) (step2) setDT(j) (Step3) j[, te := t.test(value~vs), by=variable][] — KFB, Oct 08 '14 at 01:38

jazzurro · Accepted Answer · 2014-10-12T02:48:20.547

19

After the discussion with @aosmith and @Misha, here is one approach. As @aosmith wrote in the comments, you want to do the following.

mtcars %>%
    summarise_each(funs(t.test(.[vs == 0], .[vs == 1])$p.value), vars = disp:qsec)

#         vars1        vars2      vars3        vars4        vars5
#1 2.476526e-06 1.819806e-06 0.01285342 0.0007281397 3.522404e-06

vs is either 0 or 1 (the group). If you want to run a t-test between the two groups for a variable (e.g., disp), it seems that you need to split the column by group, as @aosmith suggested. Thank you for the contribution.
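For readers on a current dplyr, where `summarise_each()` and `funs()` are deprecated, the same idea can be sketched with `across()`. This is a minimal sketch that assumes dplyr 1.0.0 or later; `p_values` is just an illustrative name.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0, where across() is available

# For each column from disp to qsec, split the values by vs and
# run a two-sample t-test, keeping only the p-value.
p_values <- mtcars %>%
  summarise(across(disp:qsec,
                   ~ t.test(.x[vs == 0], .x[vs == 1])$p.value))

p_values
```

With `across()` the result columns keep the original variable names (disp, hp, drat, wt, qsec), so no renaming step should be needed.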

What I originally suggested works in another situation, in which you simply compare two columns. Here is sample data and code.

foo <- data.frame(country = "Iceland",
                  year = 2014,
                  id = 1:30,
                  A = sample.int(1e5, 30, replace = TRUE),
                  B = sample.int(1e5, 30, replace = TRUE),
                  C = sample.int(1e5, 30, replace = TRUE),
                  stringsAsFactors = FALSE)

If you want to run t-tests for the A-C and B-C combinations, the following would be one way.

foo2 <- foo %>%
        summarise_each(funs(t.test(., C, paired = TRUE)$p.value), vars = A:B)

names(foo2) <- colnames(foo[4:5])

#          A         B
#1 0.2937979 0.5316822

edited Oct 12 '14 at 02:48

answered Oct 08 '14 at 01:17

jazzurro

Those p-values don't look quite right to me. If using `t.test` without a formula, `x` and `y` should be vectors of the response from each group. Try something like `summarise_each(funs(t.test(.[vs == 0], .[vs == 1])$p.value), vars = disp:qsec)` – aosmith Oct 08 '14 at 15:27
@aosmith Hey, thanks for this. I did not look into mtcars carefully. Now I realize that vs is a binary variable (group). My apologies. Your suggestion is the right way, since you compare the two groups for each variable. Why don't you leave it as an answer? I'll drop mine. – jazzurro Oct 08 '14 at 15:59
`foo <- mtcars %>% summarise_each(funs(t.test(.[vs == 0], .[vs == 1])$p.value), vars = disp:qsec)` gives `Error in summarise_impl(.data, dots) : could not find function "t.test"`. What dplyr version are you using? I'm on 0.3 – Misha Oct 08 '14 at 16:07
@Misha You may need to have `dplyr` 0.3. aosmith's code is working on my machine. – jazzurro Oct 08 '14 at 16:09
It worked when I updated to 0.3.0.9000. Thanks. Although I don't find this particular aspect of dplyr very intuitive. – Misha Oct 08 '14 at 16:22
@aosmith - Are you able to make it work using formula in t.test? : mtcars %>% summarise_each(funs(t.test(.~vs)$p.value), vars = disp:qsec) - it does not work for me. – Misha Oct 08 '14 at 16:29
@Misha This is beyond my knowledge. But this is what I intuitively think. Your way is like asking R to subset data using `vs`; R uses two columns. But, that may be something `summarise_each` does not like given it focuses on one column only. Then, I think we want to help R by dividing a column with itself. – jazzurro Oct 08 '14 at 16:41
@Misha Using the formula eludes me, as well. @jazzurro, don't hesitate to update your answer with the `t.test` correction. – aosmith Oct 08 '14 at 16:48
@aosmith Thank you very much for your contribution. Now I updated our answer. – jazzurro Oct 10 '14 at 03:31
Interestingly I haven’t found a way of doing the same for a single column. That is, the above code works (even for a single column) but all my attempts to use `summarize` instead of `summarize_each` have failed. In particular, using `filter` inside `summarize` doesn’t seem to work. – Konrad Rudolph Sep 28 '15 at 13:03
@KonradRudolph Thank you for your comment. I have been wondering what code you might have run. I think `mtcars %>% summarise(out = t.test(disp[vs == 0], disp[vs == 1])$p.value)` works even if you focus on one column; you can still split the column into two pieces. But you probably tried something like `mtcars %>% summarise(out = t.test(filter(disp, vs == 0), filter(disp, vs == 1))$p.value)` and found that the code returned an error. Is that right? – jazzurro Sep 29 '15 at 00:55
@KonradRudolph My alternative was to use `subset`. `mtcars %>% summarise(out = t.test(subset(disp, vs == 0), subset(disp, vs == 1))$p.value)` – jazzurro Sep 29 '15 at 00:56
@jazzurro Something like that — I'm working on grouped data here (`mtcars %>% group_by(am) %>% summarize(t.test(.[vs == 0]$mpg, .[vs == 1]$mpg)$p.value)`), so I cannot just subset the original dataset variable. Incidentally, I’m also still not clear why using `filter` instead of subsetting in the above code doesn’t work. – Konrad Rudolph Sep 29 '15 at 06:44
@KonradRudolph Hi again. I ran your code above, but I received an error message. I stuck to `subset` again and wrote the following. Is this something you are after? `mtcars %>% group_by(am) %>% summarize(t.test(subset(mpg, vs == 0), subset(mpg, vs == 1))$p.value)` Please let me know if you need more. I am happy to help and think together. – jazzurro Sep 29 '15 at 07:33
@KonradRudolph The following returns an error message. But, I think that is not so far off from what you are looking for. `mtcars %>% group_by(am) %>% summarize(t.test(filter(.,vs == 0)[1], filter(.,vs == 1)[1])$p.value)` – jazzurro Sep 29 '15 at 07:38
@jazzurro Well, as I said, I hadn’t actually found a way of doing this, hence the error message. The way using `subset` works, and that does indeed solve my problem. However, I’d have preferred using `filter` here, as your last comment shows, and that, as you’ve seen yourself, still doesn’t work. – Konrad Rudolph Sep 29 '15 at 13:30
@KonradRudolph Yep. I understand your point. Not sure why `filter` does not work. `filter(mtcars, vs == 0)[1]` works. So, the best guess is to write `mtcars %>% group_by(am) %>% summarize(out = t.test(filter(vs == 0)[1], filter(vs == 1)[1])$p.value)` or `mtcars %>% group_by(am) %>% summarize(out = t.test(filter(.,vs == 0)[1], filter(.,vs == 1)[1])$p.value)`. The former returns `Error: no applicable method for 'filter_' applied to an object of class "logical"` and the latter returns `Error: incorrect length (19), expecting: 13`. – jazzurro Sep 29 '15 at 13:57
@KonradRudolph I cannot provide the reason why. But I think there seems to be something which needs to be changed to make `filter` work here. – jazzurro Sep 29 '15 at 13:58
@KonradRudolph One more thing for you. Without `group_by`, the following is working for me. `mtcars %>% summarize(out = t.test(filter(mtcars,vs == 0)[1], filter(mtcars,vs == 1)[1])$p.value)` – jazzurro Sep 29 '15 at 14:17
@jazzurro Yes, without `group_by` there’s no problem. – Konrad Rudolph Sep 29 '15 at 14:24
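A note on the thread above: `filter()` operates on data frames, whereas inside `summarise()` a column name such as `mpg` refers to a plain vector holding only the current group's rows, and `.` in a magrittr pipe refers to the whole piped data frame rather than the current group. That is a likely reason the `filter(., ...)` attempts fail on grouped data. A minimal sketch of the grouped case using ordinary vector indexing, assuming a current dplyr (`grouped_p` is an illustrative name):

```r
library(dplyr)

# Within each am group, mpg and vs are the group's own vectors,
# so logical indexing splits mpg by vs without touching the full data.
grouped_p <- mtcars %>%
  group_by(am) %>%
  summarise(p = t.test(mpg[vs == 0], mpg[vs == 1])$p.value)

grouped_p
```

This is equivalent in spirit to the `subset()` version in the comments; indexing with `[` simply avoids the extra function call.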

score 14 · Answer 2 · answered Mar 21 '17 at 16:42

I like the following solution using the powerful "broom" package:

library("dplyr")
library("broom")

your_db %>%
  group_by(grouping_variable1, grouping_variable2 ...) %>%
  do(tidy(t.test(variable_u_want_2_test ~ dichotomous_grouping_var, data = .)))
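Applied to the mtcars example from the question, the template above might look like the following. This is a sketch, not the answerer's own code: it assumes tidyr 1.0.0 or later for `pivot_longer()`, and the names `var`, `val`, and `tidy_tests` are illustrative choices.

```r
library(dplyr)
library(tidyr)  # assumption: tidyr >= 1.0.0 for pivot_longer()
library(broom)

# Reshape the tested columns to long format, then run one t-test
# per variable and flatten each result into a one-row data frame.
tidy_tests <- mtcars %>%
  pivot_longer(disp:qsec, names_to = "var", values_to = "val") %>%
  group_by(var) %>%
  do(tidy(t.test(val ~ vs, data = .))) %>%
  ungroup()

# Keep the variable name, the two group means, and the p-value.
tidy_tests %>% select(var, estimate1, estimate2, p.value)
```

Because `var` is the grouping column, it is carried into the output, which also addresses the bonus part of the question about keeping the variable name next to each p-value.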

answered Mar 21 '17 at 16:42

carfisma

Here is a working example within the tidyverse: https://stats.stackexchange.com/questions/168378/applying-two-sample-t-test-comparing-multiple-groups-in-two-categories – Irakli Jul 23 '17 at 02:56

score 6 · Answer 3 · answered May 12 '15 at 02:03

Realizing that the question is fairly old, here is another answer for the reference of future generations.

This is more general than the accepted answer since it allows for dynamically generated variable names rather than hard-coded.

vars_to_test <- c("disp","hp","drat","wt","qsec")
iv <- "vs"

mtcars %>%
  summarise_each_(
    funs_( 
      sprintf("stats::t.test(.[%s == 0], .[%s == 1])$p.value",iv,iv)
    ), 
    vars = vars_to_test)

which produces this:

          disp           hp       drat           wt         qsec
1 2.476526e-06 1.819806e-06 0.01285342 0.0007281397 3.522404e-06

The idea of this solution is to use SE versions of dplyr functions (summarise_each_ and funs_) instead of NSE versions (summarise_each and funs). For more information about Standard Evaluation (SE) and Non-Standard Evaluation (NSE), please check vignette("nse").

Thanks for the solution! It works for me. However, I get two warning messages: 1: `summarise_each() is deprecated. Please use summarise_if(), summarise_at(), or summarise_all() instead: - To map "funs" over all variables, use summarise_all() - To map "funs" over a selection of variables, use summarise_at()` and 2: `funs_() is deprecated. Please use list() instead`. Is there an updated version of this code? Second question: is there a way to replace the "1" (first character of the second row) with the name of the group (i.e., "vs" in this case)? Thanks for your help! – B_slash_, Mar 03 '20 at 10:06
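Regarding the deprecation warnings raised in the comment above, one possible update is sketched below. It assumes dplyr 1.0.0 or later and an ungrouped data frame; `grp` and `res` are illustrative names, and the grouping vector is pulled out of the data first so that no string-built expression is needed.

```r
library(dplyr)  # assumption: dplyr >= 1.0.0

vars_to_test <- c("disp", "hp", "drat", "wt", "qsec")
iv <- "vs"

# Extract the grouping vector by name; this only lines up with .x
# because the data frame is not grouped.
grp <- mtcars[[iv]]

res <- mtcars %>%
  summarise(across(all_of(vars_to_test),
                   ~ t.test(.x[grp == 0], .x[grp == 1])$p.value))

# The "1" in the printed output is the row name of a one-row data frame;
# it can be replaced with the name of the grouping variable.
rownames(res) <- iv
res
```

`all_of()` selects columns from a character vector, which covers the "dynamically generated variable names" use case without `summarise_each_()` or `funs_()`.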

score 2 · Answer 4 · edited May 01 '16 at 16:33

So I ended up hacking together a new function, where df is the data frame, by_var is the right-hand side of the formula, and ... holds all variables for the left-hand side of the formula (dplyr/tidyr select syntax).

e.g: mult_t.test(mtcars,vs,disp:qsec)

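mult_t.test <- function(df, by_var, ...) {
  require(dplyr)
  require(tidyr)
  # capture the unquoted grouping column as a string
  by_var <- deparse(substitute(by_var))
  # reshape the tested columns to long format: one row per (variable, value)
  long <- df %>% gather(var, val, ...)
  # run one t-test per variable; each result is stored in list column v
  tests <- long %>% group_by(var) %>% do(v = tes(., by_var))
  # one row per variable: the two group means followed by the p-value
  # (var is taken directly, since gather() may return it as character)
  k <- data.frame(as.character(tests$var),
                  matrix(unlist(tests$v), ncol = 3, byrow = TRUE))
  names(k) <- c("var", names(tests$v[[1]]))
  k
}


# t-test of val against the grouping column; returns group means and p-value
tes <- function(df, vart) {
  x <- t.test(df$val ~ df[[vart]])
  c(x$estimate, p.val = x$p.value)
}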
