7

I tried to run an independent t-test on many columns of a data frame. For example, I created this data frame:

set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)

To run the tests, I used with(df, t.test(y ~ group)):

with(test_data, t.test(a ~ grp))
with(test_data, t.test(b ~ grp))
with(test_data, t.test(c ~ grp))

I would like the output to look like this:

mean in group m mean in group y  p-value
9.747412        9.878820         0.6944
15.12936        16.49533         0.07798 
20.39531        20.20168         0.9027

I wonder how I can achieve this using (1) a for loop, (2) apply(), or (3) perhaps dplyr.

This link, R: t-test over all columns, is related, but it is 6 years old. Perhaps there are better ways to do the same thing now.

KIM

5 Answers

8

Use select_if to select only the numeric columns, then use purrr::map_df to apply t.test against grp. Finally, use broom::tidy to get the results in tidy format:

library(tidyverse)

res <- test_data %>% 
  select_if(is.numeric) %>%
  map_df(~ broom::tidy(t.test(. ~ grp)), .id = 'var')
res
#> # A tibble: 3 x 11
#>   var   estimate estimate1 estimate2 statistic p.value parameter conf.low
#>   <chr>    <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
#> 1 a       -0.259      9.78      10.0    -0.587   0.565      16.2    -1.19
#> 2 b        0.154     15.0       14.8     0.169   0.868      15.4    -1.78
#> 3 c       -0.359     20.4       20.7    -0.287   0.778      16.5    -3.00
#> # ... with 3 more variables: conf.high <dbl>, method <chr>,
#> #   alternative <chr>

Created on 2019-03-15 by the reprex package (v0.2.1.9000)
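
If only the three columns from the question are wanted, the tidy result can be trimmed afterwards. The following is a minimal sketch, assuming dplyr, purrr and broom are installed; it writes the lambda with `.x` instead of `.`, and the names `res_small`, `mean_m` and `mean_y` are illustrative, not part of broom's output.

```r
library(dplyr)
library(purrr)

# Same data as in the question. grp is also left in the global environment,
# which the formula `.x ~ grp` relies on.
set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c("m", "y"), 10)
test_data <- data.frame(a, b, c, grp)

res <- test_data %>%
  select_if(is.numeric) %>%
  map_df(~ broom::tidy(t.test(.x ~ grp)), .id = "var")

# Keep only the two group means and the p-value, renamed for readability
# (mean_m and mean_y are hypothetical names chosen for this sketch).
res_small <- res %>%
  select(var, mean_m = estimate1, mean_y = estimate2, p.value)
res_small
```

Here estimate1 and estimate2 are the means of the first and second level of grp, which are "m" and "y" in this data.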

Tung
  • This looks very efficient, however I am getting the following error: `Error in eval(predvars, data, env) : object 'group' not found`. I thought this might be because `group` is not numeric, but I see here that `grp` is not numeric either. @Tung Any idea why this is? I have a dataset with 1 group column and 12 numeric columns, which seems similar to the example above. – Wilkit Apr 22 '21 at 12:57
  • Can you make a separate question with reproducible code and data then post the link here? [How to make a great R reproducible example?](http://stackoverflow.com/questions/5963269) – Tung Apr 23 '21 at 04:17
  • This is late, but I encountered the same issue, any chance that you have come across a solution? @Wilkit – Jia Gao Apr 09 '22 at 10:41
  • @JasonGoal: Please post a separate question and put the link here – Tung Apr 09 '22 at 20:43
  • @Wilkit, you need to define `grp` separately, it's not in the `dataframe`, see my answer below. – Jia Gao May 09 '22 at 00:31
4

Simply extract the estimate and p-value results from t.test call while iterating through all needed columns with sapply. Build formulas from a character vector and transpose with t() for output:

formulas <- paste(names(test_data)[1:(ncol(test_data)-1)], "~ grp")

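output <- t(sapply(formulas, function(f) {
  # Pass data explicitly so the columns are found inside test_data
  res <- t.test(as.formula(f), data = test_data)
  # Keep both group means and the p-value
  c(res$estimate, p.value=res$p.value)
}))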

Input data (seeded for reproducibility)

set.seed(333)
a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)

Output result

#         mean in group m mean in group y   p.value
# a ~ grp        9.775477        10.03419 0.5654353
# b ~ grp       14.972888        14.81895 0.8678149
# c ~ grp       20.383679        20.74238 0.7776188
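
As a variation on the same idea, the formulas can be built with base R's `reformulate()` instead of pasting strings. This is only a sketch under the assumption that the grouping column is named `grp` and every other column is a response; `vars` is a helper name introduced here.

```r
set.seed(333)
test_data <- data.frame(
  a = rnorm(20, 10, 1),
  b = rnorm(20, 15, 2),
  c = rnorm(20, 20, 3),
  grp = rep(c("m", "y"), 10)
)

# Every column except the grouping column is treated as a response
vars <- setdiff(names(test_data), "grp")

output <- t(sapply(vars, function(v) {
  # reformulate("grp", response = v) builds the formula  v ~ grp
  res <- t.test(reformulate("grp", response = v), data = test_data)
  c(res$estimate, p.value = res$p.value)
}))
output
```

The row names of `output` are then the plain column names rather than the formula strings.
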
Parfait
  • Surprisingly, it works, see my comment to the answer by @DaWassi. Apparently, when indexing, R does not consider the index value `0`. But more correct would be `names(test_data)[1:(ncol(test_data)-1)]`. – Rui Barradas Feb 21 '18 at 15:00
2

As you asked for a for loop:

  a <- rnorm(20, 10, 1)
  b <- rnorm(20, 15, 2)
  c <- rnorm(20, 20, 3)
  grp <- rep(c('m', 'y'),10)
  test_data <- data.frame(a, b, c, grp)  

  meanM=NULL
  meanY=NULL
  p.value=NULL

  for (i in 1:(ncol(test_data)-1)){
    meanM=as.data.frame(rbind(meanM, t.test(test_data[,i] ~ grp)$estimate[1]))
    meanY=as.data.frame(rbind(meanY, t.test(test_data[,i] ~ grp)$estimate[2]))
    p.value=as.data.frame(rbind(p.value, t.test(test_data[,i] ~ grp)$p.value))
   }

  cbind(meanM, meanY, p.value)

It works, but I am a beginner in R, so there may be a more efficient solution.
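
One possible way to tighten the loop is to run t.test only once per column and to fill a preallocated data frame instead of growing three objects with rbind. The sketch below assumes the grouping column is called `grp`; `num_cols`, `out`, `mean_m` and `mean_y` are names made up for this example.

```r
set.seed(333)
test_data <- data.frame(
  a = rnorm(20, 10, 1),
  b = rnorm(20, 15, 2),
  c = rnorm(20, 20, 3),
  grp = rep(c("m", "y"), 10)
)

# Names of the numeric columns to test
num_cols <- names(test_data)[sapply(test_data, is.numeric)]

# Preallocate the result instead of growing it inside the loop
out <- data.frame(
  variable = num_cols,
  mean_m   = NA_real_,
  mean_y   = NA_real_,
  p.value  = NA_real_
)

for (i in seq_along(num_cols)) {
  # Run the test once per column and reuse the result
  res <- t.test(test_data[[num_cols[i]]] ~ test_data$grp)
  out$mean_m[i]  <- res$estimate[1]  # mean of group "m"
  out$mean_y[i]  <- res$estimate[2]  # mean of group "y"
  out$p.value[i] <- res$p.value
}
out
```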

DaWassi
1

Using lapply this is rather easy.
I have tested the code with set.seed(7060) before creating the dataset, in order to make the results reproducible.

tests_list <- lapply(letters[1:3], function(x) t.test(as.formula(paste0(x, "~ grp")), data = test_data))

result <- do.call(rbind, lapply(tests_list, `[[`, "estimate"))
pval <- sapply(tests_list, `[[`, "p.value")
result <- cbind(result, p.value = pval)

result
#     mean in group m mean in group y   p.value
#[1,]        9.909818        9.658813 0.6167742
#[2,]       14.578926       14.168816 0.6462151
#[3,]       20.682587       19.299133 0.2735725

Note that a real life application would use names(test_data)[1:3], not letters[1:3], in the first lapply instruction.
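
A sketch of that names-based variant follows. It assumes the numeric columns are the ones to test and keeps the full test objects in a named list, so other components (for example the confidence interval) stay available; `num_cols` is a helper name introduced here.

```r
set.seed(7060)
test_data <- data.frame(
  a = rnorm(20, 10, 1),
  b = rnorm(20, 15, 2),
  c = rnorm(20, 20, 3),
  grp = rep(c("m", "y"), 10)
)

num_cols <- names(test_data)[sapply(test_data, is.numeric)]

# simplify = FALSE makes sapply return a list named after num_cols
tests_list <- sapply(num_cols, function(x) {
  t.test(as.formula(paste(x, "~ grp")), data = test_data)
}, simplify = FALSE)

# The list names become the row names of the result
result <- cbind(
  do.call(rbind, lapply(tests_list, `[[`, "estimate")),
  p.value = sapply(tests_list, `[[`, "p.value")
)
result

# Any other component can still be pulled from the stored tests
tests_list$b$conf.int
```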

Rui Barradas
  • Nearly my same answer but with multiple *apply* calls. When iterating off same object, *tests_list*, do all under same method. – Parfait Feb 21 '18 at 15:24
  • @Parfait Yes, my idea was to first create an object holding all t.test results and then use that object to extract what is needed. – Rui Barradas Feb 21 '18 at 17:23
0

This should be a comment rather than an answer, but I'll make it an answer. The accepted answer is excellent, but it has one caveat that may cost others hours, as it did me. Here is the original data posted by the OP:

a <- rnorm(20, 10, 1)
b <- rnorm(20, 15, 2)
c <- rnorm(20, 20, 3)
grp <- rep(c('m', 'y'),10)
test_data <- data.frame(a, b, c, grp)

And here is the answer provided by @Tung:

library(tidyverse)

res <- test_data %>% 
  select_if(is.numeric) %>%
  map_df(~ broom::tidy(t.test(. ~ grp)), .id = 'var')
res

The problem, or more accurately the caveat, of this answer is that grp must be defined as a separate variable outside the data frame. The formula . ~ grp looks grp up in the calling environment, because select_if(is.numeric) has already dropped it from the data. As far as I know, keeping the group variable outside the data frame is not common practice. So, even though the answer is neat, it is worth pointing out this step (defining the group variable outside the data frame). I am therefore posting this comment-like answer in the hope of saving latecomers some time.
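
One way around the caveat, sketched here under the assumption that dplyr (1.0.0 or later, for `where()`), purrr and broom are installed, is to reference the grouping column through the data frame inside the formula. In this example grp exists only as a column of test_data.

```r
library(dplyr)
library(purrr)

set.seed(333)
test_data <- data.frame(
  a = rnorm(20, 10, 1),
  b = rnorm(20, 15, 2),
  c = rnorm(20, 20, 3),
  grp = rep(c("m", "y"), 10)  # grp lives only inside the data frame
)

# test_data$grp is looked up through the data frame,
# so no separate `grp` variable is required
res <- test_data %>%
  select(where(is.numeric)) %>%
  map_df(~ broom::tidy(t.test(.x ~ test_data$grp)), .id = "var")
res
```

With a grouping column named differently, for example `group`, the right-hand side of the formula would change accordingly (`test_data$group`).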

Jia Gao