What's the fastest way to apply t.test to each column of a large matrix?

Question

Suppose I have a large matrix:

M <- matrix(rnorm(1e7),nrow=20)

Suppose further that each column represents a sample and that I would like to apply t.test() to each column. Is there a way to do this that is much faster than using apply()?

apply(M, 2, t.test)

It took slightly less than 2 minutes to run the analysis on my computer:

> system.time(invisible( apply(M, 2, t.test)))
   user  system elapsed
113.513   0.663 113.519

`apply` is a very flexible function and thus includes a lot of machinery you don't need in any particular case. Coding the same logic manually with a `for` loop will probably give some performance increase. — ffriend, Jul 12 '12 at 21:49
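The suggestion in this comment is easy to check before relying on it. The following is a minimal sketch, not part of the original thread: it uses a much smaller matrix than the question (an assumption, to keep the run short) and compares `apply` with a preallocated `for` loop. Whether the loop is actually faster is something to measure on your own machine, since most of the time is likely spent inside `t.test` itself.

```r
set.seed(42)
# Assumption: 1000 columns instead of 5e5, so the comparison runs quickly.
M <- matrix(rnorm(20 * 1000), nrow = 20)

# Version 1: apply() over columns, keeping only the t statistic.
time_apply <- system.time(
  t_apply <- apply(M, 2, function(x) t.test(x)$statistic)
)

# Version 2: explicit for loop writing into a preallocated vector.
time_loop <- system.time({
  t_loop <- numeric(ncol(M))
  for (j in seq_len(ncol(M))) {
    t_loop[j] <- t.test(M[, j])$statistic
  }
})

# Both versions should produce the same statistics.
all.equal(unname(t_apply), t_loop)

# Compare elapsed times; the difference may well be small.
rbind(apply = time_apply, loop = time_loop)
```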

score 9 · Answer 1 · answered Jul 13 '12 at 07:55

You can do better than this with the colttests function from the genefilter package (on Bioconductor).

> library(genefilter)
> M <- matrix(rnorm(40),nrow=20)
> my.t.test <- function(c){
+   n <- sqrt(length(c))
+   mean(c)*n/sd(c)
+ }
> x1 <- apply(M, 2, function(c) my.t.test(c))
> x2 <- colttests(M, gl(1, nrow(M)))[,"statistic"]
> all.equal(x1, x2)
[1] TRUE
> M <- matrix(rnorm(1e7), nrow=20)
> system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
   user  system elapsed 
 27.386   0.004  27.445 
> system.time(invisible(colttests(M, gl(1, nrow(M)))[,"statistic"]))
   user  system elapsed 
  0.412   0.000   0.414

Ref: "Computing thousands of test statistics simultaneously in R", SCGN, Vol 18 (1), 2007, http://stat-computing.org/newsletter/issues/scgn-18-1.pdf.

Ryogi · Accepted Answer · 2012-07-13T05:29:14.890

If you have a multicore machine there are some gains from using all the cores, for example using mclapply.

> library(multicore)  # superseded by the 'parallel' package shipped with R, which also provides mclapply
> M <- matrix(rnorm(40),nrow=20)
> x1 <- apply(M, 2, t.test)
> x2 <- mclapply(1:dim(M)[2], function(i) t.test(M[,i]))
> all.equal(x1, x2)
[1] "Component 1: Component 9: 1 string mismatch" "Component 2: Component 9: 1 string mismatch"
# str(x1) and str(x2) show that the difference is immaterial

This mini-example shows that things go as we planned. Now scale up:

> M <- matrix(rnorm(1e7), nrow=20)
> system.time(invisible(apply(M, 2, t.test)))
   user  system elapsed 
101.346   0.626 101.859
> system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i]))))
   user  system elapsed
 55.049   2.527  43.668

This is using 8 virtual cores. Your mileage may vary. Not a huge gain, but it comes from very little effort.
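The number of worker processes can also be set explicitly. The sketch below is an illustration rather than the code that produced the timings above: it assumes the `parallel` package (where `mclapply` defaults to `getOption("mc.cores", 2L)` workers), uses a smaller matrix, and the choice of "all cores but one" is an arbitrary assumption.

```r
library(parallel)  # ships with R; provides mclapply() and detectCores()

set.seed(1)
# Assumption: a small matrix so that the example finishes quickly.
M <- matrix(rnorm(20 * 2000), nrow = 20)

# Assumption: leave one core free; fall back to 1 if the count is unknown.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L
n_cores <- max(1L, n_cores - 1L)
# mclapply() relies on forking, which is unavailable on Windows.
if (.Platform$OS.type == "windows") n_cores <- 1L

# Parallel version: one t statistic per column index.
t_par <- unlist(mclapply(seq_len(ncol(M)),
                         function(i) unname(t.test(M[, i])$statistic),
                         mc.cores = n_cores))

# Serial reference for comparison.
t_ser <- apply(M, 2, function(x) unname(t.test(x)$statistic))

all.equal(t_par, t_ser)
```

Passing `mc.cores` explicitly makes the run reproducible across machines whose `mc.cores` option differs.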

EDIT

If you only care about the t-statistic itself, extracting the corresponding field ($statistic) makes things a bit faster, in particular in the multicore case:

> system.time(invisible(apply(M, 2, function(c) t.test(c)$statistic)))
   user  system elapsed 
 80.920   0.437  82.109 
> system.time(invisible(mclapply(1:dim(M)[2], function(i) t.test(M[,i])$statistic)))
   user  system elapsed 
 21.246   1.367  24.107

Or, even faster, compute the t value directly:

my.t.test <- function(c){
  n <- sqrt(length(c))
  mean(c)*n/sd(c)
}

Then

> system.time(invisible(apply(M, 2, function(c) my.t.test(c))))
   user  system elapsed 
 21.371   0.247  21.532 
> system.time(invisible(mclapply(1:dim(M)[2], function(i) my.t.test(M[,i]))))
   user  system elapsed 
144.161   8.658   6.313
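The direct computation can also be vectorised over all columns at once, which avoids calling a function per column. The following is a minimal base-R sketch, not part of the original answer: the helper name `col_t_stats` is hypothetical, and it assumes a one-sample test against a mean of 0 and a matrix without missing values.

```r
# One-sample t statistics (H0: mean = 0) for every column of M, base R only.
# 'col_t_stats' is a hypothetical helper name used for illustration.
col_t_stats <- function(M) {
  n <- nrow(M)
  m <- colMeans(M)
  # Column variances: subtract each column's mean, square, sum, divide by n - 1.
  v <- colSums(sweep(M, 2, m)^2) / (n - 1)
  t_stat <- m / sqrt(v / n)
  # Two-sided p-values from the t distribution with n - 1 degrees of freedom.
  p_val <- 2 * pt(-abs(t_stat), df = n - 1)
  list(statistic = t_stat, p.value = p_val)
}

set.seed(1)
M <- matrix(rnorm(200), nrow = 20)
res <- col_t_stats(M)

# Check against t.test() on the same small matrix.
ref_t <- apply(M, 2, function(x) unname(t.test(x)$statistic))
ref_p <- apply(M, 2, function(x) t.test(x)$p.value)
all.equal(res$statistic, ref_t)
all.equal(res$p.value, ref_p)
```

Because `colMeans` and `colSums` work on the whole matrix in compiled code, this approach should scale well to the 20 x 5e5 matrix from the question, although no timing was measured for it here.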

I think I will just compute t statistics directly, which as you showed, is much faster. — Alex, Jul 12 '12 at 22:28
