I have a dataframe with the following dimensions:
dim(b)
[1] 974 433685
The columns represent variables that I want to run ANOVAs on (i.e., I want to run 433,685 ANOVAs). Sample size is 974. The last column is the 'group' variable.
I've come up with 3 different methods, but all are too slow due to the number of tests.
First, let's generate a small practice dataset to play with:
dat = as.data.frame(matrix(runif(10000*500), ncol = 10000, nrow = 500))
dat$group = rep(letters[1:10], 50)   # 10 groups x 50 = 500 rows
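For reference, that should give a 500 x 10,001 data frame (10,000 response columns plus 'group'):
dim(dat)
[1]   500 10001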
Method 1 (based on 'sapply'):
system.time(sapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))
user system elapsed
143.76 0.33 151.79
Method 2 (based on 'mclapply' from the 'parallel' package):
library(parallel)
options(mc.cores=3)
system.time(mclapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))
user system elapsed
141.76 0.21 142.58
Method 3 (based on 'cbind'-ing the LHS):
formula = as.formula( paste0("cbind(", paste(names(dat)[-length(dat)],collapse=","), ")~group") )
system.time(aov(formula, data=dat))
user system elapsed
10.00 0.22 10.25
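My understanding is that Method 3 is fast because aov() fits all the responses as one multivariate linear model, so a single model-matrix decomposition is shared across every column. A variant that avoids pasting thousands of column names into the formula is to put a numeric matrix on the left-hand side directly (sketch, assuming all non-'group' columns are numeric):
Y = as.matrix(dat[, -length(dat)])
system.time(aov(Y ~ group, data = dat))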
In the practice dataset, Method 3 is a clear winner. However, when I do this on my actual data, computing on just 10 (of 433,685) columns using Method 3 takes this long:
user system elapsed
119.028 5.430 124.414
I'm not sure why it takes substantially longer on my actual data. I have access to a Linux cluster with upwards of 16 cores and 72 GB of RAM.
Is there any way to compute this faster?
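For example, one shortcut I've been wondering about is skipping the aov() model objects entirely and computing the one-way F statistics from group sums with matrix algebra. This is only a rough sketch (the fast_anova name is made up, and it assumes complete numeric data with no NAs), so I'd also welcome pointers to a better-established approach:
fast_anova = function(Y, g) {
  # One-way ANOVA F statistic for every column of Y against grouping factor g
  Y = as.matrix(Y)
  g = as.factor(g)
  n = nrow(Y); k = nlevels(g)
  G = model.matrix(~ g - 1)             # n x k indicator matrix of group membership
  n_j = colSums(G)                      # group sizes
  grand = colMeans(Y)                   # grand mean of each column
  grp_means = crossprod(G, Y) / n_j     # k x p matrix of group means
  ss_between = colSums(n_j * sweep(grp_means, 2, grand)^2)
  ss_total = colSums(sweep(Y, 2, grand)^2)
  ss_within = ss_total - ss_between
  Fstat = (ss_between / (k - 1)) / (ss_within / (n - k))
  data.frame(F = Fstat, p = pf(Fstat, k - 1, n - k, lower.tail = FALSE))
}
system.time(res <- fast_anova(dat[, -length(dat)], dat$group))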