
I have a data frame with the following dimensions:

dim(b)  
[1]    974 433685

The columns represent variables that I want to run ANOVAs on (i.e., I want to run 433,685 ANOVAs). Sample size is 974. The last column is the 'group' variable.

I've come up with 3 different methods, but all are too slow due to the number of tests.

First, let's generate a small practice dataset to play with:

dat = as.data.frame(matrix(runif(10000*500), ncol = 10000, nrow = 500))
dat$group = rep(letters[1:10], 50)

Method 1 (based on 'sapply'):

system.time(sapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))

   user  system elapsed 
 143.76    0.33  151.79 

Method 2 (based on 'mclapply' from the 'parallel' package):

library(parallel)
options(mc.cores=3)
system.time(mclapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))

   user  system elapsed 
 141.76    0.21  142.58 

Method 3 (based on 'cbind'-ing the LHS):

form = as.formula(paste0("cbind(", paste(names(dat)[-length(dat)], collapse = ","), ") ~ group"))
system.time(aov(form, data = dat))

  user  system elapsed 
  10.00    0.22   10.25 

In the practice dataset, Method 3 is a clear winner. However, when I do this on my actual data, computing on just 10 (of 433,685) columns using Method 3 takes this long:

   user  system elapsed
119.028   5.430 124.414

I'm not sure why it takes substantially longer on my actual data. I have access to a Linux cluster with upwards of 16 cores and 72 GB of RAM.

Is there any way to compute this faster?
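
For reference, since all I ultimately need from each ANOVA is the F-statistic (see my comment below), the computation reduces to between- and within-group sums of squares, which can be vectorized across all columns at once. Here is a minimal sketch; the f_stats() helper is hypothetical and I haven't benchmarked it on the full data:

# One-way ANOVA F-statistics for every column at once via sums of squares.
# (f_stats is a hypothetical helper, not part of any package.)
f_stats = function(X, group) {
  X = as.matrix(X)
  group = as.factor(group)
  n = nrow(X)
  k = nlevels(group)
  counts = as.vector(table(group))      # group sizes
  grand = colMeans(X)                   # grand mean of each column
  gmeans = rowsum(X, group) / counts    # k x p matrix of group means
  ss_between = colSums(counts * sweep(gmeans, 2, grand)^2)
  ss_within = colSums(sweep(X, 2, grand)^2) - ss_between
  (ss_between / (k - 1)) / (ss_within / (n - k))
}

# usage on the practice data:
Fstat = f_stats(dat[, -length(dat)], dat$group)

Because this is just a few matrix passes over the data, it should also chunk naturally across the cluster cores, e.g. with mclapply() over blocks of columns, if memory becomes the bottleneck.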

  • `433,685 ANOVAs`? Whatever do you want to do this for? There must be a better statistical approach for your problem. – Roland May 28 '15 at 09:23
  • That's a Bonferroni-corrected p-value of 1.152911e-07. – Paul Lemmens May 28 '15 at 11:42
  • Roland, I'm trying to quantify the "batch effect" (by my "group" variable), which is variation from an unwanted technical artefact. I have 433K individual probes for measuring DNA methylation at specific genomic loci. My thought was to sum the individual F-statistics of the ANOVA from each probe. I would then compare this number to that of analogous datasets generated from different pre-processing pipelines to find the one that removed more of the "batch effect." – Chad Johnson May 28 '15 at 13:02
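
Given that goal, a vectorized helper like the hypothetical f_stats() sketch above would reduce each dataset's batch-effect score to a single line:

# sum of per-probe F-statistics, using the hypothetical f_stats() above
batch_effect = sum(f_stats(dat[, -length(dat)], dat$group))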

1 Answer


For simultaneously fitting many general linear models (such as ANOVA) using the same design matrix, the Bioconductor/R limma package provides a very fast lmFit() function. This is how to fit an ANOVA model using limma:

library(limma)

# generate some data (same sizes as your practice data,
# but transposed: limma expects variables in rows and samples in columns)
nrows <- 1e4
ncols <- 5e2
nlevels <- 10
dat <- matrix(
  runif(nrows * ncols), 
  nrow = nrows, 
  ncol = ncols
)
group <- factor(rep(
  letters[1:nlevels], 
  ncols / nlevels
))

# construct the design matrix
# (same as implicitly used in your question)
dmat <- model.matrix(~ group)
# fit the ANOVA model
fit <- lmFit(dat, dmat)

On my laptop this finishes in 0.4-0.45 seconds on data of the same size as the practice set in your question.
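
If, as your comments suggest, you ultimately want a per-probe F-statistic, the usual limma follow-up is eBayes(); just note that it returns moderated (empirical-Bayes shrunken) F-statistics rather than the classical ANOVA F. A sketch, assuming the fit object from above:

# moderated F-test for a group effect: coefficients 2..nlevels of the
# design compare each group with the baseline level, so testing them
# jointly is the ANOVA test for any group difference
fit2 <- eBayes(fit)
tab <- topTable(fit2, coef = 2:nlevels, number = Inf, sort.by = "none")
head(tab$F)        # one moderated F-statistic per row (probe)
head(tab$P.Value)  # corresponding p-values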