
I have a data matrix in which each column holds the measured concentrations of one substance, and I need to regress every substance against every other substance, with some fixed correction covariates. As the design matrix changes at every iteration, functions like fastLm() from the RcppArmadillo package are of no substantial use here (I checked this before).

The very naive and inadvisable idea is a double for loop, like

Ncol <- ncol(mat)
mat1 <- mat2 <- matrix(ncol=Ncol, nrow=Ncol) ## matrices where I'll save what I need

for(i in seq(Ncol)) {
  for(j in seq(Ncol)[-i]) {
    mylm <- lm(mat[,i] ~ mat[,j] + covariates)
    mat1[i,j] <- summary(mylm)$something
    mat2[i,j] <- summary(mylm)$something.else
  }
}

which I am actually running right now, as I had no better idea. I am not familiar with vectorization techniques, but I am fairly sure vectorizing would speed things up considerably.

Does anybody have a suggestion on how to make the computation faster? I have to run the analysis on 4 datasets, with approximately 300, 650, 800, and 2000 columns each...
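One concrete speedup that keeps the loop structure: lm() accepts a matrix response, so for each predictor column j a single multivariate fit can replace the entire inner loop over i (the QR decomposition of the design matrix is then computed once per j instead of Ncol - 1 times). A minimal sketch, using simulated stand-ins for `mat` and `covariates`:

```r
## Sketch: exploit lm()'s matrix-response support so each predictor column j
## needs only one fit. 'mat' and 'covariates' are simulated stand-ins here.
set.seed(1)
mat <- matrix(rnorm(100 * 5), ncol = 5)
covariates <- rnorm(100)

Ncol <- ncol(mat)
mat1 <- mat2 <- matrix(ncol = Ncol, nrow = Ncol)

for (j in seq(Ncol)) {
  fit  <- lm(mat[, -j] ~ mat[, j] + covariates)  # one multivariate fit
  sums <- summary(fit)                           # list: one summary per response
  mat1[-j, j] <- sapply(sums, function(s) coef(s)[2, 4])  # p-value of mat[, j]
  mat2[-j, j] <- sapply(sums, function(s) s$r.squared)    # R-squared
}
```

Each per-response summary is identical to the one from the corresponding single lm() call, so the extracted p-values and R-squared values match the original loop exactly; the saving comes from sharing the design-matrix decomposition across responses.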

  • What are `something` and `something.else` in your case? – Julius Vainora Mar 30 '15 at 10:23
  • Oh, I left that out on purpose. Actually I need three values: the p-value for `mat[,j]`, the R-squared and the number of samples used. Are you thinking of dropping the `lm` part and calculating those values directly from the data? – jeiroje Mar 30 '15 at 10:31
  • The loop might not be such a bad idea in those circumstances. For speeding it up, have you thought of using parallel computing (with a package such as `doSNOW`, for instance)? – Dominic Comtois Mar 30 '15 at 10:36
  • @jeiroje, yes, that is what I am going to try – Julius Vainora Mar 30 '15 at 10:45
  • @Julius I did not give it much thought, but that could actually be the turning point. I should revise my math a bit ;-) – jeiroje Mar 30 '15 at 10:57
  • @DominicComtois I was just about to try it (I am not familiar with parallel computing either), but I got an error message from the lm() and I am currently trying to solve that one instead. But I will definitely do that as soon as the function works properly. – jeiroje Mar 30 '15 at 10:57
  • I checked and got that working with matrices is much faster than using your loop, but then @Andrey Shabalin's approach is even better, so I am not going to finish mine. – Julius Vainora Mar 30 '15 at 12:05
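The direct calculation alluded to in the comments can be sketched as follows: residualize all columns on the covariates once, then every pairwise p-value follows from the correlation matrix of the residuals, since the t-statistic of `mat[,j]` in `lm(mat[,i] ~ mat[,j] + covariates)` equals the partial-correlation t-statistic with n - k - 2 degrees of freedom (k being the number of covariates). A sketch with simulated stand-ins for `mat` and `covariates`:

```r
## Sketch of the correlation-based idea: residualize out the covariates once,
## then get all pairwise p-values from one correlation matrix.
## 'mat' and 'covariates' are simulated stand-ins here.
set.seed(1)
mat <- matrix(rnorm(100 * 5), ncol = 5)
covariates <- rnorm(100)

res <- resid(lm(mat ~ covariates))  # residualize every column at once
n   <- nrow(res)
k   <- 1                            # number of covariate columns (1 here)
r   <- cor(res)                     # all pairwise partial correlations
df  <- n - k - 2                    # residual df of the full per-pair model

tstat <- r * sqrt(df / (1 - r^2))
pval  <- 2 * pt(-abs(tstat), df)
diag(pval) <- NA                    # the diagonal (r = 1) is meaningless
```

With no missing values, the number of samples used is simply `n` for every pair; with missing data the pairwise counts would have to be tracked separately.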

1 Answer


This is pretty much what my package MatrixEQTL does. Here is code for running it on your data. (I cannot test it without your data, so let me know if something is wrong.)

library(MatrixEQTL)

## SlicedData stores variables in rows and samples in columns, hence t()
data = SlicedData$new(t(mat))
cvrt = SlicedData$new(t(covariates))
me = Matrix_eQTL_engine(
    snps = data,
    gene = data,               # same matrix on both sides: all-vs-all regressions
    cvrt = cvrt,
    output_file_name = NULL,   # keep results in memory instead of writing a file
    pvOutputThreshold = 1,     # report all pairs, no p-value filtering
    useModel = modelLINEAR,
    errorCovariance = numeric(),
    verbose = TRUE,
    pvalue.hist = FALSE,
    min.pv.by.genesnp = FALSE,
    noFDRsaveMemory = FALSE)

The t-statistics and p-values will be in `me$all$eqtls`.
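That output is a long table with one row per pair; to get back to the square mat1/mat2 layout from the question it can be pivoted into matrices. The column names below (`snps`, `gene`, `statistic`, `pvalue`) follow the MatrixEQTL output format; since `me` is not available here, a small fabricated table stands in:

```r
## Sketch: pivot the long result table into square matrices.
## A tiny fabricated table stands in for the real me$all$eqtls.
eqtls <- data.frame(
  snps      = c("A", "A", "B"),
  gene      = c("B", "C", "C"),
  statistic = c(2.1, -0.5, 1.3),
  pvalue    = c(0.04, 0.62, 0.20)
)

vars <- union(eqtls$snps, eqtls$gene)
pmat <- tmat <- matrix(NA_real_, length(vars), length(vars),
                       dimnames = list(vars, vars))

## rows = response (gene), columns = predictor (snps),
## matching mat1[i, j] in the question's loop
pmat[cbind(eqtls$gene, eqtls$snps)] <- eqtls$pvalue
tmat[cbind(eqtls$gene, eqtls$snps)] <- eqtls$statistic
```

Indexing with a two-column character matrix fills the named cells directly, so no loop over pairs is needed.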

  • Thanks @Andrey! I am checking your package right now and will let you know how it works! – jeiroje Mar 30 '15 at 12:35
  • Hi @Andrey, sorry but I had no internet access for some days. I cannot say whether it worked, as I could not understand the output in `me$all$eqtls` so well. The reference manual was not so clear either - I do not know if it was just me, but it would be nice to have some more interpretation details in it. To my problem: it turned out a couple of days ago that we will proceed without linear models, so eventually I am not going to need these calculations, but I will keep it in mind if I need them again (and if I get to understand the outputs). Thanks anyway! – jeiroje Apr 10 '15 at 09:48