0

I have 2 large matrices (typically of dimensions 5000*40 and 20000*40). I am trying to create a correlation matrix, where i would like to calculate the correlation of each row from the first matrix to every other row in the second matrix. I have following minimal code, by it takes extremely long time. Any recommendations to speed it up or parallize. Thanks

-Jaison

nprots <- 50 #usually ca. 5000
ngenes <- 1000 #usually ca. 20000

a_mat <- matrix( runif(40*nprots, 120, 116000), ncol=40)
b_mat <- matrix( runif(40*ngenes, 0.1, 1000), ncol=40)
system.time(apply( a_mat, 1, function(xx) 
apply(b_mat, 1, cor, y = xx, use = "pairwise.complete.obs")) -> cor_mat)
JaJ
  • 25
  • 3
  • 1
    Possible duplicate of [Fast correlation in R using C and parallelization](http://stackoverflow.com/questions/18964837/fast-correlation-in-r-using-c-and-parallelization) – Richard Erickson Oct 14 '15 at 16:50
  • Have looked through the SO help files? It looks like this post answer's [your question](http://stackoverflow.com/questions/18964837/fast-correlation-in-r-using-c-and-parallelization). – Richard Erickson Oct 14 '15 at 16:51

1 Answers1

0

Thanks Richard, I did see the post you refered to on SO, but passed it over without thinking too much about it because it did not seem to pertain to my problem. I have 2 matrices across which I perform correlations. A little more thought on this and it ocurred to my dim brain that I can simply extend the solution you pointed out to suit my question. rbind the two matrices, then follow the solution you refered to. Finally I extract out the relevant corner of the correlation matrix. This works faster than my original code using apply and cor. I double checked the answers and all seem ok. So below is my current solution.

fast_cor <- function(a,b) {
  mat  <- rbind(a, b);
  mat  <- mat - rowMeans(mat);
  mat  <- mat / sqrt(rowSums(mat^2));
  cr   <- tcrossprod(mat)
  edge <- dim(a_mat)[1]
  cr   <- t(cr[1:edge, -c(1:edge)])
  return(cr)
}

nprots <- 50 #usually ca. 5000
ngenes <- 10000 #usually ca. 20000

a_mat <- matrix( runif(40*nprots, 120, 116000), ncol=40)
b_mat <- matrix( runif(40*ngenes, 0.1, 1000), ncol=40)
system.time(apply( a_mat, 1, function(xx) 
 apply(b_mat, 1, cor, y = xx, use = "pairwise.complete.obs")) -> c_1)
user  system elapsed 
20.48    0.00   20.48 
system.time(c_2 <- fast_cor(a_mat, b_mat))
user  system elapsed 
1.97    0.11    2.08 
JaJ
  • 25
  • 3
  • For future reference, please cite that other question in your own question and let us know how yours differs. That way people will see you've done some homework and put effort into your question. – Richard Erickson Oct 14 '15 at 18:50
  • Thanks for the tip Richard. First time post. I will attempt to do this next time. – JaJ Oct 14 '15 at 19:19
  • I still have the problem of dealing with NAs. If for example, the first element in a_mat is NA, the current output has the entire first column as NA. Instead, I would like to have the correlation calculated on the remaining data. Any work around this? – JaJ Oct 14 '15 at 19:25