24

When trying to run the cor() function on sparse matrices (of either type dgCMatrix or dgTMatrix) I get the following error:

Error in cor(x) : supply both 'x' and 'y' or a matrix-like 'x'

Converting my matrix to be dense will be very inefficient. Is there an easy way to calculate this correlation (without an all pairs loop?).

Thanks,

  • Ron
Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
Ron
  • 991
  • 3
  • 9
  • 16

4 Answers4

17

EDITED ANSWER - optimized for memory use and speed.

Your error is logic, as a sparse matrix is not recognized by the cor function as a matrix, and there is -yet- no method for correlations in the Matrix package.

There is no function I am aware of that will let you calculate this, but you can easily calculate that yourself, using the matrix operators that are available in the Matrix package :

sparse.cor <- function(x){
  n <- nrow(x)
  m <- ncol(x)
  ii <- unique(x@i)+1 # rows with a non-zero element

  Ex <- colMeans(x)
  nozero <- as.vector(x[ii,]) - rep(Ex,each=length(ii))        # colmeans

  covmat <- ( crossprod(matrix(nozero,ncol=m)) +
              crossprod(t(Ex))*(n-length(ii))
            )/(n-1)
  sdvec <- sqrt(diag(covmat))
  covmat/crossprod(t(sdvec))
}

the covmat is your variance-covariance matrix, so you can calculate that one as well. The calculation is based on selecting the rows where at least one element is non-zero. to the cross product of this one, you add the colmeans multiplied by the number of all-zero rows. This is equivalent to

( X - E[X] ) times ( X - E[X] ) transposed

Divide by n-1 and you have your variance-covariance matrix. The rest is easy.

A test case :

X <- sample(0:10,1e8,replace=T,p=c(0.99,rep(0.001,10)))
xx <- Matrix(X,ncol=5)

> system.time(out1 <- sparse.cor(xx))
   user  system elapsed 
   0.50    0.09    0.59 
> system.time(out2 <- cor(as.matrix(xx)))
   user  system elapsed 
   1.75    0.28    2.05 
> all.equal(out1,out2)
[1] TRUE
Joris Meys
  • 106,551
  • 31
  • 221
  • 263
  • @Joris wouldn't demeaning: `xminEx <- t(t(x) - colMeans(x)) # substract colmeans` break the sparsity? I believe this wouldn't fit in memory for a reasonably sized Matrix. – Ron May 04 '11 at 20:39
  • @Ron : indeed, but how else are you going to calculate a correlation? at one point, you have to calculate (X - E[X])(X - E[X])' – Joris Meys May 04 '11 at 20:52
  • @Joris I'm less versed in all of R's vectorized functions, so I'll move to a two variable example. Suppose I want the correlation of X and Y, both sparse (and very long) vectors. I would like to avoid addition or subtraction of non-zero items to all items of a vector. Since I need `(X-E[X])(Y-E[Y])`, I can do it as `(X*Y-E[X]*Y-X*E[Y])+E[X]*E[Y]`. This should maintain sparsity, as I'm assuming R can handle multiplications by zero and adding zeros in an efficient manner. Now, let's see if I can vectorize this to a matrix (matricize?). Help would be appreciated. – Ron May 04 '11 at 20:59
  • @Ron : the problem is that all zeros will become minus the mean of your variable, so it's not that easy to maintain sparsity during the calculations without resorting to loops. – Joris Meys May 04 '11 at 21:06
  • @Joris: Suppose X is of dimension n*p, with p << n (as is many times the case). In such case, X*E[X] should yield a p*p matrix, when E[X] is colmeans[X] in our case. The point is that colMeans(X) is multiplied by the colSums(X), so there's no need to multiply and then sum, it's better to sum and then multiply. See my implementation in my answer. Any improvements would be appreciated. – Ron May 04 '11 at 23:18
  • @Ron : I had more or less the same idea, and tried it out with a sparse matrix : better memory use and faster than the brute as.matrix() See my edited answer. – Joris Meys May 04 '11 at 23:32
  • Thanks! Apparently I can't post my answer until enough time passes. I will post it later and do comparisons. Try this: `sparse.cor2 <- function(x){ n <- nrow(x) covmat <- (crossprod(x)-2*(colMeans(x) %o% colSums(x))+n*colMeans(x)%o%colMeans(x))/(n-1) # variance covariance matrix, maintaining sparsity sdvec <- sqrt(diag(covmat)) # standard deviations of columns covmat/crossprod(t(sdvec)) # correlation matrix }` – Ron May 04 '11 at 23:34
  • @Ron : Sweet!!! it's faster than my answer on less sparse matrices, and about the same on the example i used above. If you post it, I upvote for sure. – Joris Meys May 04 '11 at 23:45
16

This is what I ended up using. Thanks @Joris for all the help!

My x matrix is quite big. Assuming it's size is n * p, n=200k and p=10k in my case.

The trick is to maintain the sparsity of the operations and perform the calculations on p * p matrices instead of p*n x n*p.

Version 1, is more straightforward, but less efficient on time and memory, as the outer product operation is expensive:

sparse.cor2 <- function(x){
    n <- nrow(x)

    covmat <- (crossprod(x)-2*(colMeans(x) %o% colSums(x))
              +n*colMeans(x)%o%colMeans(x))/(n-1)

    sdvec <- sqrt(diag(covmat)) # standard deviations of columns
    covmat/crossprod(t(sdvec)) # correlation matrix
}

Version 2 is more efficient on time (saves several operations) and on memory. Still requires huge amounts of memory for a p=10k matrix:

sparse.cor3 <- function(x){
    memory.limit(size=10000)
    n <- nrow(x)

    cMeans <- colMeans(x)
    cSums <- colSums(x)

    # Calculate the population covariance matrix.
    # There's no need to divide by (n-1) as the std. dev is also calculated the same way.
    # The code is optimized to minize use of memory and expensive operations
    covmat <- tcrossprod(cMeans, (-2*cSums+n*cMeans))
    crossp <- as.matrix(crossprod(x))
    covmat <- covmat+crossp

    sdvec <- sqrt(diag(covmat)) # standard deviations of columns
    covmat/crossprod(t(sdvec)) # correlation matrix
}

Timing comparisons (sparse.cor is @Joris latest version):

> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> 
> object.size(x)
11999472 bytes
> 
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
   0.50    0.06    0.56 
> system.time(corx2 <- sparse.cor2(x))
   user  system elapsed 
   0.17    0.00    0.17 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
   0.13    0.00    0.12 
> system.time(correg <-cor(as.matrix(x)))
   user  system elapsed 
   0.25    0.03    0.29 
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx2)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE

Much larger x matrix:

> X <- sample(0:10,1e8,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> object.size(x)
120005688 bytes
> system.time(corx2 <- sparse.cor2(x))
   user  system elapsed 
   1.47    0.07    1.53 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
   1.18    0.09    1.29 
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
   5.43    1.26    6.71
Ron
  • 991
  • 3
  • 9
  • 16
11

The answer was solved elegantly by @Ron but a slight modification to the solution is a little cleaner and also returns the sample covariance matrix.

sparse.cor4 <- function(x){
    n <- nrow(x)
    cMeans <- colMeans(x)
    covmat <- (as.matrix(crossprod(x)) - n*tcrossprod(cMeans))/(n-1)
    sdvec <- sqrt(diag(covmat)) 
    cormat <- covmat/tcrossprod(sdvec)
    list(cov=covmat,cor=cormat)
}

The simplification comes from this: with an n x p matrix X, and an n x p matrix M of the column means of X:

cov(X) = E[(X-M)'(X-M)] = E[X'X - M'X - X'M + M'M] 

M'X = X'M = M'M, which have (i,j) elements = sum(column i) * sum(column j) / n

= n * mean(column i) * mean(column j)

or written with a row vector m of the column means,

= n * m'm

Then cov(X) = E[X'X - n m'm]


and it is now a smidge faster.

> X <- sample(0:10,1e7,replace=T,p=c(0.9,rep(0.01,10)))
> x <- Matrix(X,ncol=10)
> system.time(corx <- sparse.cor(x))
   user  system elapsed 
  1.139   0.196   1.334 
> system.time(corx3 <- sparse.cor3(x))
   user  system elapsed 
  0.194   0.007   0.201 
> system.time(corx4 <- sparse.cor4(x))
   user  system elapsed 
  0.187   0.007   0.194 
> system.time(correg <-cor(as.matrix(x)))
   user  system elapsed 
  0.341   0.067   0.407 
> system.time(covreg <- cov(as.matrix(x)))
   user  system elapsed 
  0.314   0.016   0.330 
> all.equal(c(as.matrix(corx)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx3)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx4$cor)),c(as.matrix(correg)))
[1] TRUE
> all.equal(c(as.matrix(corx4$cov)),c(as.matrix(covreg)))
[1] TRUE
mike
  • 111
  • 2
  • 3
  • did you mean to write M'X = X'M = M'M or 'M'X + X'M - M'M' in your explanation? – Ron Mar 09 '12 at 23:28
  • I seem to get an `Error` since updating to latest packages (unclear when it first occured) `Error in h(simpleError(msg, call)) : error in evaluating the argument 'x' in selecting a method for function 'as.matrix': requires numeric/complex matrix/vector arguments ` in `covmat <- (as.matrix(crossprod(x)) - n*tcrossprod(cMeans))/(n-1)` – bud.dugong Dec 13 '21 at 21:18
0

Using WGCNA::cor(sparseMat) worked for me.