10

I've got a sparse Matrix in R that's apparently too big for me to run as.matrix() on (though it's not super-huge either). The as.matrix() call in question is inside the svd() function, so I'm wondering if anyone knows a different implementation of SVD that doesn't require first converting to a dense matrix.

Ken Williams
  • 22,756
  • 10
  • 85
  • 147
  • I can't find anything for R. Plenty of stuff for C, Fortran, Python etc. – David Heffernan Feb 09 '11 at 22:38
  • Maybe I'll try out SVDLIBC. It builds as a C library, so if it works well I could in the future wrap it as a module (though my ambition probably won't hold up that long, if history is any guide...). – Ken Williams Feb 10 '11 at 17:32
  • 2
    How about this http://cran.r-project.org/web/packages/irlba/ A fast and memory-efficient method for computing a few approximate singular values and singular vectors of large matrices. – David Dehghan Feb 23 '12 at 08:29

4 Answers4

9

The irlba package has a very fast SVD implementation for sparse matrices.

Zach
  • 29,791
  • 35
  • 142
  • 201
  • 1
    The comment mentioning package `irlba` (third under the question) was late by one year when posted. Your answer duplicates the comment *another year* later... – Ferdinand.kraft May 01 '13 at 01:48
  • 1
    @Ferdinand.kraft I apologize, as I missed the comment. This page is the first result when one searches for "R svd sparse," and considering that `irlba` is the best R package for sparse svd, it seems like an appropriate answer. If the author of the comment would like to post an answer, I'd be happy to upvote it and delete mine. – Zach May 01 '13 at 02:55
7

You can do a very impressive bit of sparse SVD in R using random projection as described in http://arxiv.org/abs/0909.4061

Here is some sample code:

# computes first k singular values of A with corresponding singular vectors
incore_stoch_svd = function(A, k) {
  p = 10              # may need a larger value here
  n = dim(A)[1]
  m = dim(A)[2]

  # random projection of A    
  Y = (A %*% matrix(rnorm((k+p) * m), ncol=k+p))
  # the left part of the decomposition works for A (approximately)
  Q = qr.Q(qr(Y))
  # taking that off gives us something small to decompose
  B = t(Q) %*% A

  # decomposing B gives us singular values and right vectors for A  
  s = svd(B)
  U = Q %*% s$u
  # and then we can put it all together for a complete result
  return (list(u=U, v=s$v, d=s$d))
}
Ted Dunning
  • 1,877
  • 15
  • 12
  • 2
    So I tried it out and it looks like it doesn't give very good results unless I jack `p` way up, in which case it doesn't save much resources. As a test, I made a random sparse 10000x12000 matrix with 1000 nonzero entries sampled as runif(1000), which should have eigenvalues around 0.999 or 1. But this method shows the first few eigenvalues as `0.8461391, 0.8423876, 0.8353727, 0.8321352, 0.8271768, 0.8203687`. – Ken Williams May 09 '11 at 21:02
  • 2
    Read the original paper. If your eigenvalues are all about the same value, then it won't save you much as it stands. In such cases, you need to do a few iterations of squaring your source matrix to get a better spread. – Ted Dunning Jul 16 '11 at 19:43
  • Try it again on a test matrix with limited rank. – Ted Dunning Jul 16 '11 at 19:44
5

So here's what I ended up doing. It's relatively straightforward to write a routine that dumps a sparse matrix (class dgCMatrix) to a text file in SVDLIBC's "sparse text" format, then call the svd executable, and read the three resultant text files back into R.

The catch is that it's pretty inefficient - it takes me about 10 seconds to read & write the files, but the actual SVD calculation takes only about 0.2 seconds or so. Still, this is of course way better than not being able to perform the calculation at all, so I'm happy. =)

Ken Williams
  • 22,756
  • 10
  • 85
  • 147
  • Followup question posted at http://stackoverflow.com/questions/5009026/extract-long-from-r-object. – Ken Williams Feb 15 '11 at 21:11
  • 1
    Giving yourself a checkmark without offering any example data or solution code seems contrary to the principles of SO. – IRTFM Sep 29 '12 at 15:29
  • I'm no longer at the company where I did this, so I don't have access to that code. I meant the checkmark to mean "this is the solution that I actually ended up using." It's pretty different from the other suggestions, and very straightforward to implement. To boot, it's really the "wrong" way to do it, because it requires writing text files representing numeric matrices. So I don't know if it's really worth copying as an example. – Ken Williams Sep 30 '12 at 02:07
  • This is an old question now. Anyone going into the same area now would be well advised to follow Zach's recommendation below to use irlba, ignoring the checkmark on this answer. – mc0e Nov 10 '13 at 04:13
3

rARPACK is the package you need. Works like a charm and is Superfast because it parallelizes via C and C++.

  • In 2017, rArpack's v0.11.0 DESCRIPTION file says: Now rARPACK becomes a simple shell of the RSpectra package. – knb Nov 12 '17 at 21:21
  • Jumping in even more years later, I have to say rARPACK and RSpectra are amazing packages. irlba has a special place in my heart, but RSpectra is amazingly fast – Zach Nov 26 '17 at 00:06