2

How can I run svd and nmf on an extremely sparse matrix of dimensions, say, 70000 x 70000? The sparse version of this matrix can be stored as a less than 700M binary file on disk. Can I factorize it in a sparse format (like file on disk or storable in memory) without reconstructing the whole matrix which will be impossible to store in memory (even hard to store on disk)?

I know there are irlba in R, sklearn and pymf in python. But it seems they need to reconstruct the matrix? The problem of svd is that I cannot save the matrices S,V and D, but what if I specify a K and only save the matrices S_k, V_k and D_k corresponding to k-largest eigenvalue? And as for nmf, I want to factorize it into W with rank = 100, which can be stored in memory.

And if there are certain ways to do so, what is the expected time to compute svd and nmf? Any help will be appreciated!

zdebruine
  • 3,687
  • 6
  • 31
  • 50
fetcher
  • 83
  • 1
  • 10
  • with `Matrix` package `m2 <- Matrix(0, nrow = 7*10^4, ncol = 7*10^4, sparse = TRUE)` is only `281632 bytes`. – Khashaa Jan 03 '15 at 16:46
  • @Khashaa Thanks! I am not familiar with R. So can I svd sparse matrix like that in R? Especially consider that matrices S and D might be dense. – fetcher Jan 03 '15 at 16:53
  • possible duplicate of [SVD for sparse matrix in R](http://stackoverflow.com/questions/4951286/svd-for-sparse-matrix-in-r) – Rentrop Jan 03 '15 at 17:11
  • @Floo0 Thanks! I checked the link and I haven't seen the actual size of matrix in that problem. It seems that irlba works for svd (although I haven't tried yet). But does R save S and D in memory? And moreover, is there any method in R works for NMF? – fetcher Jan 03 '15 at 17:26
  • I read the document of irlba. It allows to specify the number of wanted singular vectors which is useful to limit the computation. At least for svd, irlba seems to be ok. – fetcher Jan 03 '15 at 17:31

2 Answers2

0

You can try to use the rARPACK package, which provides the svds() function that works on sparse matrix and allows you to retrieve only a few singular values/vectors.

See the README page for some examples.

yixuan
  • 774
  • 3
  • 6
0

Yes, I just wrote the RcppML R RcppEigen package for exactly this purpose. It is the fastest NMF implementation for sparse matrices of which I am aware.

GitHub: github.com/zdebruine/RcppML

install.packages("RcppML")
devtools::install_github("zdebruine/RcppML")

You did not say how sparse your matrix is, but based on the file size you quote I'm guessing it could be factorized in 1-5 minutes on an HPC at a very good tolerance.

I've been using RcppML::nmf to factorize datasets of millions of single-cells by 15,000 genes (95% sparse) in minutes on an HPC. It's almost as fast as equivalent irlba.

In RcppML::nmf, the R matrix does need to be loaded but will NOT be copied again in memory (set update_in_place = TRUE to avoid transposing and storing that copy in memory). You are correct that many implementations (including those in python) create a copy of the matrix. Also, any R package using RcppArmadillo or RcppEigen probably uses arma::SpMat or Eigen::SparseMatrix Rcpp classes, which require a deep copy. At 700MB you should be able to have your matrix in memory, or otherwise just use an HPC.

The next best algorithm of which I am aware is rsparse::WRMF R package. It's also really good, but does a shallow copy of the R vectors into Armadillo vectors.

zdebruine
  • 3,687
  • 6
  • 31
  • 50