19

I am doing a text classification task with R, and I obtain a document-term matrix of size 22,490 by 120,000 (only 4 million non-zero entries, i.e. less than 1% of the entries). Now I want to reduce the dimensionality by applying PCA (Principal Component Analysis). Unfortunately, R cannot handle this huge matrix, so I store the sparse matrix in a file in the Matrix Market format, hoping to use some other technique to do the PCA.

Could anyone give me some hints for useful libraries (in any programming language) that can do PCA on a matrix of this scale with ease, or for how to do a longhand PCA myself, i.e. calculate the covariance matrix first, and then calculate its eigenvalues and eigenvectors?

What I want is to calculate all the PCs (120,000) and keep only the top N PCs that account for 90% of the variance. Obviously, in this case, I have to set a threshold a priori to zero out some very tiny variance values (in the covariance matrix); otherwise the covariance matrix will not be sparse and its size would be 120,000 by 120,000, which is impossible to handle on a single machine. Also, the loadings (eigenvectors) will be extremely large and should be stored in sparse format.

Thanks very much for any help!

Note: I am using a machine with 24 GB of RAM and 8 CPU cores.

Ensom Hodder
  • 1,522
  • 5
  • 18
  • 35
  • I am not sure if it's 100% right, but I think that MATLAB can do the job. – Anton May 23 '12 at 10:58
  • If you don't get any joy here, it might be worth asking on http://stats.stackexchange.com/ – NPE May 23 '12 at 11:08
  • @aix Thanks for your advice, I've moved it to the Computational Science beta and got some useful hints. You may also follow it at this [URL](http://scicomp.stackexchange.com/questions/2313/apply-pca-on-very-large-sparse-matrix) – Ensom Hodder May 24 '12 at 09:23
  • @EnsomHodder: I didn't know about that new site. Looks interesting, thanks for pointing out :) – NPE May 24 '12 at 10:14

5 Answers

15

The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I've never tried it, though).

Disclaimer: I'm on the scikit-learn development team.

EDIT: the sparse matrix support from RandomizedPCA has been deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
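
A minimal sketch of that pipeline (untested; the filename and the number of components are placeholders you would choose for your own data):

from scipy.io import mmread
from sklearn.decomposition import TruncatedSVD

X = mmread("dtm.mtx").tocsr()                # parse the Matrix Market file into a sparse CSR matrix ("dtm.mtx" is a placeholder)
svd = TruncatedSVD(n_components=300)         # number of components to keep, chosen up front (placeholder)
X_reduced = svd.fit_transform(X)             # dense (n_docs x 300) reduced representation
print(svd.explained_variance_ratio_.sum())   # fraction of variance retained by the kept components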

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • Thanks a lot @larmans, to some degree your proposed method can do PCA with the sparse matrix, but it can only calculate a small number of PCs, due to the large memory consumption :-( – Ensom Hodder May 23 '12 at 14:56
  • 2
    Note that `RandomizedPCA` has been deprecated in favor of `PCA` with keyword argument `svd_solver='randomized'` – BallpointBen Dec 05 '16 at 16:49
7

Instead of running PCA, you could try Latent Dirichlet Allocation (LDA), which decomposes the document-word matrix into a document-topic and a topic-word matrix. Here is a link to an R implementation: http://cran.r-project.org/web/packages/lda/ - there are quite a few other implementations out there, though, if you google.

With LDA you need to specify a fixed number of topics (similar to principal components) in advance. A potentially better alternative is HDP-LDA (http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz), which learns the number of topics that form a good representation of your corpus.

If you can fit your dataset in memory (which it seems like you can), then you also shouldn't have a problem running the LDA code.
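
For completeness, here is a rough sketch of the same idea in Python, using scikit-learn's LatentDirichletAllocation rather than the R package linked above (the filename and the topic count are placeholders):

from scipy.io import mmread
from sklearn.decomposition import LatentDirichletAllocation

X = mmread("dtm.mtx").tocsr()                                      # sparse document-term count matrix ("dtm.mtx" is a placeholder)
lda = LatentDirichletAllocation(n_components=100, random_state=0)  # fixed number of topics, chosen in advance (placeholder)
doc_topics = lda.fit_transform(X)                                  # document-topic matrix (n_docs x 100)
topic_words = lda.components_                                      # topic-word matrix (100 x n_terms)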

As a number of people on the scicomp forum pointed out, there should be no need to compute all 120k principal components. Algorithms like power iteration (http://en.wikipedia.org/wiki/Power_iteration) calculate the largest eigenvalues of a matrix, and LDA algorithms will converge to a minimum-description-length representation of the data given the number of topics specified.

user1149913
  • 4,463
  • 1
  • 23
  • 28
1

In R, big.PCA from the bigpca package (http://cran.r-project.org/web/packages/bigpca/bigpca.pdf) does the job.

Andres Kull
  • 4,756
  • 2
  • 15
  • 13
0

> text classification task

I solved an almost identical problem using a technique for PCA of sparse matrices. This technique can handle very large sparse matrices. The results show that such a simple PCA outperforms word2vec, which suggests that the simple PCA also outperforms LDA.

niitsuma
  • 117
  • 1
  • 4
0

I suppose you wouldn't be able to compute all the principal components, but you can still obtain a reduced-dimension version of your dataset matrix. I've implemented a simple routine in MATLAB which can easily be replicated in Python.

Compute the covariance matrix of your input dataset and convert it to a dense matrix. Assuming S is your 120,000-by-22,490 sparse input matrix, this would look like:

Smul = full(S.'*S);      % Gram matrix S'*S as a dense 22,490-by-22,490 matrix
Sm = full(mean(S));      % row vector of column means
Sm2 = 120000*(Sm.'*Sm);  % mean correction term n*mu'*mu
Scov = Smul - Sm2;       % (unnormalised) covariance matrix

Apply the eigs function to the covariance matrix to obtain its N dominant eigenvectors:

[V,D] = eigs(Scov,N);   % columns of V are the N dominant eigenvectors, D holds the corresponding eigenvalues

Then obtain the PCs by projecting the zero-centered matrix onto the eigenvectors:

Sr = (S-Sm)*V;   % implicit expansion subtracts the column means from every row; equivalently Sr = S*V - Sm*V, which avoids densifying S

Sr is the reduced-dimension version of S.
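
For reference, here is a rough Python translation of the same routine (a sketch, untested at this scale; the small random matrix and the number of components N stand in for the real input):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

S = sp.random(1000, 200, density=0.01, format="csr")   # stand-in for the real 120,000 x 22,490 sparse matrix
N = 50                                                  # number of components to keep (placeholder)
n = S.shape[0]
Sm = np.asarray(S.mean(axis=0))                         # 1 x d row vector of column means
Scov = (S.T @ S).toarray() - n * (Sm.T @ Sm)            # dense (unnormalised) covariance matrix, as in the MATLAB code
vals, V = eigsh(Scov, k=N)                              # N dominant eigenpairs (eigsh returns them in ascending order)
V = V[:, np.argsort(vals)[::-1]]                        # reorder columns by decreasing eigenvalue
Sr = S @ V - Sm @ V                                     # project the mean-centred data without densifying S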

Samadi
  • 41
  • 6