12

I have about 1000 vectors x_i of dimension 50000, but they are very sparse; each has only about 50-100 nonzero elements. I want to do PCA on this dataset (in MATLAB) to reduce the unneeded extreme dimensionality of the data.

Unfortunately, I don't know any way to do this without an intermediate full matrix due to the need to subtract the means from all examples. And of course, a 1000x50000 matrix is too big to fit into memory (it actually crashes my entire computer for some reason when I try). Matlab's built in princomp crashes my computer when I try to use it, too.

So my question is: is there a way to do PCA on this data without requiring a massive non-sparse matrix as an intermediate step?

Sean
  • 3,002
  • 1
  • 26
  • 32

6 Answers6

6

You don't need to form the full data matrix to subtract the means, OR to compute the covariance matrix. Just compute the 1000x1000 covariance matrix iteratively (loop through the data vectors). Once you have formed the covariance matrix, you can implicitly subtract the means by centering the covariance matrix. See the section at the end of this paper on kernel PCA explaining how to center a kernel matrix. Just consider the kernel matrix basically the same as the covariance matrix.

flubdub
  • 76
  • 1
  • Actually, if you are representing the data matrix with the 'sparse matrix' type in MATLAB, you don't need to calculate covariance matrix iteratively. Just make sure you don't subtract the means beforehand, but instead center the resulting covariance matrix. – flubdub Nov 19 '12 at 19:03
1

To calculate the PCA of mentioned dataset, the algorithm just needs to operate on the 1000x1000 covariance matrix. This shouldn't be a big deal for most PCA implementations, I guess. If you are using Windows 7 computer, you could try to use a 64bit implementation of PCA. I am not sure that Matlab supports 64bit PCA, but application like VisuMap can easily handle those cases.

1

The following strategy works:

[~,~,PC] = svds(X,k);
mu = mean(X);
S = sparse(size(X,1),k);
for i=1:size(X,1)
    S(i,:) = (X(i,:)-mu)*PC;
end

The right singular vectors of X are the eigenvectors of cov(X,1), and thus the principal components of X. By computing the principal component scores instancewise instead of all at once, you can avoid the memory overflows that come with transitioning from sparse to full. Just be sure to make k<<p and you should be fine.

PikalaxALT
  • 327
  • 3
  • 12
0

You don't need to use princomp. This answer will explain how you do it with eig. Replace eig with eigs.

Community
  • 1
  • 1
Jacob
  • 34,255
  • 14
  • 110
  • 165
  • That answer doesn't help me unfortunately... I know how to find the principal components of a dataset, it's just the massive size that's screwing me up. I can't do something like `X = bsxfun(@minus, meas, mean(meas));` because it turns my sparse matrix into a full one too large to fit into memory. – Sean Nov 16 '12 at 23:50
  • Sorry, I should say: calling `cov(X)` on my dataset has the same problem. – Sean Nov 16 '12 at 23:53
  • @Sean: This also might be of help: [MATLAB is running out of memory but it should not be](http://stackoverflow.com/a/3181851/97160) – Amro Dec 18 '12 at 09:48
0

First, you don't need the covariance matrix to subtract the mean.

Then to compute the PCs, see answers to this question.

Community
  • 1
  • 1
chaohuang
  • 3,965
  • 4
  • 27
  • 35
  • I know that, it's not specifically the covariance matrix that's breaking things, it's that any method I know of (including the popular one involving computation of a covariance matrix, or just subtracting the mean and doing SVD, etc...) involves a non-sparse matrix too big to fit into memory. Not sure what you're referencing within the question you linked... I'm looking for either a way to do this within Matlab or a generic mathematical answer. – Sean Nov 17 '12 at 09:22
0

For the top PC, see iterative PCA; this accumulates sums of 50k dense . 50k sparse, should work.
For the second one, subtract off the first on the fly, i.e. use (X - U1 d1 Vt1) without instantiating it.
(Randomized PCA does that in Python scikit-learn, Matlab dunno.)

denis
  • 21,378
  • 10
  • 65
  • 88