
I'm trying to apply PCA to my data, which has been standardized, using princomp(x).

The data is <16 x 1036800 double>. This runs out of memory, which would be expected, except that this is a new computer with 24GB of RAM for data mining. MATLAB even lists the 24GB as available in a memory check.

Is MATLAB actually running out of memory while performing the PCA, or is MATLAB not using the RAM to its full potential? Any information or ideas would be helpful. (I may need to increase the virtual memory, but I assumed the 24GB would have sufficed.)

user379362
  • This question may be better suited for Serverfault. – Miro A. Jul 05 '10 at 18:55
  • @Miro A.: hmmmm no it is not. Nothing to do with servers or networks. – nico Jul 05 '10 at 19:22
  • @nico you are probably right. I was thinking of Superuser.com, not Serverfault - my bad. – Miro A. Jul 06 '10 at 14:39
  • Don't forget that to hold a single matrix/vector you need a **contiguous** block of memory available ... and that is certainly not 24GB. Besides the total memory available, the `memory` command should also tell you the maximum size (largest contiguous block) for a single variable. – Hoki Apr 22 '15 at 17:48
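
As a quick illustration of that last comment: on Windows MATLAB the `memory` function returns a struct containing both figures (field names below are from its documented output):

    user = memory;                                     % memory statistics struct
    fprintf('free for all arrays : %.1f GB\n', user.MemAvailableAllArrays/2^30);
    fprintf('largest single array: %.1f GB\n', user.MaxPossibleArrayBytes/2^30);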

3 Answers


For a data matrix of size n-by-p, PRINCOMP returns a coefficient matrix of size p-by-p, where each column is a principal component expressed in the original dimensions. In your case, that output matrix alone would take:

1036800*1036800*8 bytes ~ 7.8 TB
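
A quick back-of-the-envelope check of that figure (just the arithmetic, not part of the solution):

    p = 1036800;                               % number of variables (columns)
    fprintf('%.1f TB\n', p*p*8 / 2^40);        % p-by-p matrix of doubles -> ~7.8 TB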

Consider using PRINCOMP(X,'econ') to return only the PCs whose variance is not necessarily zero (at most n-1 of them here).
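
For example (a sketch; with n = 16 rows the returned COEFF should then be p-by-(n-1) rather than p-by-p):

    [coeff, score, latent] = princomp(X, 'econ');   % economy-size: far smaller outputs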

Alternatively, consider performing PCA by SVD: in your case n << p, so the p-by-p covariance matrix (proportional to X'X for centered data) is impossible to form, let alone decompose. It is sufficient to decompose the much smaller n-by-n matrix XX' instead, or equivalently to take the economy-size SVD of the centered data directly. Refer to this paper for details.


EDIT:

Here's my implementation; the outputs of this function match those of PRINCOMP (the first three, anyway):

function [PC,Y,varPC] = pca_by_svd(X)
    % PCA_BY_SVD
    %   X      data matrix of size n-by-p where n<<p
    %   PC     columns are first n principal components
    %   Y      data projected on those PCs
    %   varPC  variance along the PCs
    %

    X0 = bsxfun(@minus, X, mean(X,1));     % shift data to zero-mean
    [U,S,PC] = svd(X0,'econ');             % SVD decomposition
    Y = X0*PC;                             % project X on PC
    varPC = diag(S'*S)' / (size(X,1)-1);   % variance along each PC (sample variance)
end

I just tried it on my 4GB machine, and it ran just fine:

» x = rand(16,1036800);
» [PC, Y, varPC] = pca_by_svd(x);
» whos
  Name             Size                     Bytes  Class     Attributes

  PC         1036800x16                 132710400  double              
  Y               16x16                      2048  double              
  varPC            1x16                       128  double              
  x               16x1036800            132710400  double              

Update:

The princomp function has since been deprecated in favor of pca, introduced in R2012b, which includes many more options.
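
For example, the equivalent call with the newer interface would look like this (to my understanding the 'Economy' option defaults to true, so the economy-size behaviour is already the default):

    [coeff, score, latent] = pca(X);                  % R2012b and later
    [coeff, score, latent] = pca(X, 'Economy', true); % same thing, stated explicitly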

Amro
  • for those interested, see [this explanation](http://math.stackexchange.com/a/3871/133) on Math.SE of the relationship between PCA and SVD – Amro Dec 18 '12 at 09:54
  • May I ask why you used `varPC = diag(S'*S)' / size(X,1);` instead of `varPC = diag(S'*S)' / (size(X,1)-1);`? In my case, using `(size(X,1)-1)` as the denominator gives me the same results as `princomp`. Is it the same as "sample variance vs. population variance"? – Sibbs Gambling Apr 22 '15 at 16:48
  • May I ask two follow-up questions? (1) Why do I sometimes get a "flipped" eigenvector this way (as compared with `princomp`)? I understand this doesn't matter for the definition's sake, but when I project with differently-signed eigenvectors, I end up with different projected data. (2) I understand why we center the data, but why don't we project the uncentered data? Projecting the processed rather than the raw data seems a bit odd to me. – Sibbs Gambling Apr 23 '15 at 02:21
  • @SibbsGambling: Sorry for the late reply; (1) here are some related answers: http://stackoverflow.com/a/13041293/97160, http://stackoverflow.com/a/18152804/97160, http://stackoverflow.com/a/13704582/97160, http://scicomp.stackexchange.com/a/16240/386 (2) what we usually do is subtract the mean first, perform the PCA transform and project the data, then add the original mean back (after all, it's just a translation) – Amro Apr 27 '15 at 14:23
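
A small sketch of both points in that last comment, reusing the variable names from the function above (the sign convention shown is just one common choice, not necessarily what `princomp` uses internally):

    % (1) each PC is only defined up to sign; e.g. flip each column so that
    %     its largest-magnitude entry is positive
    [~, idx] = max(abs(PC), [], 1);
    sgn = sign(PC(sub2ind(size(PC), idx, 1:size(PC,2))));
    PC  = bsxfun(@times, PC, sgn);

    % (2) center, project, and add the mean back when reconstructing
    mu   = mean(X,1);
    Y    = bsxfun(@minus, X, mu) * PC;
    Xrec = bsxfun(@plus, Y*PC', mu);    % back in the original (uncentered) space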

MATLAB has hardcoded limitations on matrix sizes. See this link. If you think you're not exceeding those limits, then you probably have a bug in your code and actually are.
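
One way to query that limit on your installation (a sketch; the second output of `computer` is the maximum number of elements allowed in a single array):

    [arch, maxElems] = computer;
    fprintf('%s: at most %g elements per array\n', arch, maxElems);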

Donnie

MathWorks engineer Stuart McGarrity recorded a nice webinar surveying diagnosis techniques and common solutions. If your data is indeed within the allowed limits, the issue might be memory fragmentation, which is easily solvable.
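
If fragmentation does turn out to be the problem, one long-standing command-line workaround (a suggestion on my part, not necessarily what the webinar recommends) is to consolidate the workspace:

    » clearvars -except x      % keep only the variables you still need
    » pack                     % defragment by saving the workspace to disk and reloading it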

Ofek Shilon