3

I am trying to understand principal component analysis in Matlab,

There seems to be at least 3 different functions that do it.

I have some questions re the code below:

  1. Am I creating approximate x values using only one eigenvector (the one corresponding to the largest eigenvalue) correctly? I think so??

  2. Why are PC and V which are both meant to be the loadings for (x'x) presented differently? The column order is reversed because eig does not order the eigenvalues with the largest value first but why are they the negative of each other?

  3. Why are the eig values not in ordered with the eigenvector corresponding to the largest eigenvalue in the first column?

  4. Using the code below I get back to the input matrix x when using svd and eig, but the results from princomp seem to be totally different? What so I have to do to make princomp match the other two functions?

Code:

x=[1 2;3 4;5 6;7 8 ]

econFlag=0;

[U,sigma,V] = svd(x,econFlag);%[U,sigma,coeff] = svd(z,econFlag);

U1=U(:,1);
V1=V(:,1);
sigma_partial=sigma(1,1);

score1=U*sigma;
test1=score1*V';

score_partial=U1*sigma_partial;
test1_partial=score_partial*V1';



[PC, D] = eig(x'*x)

 score2=x*PC;
test2=score2*PC';

PC1=PC(:,2);
score2_partial=x*PC1;
 test2_partial=score2_partial*PC1';

[o1 o2 o3]=princomp(x);
A. Donda
  • 8,381
  • 2
  • 20
  • 49
Bazman
  • 2,058
  • 9
  • 45
  • 65
  • 1
    Your code lacks the part that (tries to) reconstruct `x` from the output of `princomp`. – A. Donda Jan 20 '14 at 21:30
  • 1
    related: http://stackoverflow.com/a/4403027, http://stackoverflow.com/a/3181851 – Amro Jan 20 '14 at 22:26
  • 1
    See this question regarding the order of the eigenvectors: http://stackoverflow.com/q/13704384/97160 – Amro Jan 20 '14 at 22:31

1 Answers1

5
  1. Yes. According to the documentation of svd, diagonal elements of the output S are in decreasing order. There is no such guarantee for the the output D of eig though.

  2. Eigenvectors and singular vectors have no defined sign. If a is an eigenvector, so is -a.

  3. I've often wondered the same. Laziness on the part of TMW? Optimization, because sorting would be an additional step and not everybody needs 'em sorted?

  4. princomp centers the input data before computing the principal components. This makes sense as normally the PCA is computed with respect to the covariance matrix, and the eigenvectors of x' * x are only identical to those of the covariance matrix if x is mean-free.


I would compute the PCA by transforming to the basis of the eigenvectors of the covariance matrix (centered data), but apply this transform to the original (uncentered) data. This allows to capture a maximum of variance with as few principal components as possible, but still to recover the orginal data from all of them:

[V, D] = eig(cov(x));

score = x * V;
test = score * V';

test is identical to x, up to numerical error.

In order to easily pick the components with the most variance, let's fix that lack of sorting ourselves:

[V, D] = eig(cov(x));
[D, ind] = sort(diag(D), 'descend');
V = V(:, ind);

score = x * V;
test = score * V';

Reconstruct the signal using the strongest principal component only:

test_partial = score(:, 1) * V(:, 1)';

In response to Amro's comments: It is of course also possible to first remove the means from the input data, and transform these "centered" data. In that case, for perfect reconstruction of the original data it would be necessary to add the means again. The way to compute the PCA given above is the one described by Neil H. Timm, Applied Multivariate Analysis, Springer 2002, page 446:

Given an observation vector Y with mean mu and covariance matrix Sigma of full rank p, the goal of PCA is to create a new set of variables called principal components (PCs) or principal variates. The principal components are linear combinations of the variables of the vector Y that are uncorrelated such that the variance of the jth component is maximal.

Timm later defines "standardized components" as those which have been computed from centered data and are then divided by the square root of the eigenvalues (i.e. variances), i.e. "standardized principal components" have mean 0 and variance 1.

A. Donda
  • 8,381
  • 2
  • 20
  • 49
  • 1
    when you sort the eigenvectors, I think you should only reorder the columns, as in `V = V(:,ind)`. Also like you mentioned, `x` must be zero-centered before multiplying it by `V` to compute `score` – Amro Jan 20 '14 at 22:31
  • 1
    You are right about the first thing, thanks for pointing it out! – But I disagree about the second. If the principal components should provide a maximum-variance decomposition, then the argument of `eig` needs to be computed from centered data. But as I wrote, I think it is better to compute the "scores", or actually principal components, on non-centered data, because only this way the original data can be perfectly recovered, i.e. only this way PCA provides a decomposition. – A. Donda Jan 20 '14 at 22:39
  • 1
    well yes, if you want to reconstruct the original data (possibly from truncated set of eigenvectors), you have to add the mean vector back after the multiplication. My point was just to get a `scores` variables similar to that returned by [`princomp`](http://www.mathworks.com/help/stats/princomp.html) function – Amro Jan 20 '14 at 22:48
  • 1
    @Amro, you can do it either way. I prefer this way because it is simpler, you don't need any special treatment of the means. My way is also the one described by Neil H. Timm, *Applied Multivariate Analysis*, Springer 2002, page 446. I now include the reference. – A. Donda Jan 20 '14 at 22:57
  • Thank you. I actually wrote it this way out of my gut preference, and you made me look it up. :-) – A. Donda Jan 20 '14 at 23:06
  • Thanks for all the answers. Just to check does anyone know for sure whether svd sorts the eigenvalues (and their associated eigenvectors) in descending order? I've never seen it do it any other way, although I suppose I am bading that on a relatively small sample. – Bazman Jan 21 '14 at 12:19
  • 2
    @Bazman: yes. Unlike `eig`, [`svd`](http://www.mathworks.com/help/matlab/ref/svd.html) explicitly states in the documentation that singular values are returned in decreasing order. – Amro Jan 21 '14 at 12:56