
My code:

from numpy import *

def pca(orig_data):
    data = array(orig_data)
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    u, s, v = linalg.svd(data)
    print s  # should be s**2 instead!
    print v

def load_iris(path):
    lines = []
    with open(path) as input_file:
        lines = input_file.readlines()
    data = []
    for line in lines:
        cur_line = line.rstrip().split(',')
        cur_line = cur_line[:-1]
        cur_line = [float(elem) for elem in cur_line]
        data.append(array(cur_line))
    return array(data)

if __name__ == '__main__':
    data = load_iris('iris.data')
    pca(data)

The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Output:

[ 20.89551896  11.75513248   4.7013819    1.75816839]
[[ 0.52237162 -0.26335492  0.58125401  0.56561105]
 [-0.37231836 -0.92555649 -0.02109478 -0.06541577]
 [ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
 [ 0.26199559 -0.12413481 -0.80115427  0.52354627]]

Desired Output:
Eigenvalues - [2.9108 0.9212 0.1474 0.0206]
Principal Components - Same as I got but transposed so okay I guess

Also, what's with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:

cov_mat = cov(orig_data)
val, vec = linalg.eig(cov_mat)
print val

But it doesn't really match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should get 4 eigenvalues, not the 150 that eig gives me. Am I doing something wrong?

edit: I've noticed that the values differ by a factor of 150, which is the number of samples in the dataset. Also, the eigenvalues are supposed to add up to the number of dimensions, in this case 4. What I don't understand is why this difference occurs. If I simply divide the eigenvalues by len(data) I get the result I want, but I don't understand why. Either way, the proportions of the eigenvalues aren't altered, but they are important to me, so I'd like to understand what's going on.
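Here is a minimal sketch of what I mean, using a random 150 x 4 array as a stand-in for the iris data: dividing the squared singular values by N reproduces the eigenvalues of the correlation matrix.

import numpy as np

# stand-in for the iris data: 150 samples, 4 features (random values, just to show the scaling)
data = np.random.randn(150, 4)

# standardize exactly as in pca() above
z = (data - data.mean(axis=0)) / data.std(axis=0)

# SVD route: squared singular values divided by N
u, s, vt = np.linalg.svd(z, full_matrices=False)
eig_from_svd = s ** 2 / len(z)

# eigendecomposition route: eigenvalues of the correlation matrix (eigvalsh is ascending, so reverse)
eig_from_corr = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))[::-1]

print(np.allclose(eig_from_svd, eig_from_corr))   # True
print(eig_from_svd.sum())                         # 4.0 (the eigenvalues sum to the number of dimensions)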

  • They are not incorrect. Eigenvalues/eigenvectors are not completely determined (eigenvectors can vary by any scale factor, and even flip sign). The actual values you get depend on the algorithm. (If you do a search you will find a lot of similar queries about obtaining eigenvectors with the 'wrong' sign, which probably explain the issue better.) – robince Jan 30 '11 at 18:03

4 Answers


You decomposed the wrong matrix.

Principal Component Analysis requires the eigenvectors/eigenvalues of the covariance matrix, not of the data itself. The covariance matrix built from an m x n data matrix (m observations of n features) is an n x n square matrix; in its normalized (correlation) form it has ones along the main diagonal.

You can indeed use the cov function, but you need further manipulation of your data. It's probably a little easier to use a similar function, corrcoef:

import numpy as NP
import numpy.linalg as LA

# a simulated data set with 8 data points, each point having five features
data = NP.random.randint(0, 10, 40).reshape(8, 5)

# usually a good idea to mean center your data first
# (the subtraction also promotes the integer array to float)
data = data - NP.mean(data, axis=0)

# calculate the correlation matrix (a normalized covariance matrix);
# returns a square matrix, here 5 x 5

# now get the eigenvalues/eigenvectors of C:
eval, evec = LA.eig(C)

To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) LA module; it is a little easier to work with than svd, and the return values are the eigenvectors and eigenvalues themselves, nothing else. By contrast, as you know, svd doesn't return these directly.

Granted, the SVD function will decompose any matrix, not just square ones (to which the eig function is limited); however, when doing PCA this way you will always have a square matrix to decompose, regardless of the shape of your data, because the matrix being decomposed is a covariance (or correlation) matrix. Its rows and columns both correspond to the variables of the original data, each cell holds the covariance of a pair of variables, and in the correlation form the main diagonal is all ones (every variable is perfectly correlated with itself).
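As a possible next step, here is a rough sketch of sorting the components and projecting the data, continuing from the C, eval, and evec variables above (the k = 2 cutoff is just an arbitrary choice for illustration):

# LA.eig does not sort its output, so order the components by decreasing eigenvalue
order = NP.argsort(eval)[::-1]
evals, evecs = eval[order], evec[:, order]

# keep the first k components and project the (centered) data onto them
# (for correlation-based PCA the data would normally be standardized as well)
k = 2
projected = NP.dot(data, evecs[:, :k])   # shape (8, 2) for the simulated data above

# fraction of the total variance carried by each retained component
explained = evals[:k] / evals.sum()

Note that svd returns its singular values already sorted in decreasing order, whereas eig makes no ordering guarantee, hence the explicit argsort here.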

doug
  • Sometimes it is worth calculating the means and standard deviations separately, and then calculating covariance. This is useful if, for example, one wants to reverse the process after manipulating the principal components. – Predictor Jan 26 '11 at 11:14
  • I agree; in this case, the OP seemed to want something that 'just worked', and I find doing it this way (often) simpler, particularly when working from the command line. – doug Jan 26 '11 at 12:36
  • I think this is wrong. As you say, PCA results in the eigenvalues/vectors of the covariance matrix, but in almost every case I've seen these can be obtained by performing SVD on the (de-meaned) data matrix itself. This performs much better since it removes the need to calculate the full covariance matrix, and is also more numerically stable (I don't know the details, but I understand SVD is more stable than eig). Can't find a more accessible description, but you can see here: http://public.lanl.gov/mewall/kluwer2002.html This is the method used in every serious implementation I have seen. – robince Jan 30 '11 at 17:54
  • Here is a nicer set of notes which explains the relationship between PCA and SVD: http://www.snl.salk.edu/~shlens/pca.pdf – robince Jan 30 '11 at 18:11
  • @robince doug's answer is indeed wrong. SVD can be used to get the eigenvectors directly from the data (with the eigenvalues scaled by (N-1)). The left singular vectors of A (returned by the SVD) are the eigenvectors of AA^T (if A is real). So I think this answer is misleading and should be edited. – Julien Mar 19 '12 at 16:22
  • No, not wrong at all. First, it gives the correct answer. It looks like you somehow forgot this *and* read the last paragraph to say that "svd will not work". That is not what it says. It begins with *the SVD function will decompose any matrix*. I am just expressing a preference for doing PCA by a computationally simpler technique versus SVD ("my preference"). I do *not* say that SVD won't work. In fact, I state that it is the more general solution. Your objection here is "SVD can be used...." No one said otherwise; I said it first in my answer: "SVD function will decompose any matrix...." – doug Mar 19 '12 at 18:45
  • No, the first line is wrong: "You decomposed the wrong matrix." There are basically two ways to do PCA: 1) you compute the eigenvectors of the *covariance* matrix, or 2) you compute the SVD of the *data* matrix, and the left singular vectors are the eigenvectors of the *covariance* matrix. So in the second case (SVD), you don't need to compute the covariance at all. In the original question, he's applying the SVD to the data matrix, which is exactly what he should do. He just forgot the 1/(N-1) normalization, which is why he gets a factor of 150 (N). (See my answer below.) – Julien Mar 21 '12 at 10:25
  • Shouldn't that rather be `NP.cov(data, rowvar=0)` instead of `corrcoef`, @doug? – Simon Righley Jun 11 '13 at 17:21
  • @Julien Does it matter whether the data is column-wise or row-wise, and whether there are more features than samples, when computing the SVD? With u, s, v = np.linalg.svd(data), is u the eigenvectors and s**2 the eigenvalues? Thanks. – June Wang Sep 11 '19 at 04:10

The left singular vectors returned by SVD(A) are the eigenvectors of AA^T.

The covariance matrix of a mean-centered dataset A (with observations as the columns of A) is 1/(N-1) * AA^T.

Now, when you do PCA via the SVD, you have to divide the squared singular values by (N-1) (equivalently, divide the data matrix by sqrt(N-1) before taking the SVD) to get the eigenvalues of the covariance matrix at the correct scale.

In your case, N=150 and you haven't done this division, hence the discrepancy.

This is explained in detail here
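Here is a minimal sketch of that scaling, with a random mean-centered matrix standing in for the real data (observations as rows, matching the layout in the question):

import numpy as np

N, d = 150, 4
A = np.random.randn(N, d)              # observations as rows
A = A - A.mean(axis=0)                 # mean-center

# SVD of the data matrix itself
u, s, vt = np.linalg.svd(A, full_matrices=False)
eig_from_svd = s ** 2 / (N - 1)        # squared singular values, scaled by 1/(N - 1)

# eigenvalues of the covariance matrix, for comparison (eigvalsh is ascending, so reverse)
eig_from_cov = np.linalg.eigvalsh(np.cov(A, rowvar=False))[::-1]

print(np.allclose(eig_from_svd, eig_from_cov))   # True

(In the question the data was also divided by its standard deviation, so the matrix being reproduced there is the correlation matrix; with the population standard deviation used by data.std(), the matching divisor is N = 150.)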

Julien

(Can you ask one question, please? Or at least list your questions separately. Your post reads like a stream of consciousness because you are not asking one single question.)

  1. You probably used cov incorrectly by not transposing the matrix first (by default, cov treats each row as a variable, so a 150-by-4 array yields a 150-by-150 matrix). If cov_mat is 4-by-4, then eig will produce four eigenvalues and four eigenvectors.

  2. Note how SVD and PCA, while related, are not exactly the same. Let X be a 4-by-150 matrix of observations where each 4-element column is a single observation. Then, the following are equivalent:

    a. the left singular vectors of X,

    b. the principal components of X,

    c. the eigenvectors of X X^T.

    Also, the eigenvalues of X X^T are equal to the squares of the singular values of X. To see all this, let X have the SVD X = QSV^T, where S is a diagonal matrix of singular values. Then consider the eigendecomposition D = Q^T X X^T Q, where D is a diagonal matrix of eigenvalues. Replace X with its SVD, and see what happens. (A quick numerical check is sketched below.)
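A quick numerical check of these identities, using a random matrix in place of X (absolute values are compared because eigenvectors are only determined up to sign):

import numpy as np

X = np.random.randn(4, 150)              # 4-by-150: each column is one observation

# SVD: X = Q S V^T
Q, S, Vt = np.linalg.svd(X, full_matrices=False)

# eigendecomposition of X X^T (eigh returns ascending order, so reverse)
w, U = np.linalg.eigh(np.dot(X, X.T))
w, U = w[::-1], U[:, ::-1]

print(np.allclose(S ** 2, w))            # eigenvalues of X X^T equal the squared singular values
print(np.allclose(np.abs(Q), np.abs(U))) # left singular vectors equal the eigenvectors, up to sign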

Steve Tjoa
  • What about a 150-by-4 matrix X, where each column is a sample? u, s, v = np.linalg.svd(X) will return u of shape (150, 150), s of shape (4,), and v of shape (4, 4). But if we do u, s, v = np.linalg.svd(np.dot(X, X^T)/(4-1)), it returns u of shape (150, 150), s of shape (150,), and v of shape (150, 150). But the former s shape doesn't match the eig_val shape if we do eig_val, eig_vec = np.linalg.eig(np.dot(X, X^T)/(4-1)). So – June Wang Sep 11 '19 at 07:16

Question already addressed: Principal component analysis in Python

Foo Bah