Bug in Scikit-Learn PCA or in Numpy Eigen Decomposition?

Question

I have a dataset with 400 features.

What I did:

# approach 1
d_cov = np.cov(d_train.transpose())
eigens, mypca = LA.eig(d_cov) # assume sort by eigen value also/ LA = numpy linear algebra

# approach 2 
pca = PCA(n_components=300)
d_fit = pca.fit_transform(d_train)
pc = pca.components_

Now, these two should be the same, right? as PCA is just the eigendecomposition of the covariance matrix.

But these are very different in my case?

How could that be, I am doing any mistake above?

Comparing variances:

import numpy as np
LA = np.linalg
d_train = np.random.randn(100, 10)
d_cov = np.cov(d_train.transpose())
eigens, mypca = LA.eig(d_cov)

import matplotlib.pyplot as plt


from sklearn.decomposition import PCA
pca =  PCA(n_components=10)
d_fit = pca.fit_transform(d_train)
pc = pca.components_
ve = pca.explained_variance_
#mypca[0,:], pc[0,:] pc.transpose()[0,:]

plt.plot(list(range(len(eigens))), [ x.transpose().dot(d_cov).dot(x) for x,y  in zip(mypca, eigens) ])
plt.plot(list(range(len(ve))), ve)
plt.show()

print(mypca, '\n---\n' , pc)

Possible duplicate of [In sklearn.decomposition.PCA, why are components\_ negative?](https://stackoverflow.com/questions/44765682/in-sklearn-decomposition-pca-why-are-components-negative) — P. Camilleri, Sep 29 '18 at 08:15
@P.Camilleri but even the variances are not the same? At least variances explained by the PCs and the eigenvectors should be the sam, right? Please see the edit. — Rafael, Sep 29 '18 at 09:03
FYI, use pc.T instead of pc.transpose(), it's shorter and you avoid creating a new array — P. Camilleri, Sep 29 '18 at 12:15
@P.Camilleri no, i have used the same formula xTCx for variance along a vector x, where C is the covariance matrix — Rafael, Sep 29 '18 at 18:08
you are iterating over rows when you should iterate over coluns. See my answer. — P. Camilleri, Sep 29 '18 at 18:33
@P.Camilleri, there's no difference between what `pc.T` produces and what `pc.transpose()` does. — hpaulj, Sep 29 '18 at 21:19

score 2 · Answer 1 · answered Sep 29 '18 at 18:32

You need to read the doc more carefully. numpy's doc is great and very thorough, very often you'll find the solution to your problem only by reading it.

Here is a modified version of your code (import on top of snippet, use of .T instead of .transpose(), pep8.)

import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

from numpy import linalg as LA

d_train = np.random.randn(100, 10)
d_cov = np.cov(d_train.transpose())
eigens, mypca = LA.eig(d_cov)

pca = PCA(n_components=10)
d_fit = pca.fit_transform(d_train)
pc = pca.components_
explained = pca.explained_variance_

my_explained = np.sort([x.T.dot(d_cov).dot(x) for x in mypca.T])[::-1]

plt.close('all')
plt.figure()
plt.plot(my_explained, label='me')
plt.plot(explained, label='sklearn')
plt.legend()
plt.show(block=False)

The two curves are exactly the same. The important thing is that I iterate over my_pca.T, not my_pca.

Signature: np.linalg.eig(a)
Docstring:
Compute the eigenvalues and right eigenvectors of a square array.

Parameters
----------
a : (..., M, M) array
    Matrices for which the eigenvalues and right eigenvectors will
    be computed

Returns
-------
w : (..., M) array
    # not important for you

v : (..., M, M) array
    The normalized (unit "length") eigenvectors, such that the
    column ``v[:,i]`` is the eigenvector corresponding to the
    eigenvalue ``w[i]``.

The eigenvectors are returned as columns of my_pca, not rows. for x in my_pca was iterating over rows.

Bill · Answer 2 · 2018-09-29T06:55:45.777

I'm not an expert on PCA but I seem to get similar values if I transpose one of the matrices.

>>> import numpy as np
>>> LA = np.linalg
>>> d_train = np.random.randn(100, 10)
>>> d_cov = np.cov(d_train.transpose())
>>> eigens, mypca = LA.eig(d_cov)
>>> from sklearn.decomposition import PCA
>>> pca =  PCA(n_components=10)
>>> d_fit = pca.fit_transform(d_train)
>>> pc = pca.components_
>>> mypca[0,:]
array([-0.44255435, -0.77430549, -0.14479638, -0.06459874,  0.24772212,
        0.20780185,  0.22388151, -0.05069543, -0.14515676, -0.03385801])
>>> pc[0,:]
array([-0.44255435, -0.24050535, -0.17313927,  0.07182494,  0.09748632,
        0.17910516,  0.26125107,  0.71309764,  0.17276004,  0.25095447])
>>> pc.transpose()[0,:]
array([-0.44255435,  0.77430549,  0.14479638, -0.06459874,  0.24772212,
       -0.20780185,  0.22388151, -0.03385801,  0.14515676,  0.05069543])
>>> list(zip(pc.transpose()[:,0], mypca[:,0]))
[(-0.44255435328718207, -0.44255435328718096),
 (-0.24050535133912765, -0.2405053513391287),
 (-0.17313926714559819, -0.17313926714559785),
 (0.07182494253930383, 0.0718249425393035),
 (0.09748631534772645, 0.09748631534772684),
 (0.17910516453826955, 0.17910516453826758),
 (0.2612510722861703, 0.2612510722861689),
 (0.7130976419217306, 0.7130976419217326),
 (0.17276004381786172, 0.17276004381786136),
 (0.25095447415020183, 0.2509544741502009)]

@mourinho see the duplicate question I've flagged, there is a sign ambiguity when performing PCA — P. Camilleri, Sep 29 '18 at 08:16

Bug in Scikit-Learn PCA or in Numpy Eigen Decomposition?

2 Answers2

Linked