Why are my manual PCA reconstructions not matching python's sklearn's reconstructions?

Question

I was trying to check my implementation of PCA to see if I understood it and I tried to do PCA with 12 components on the MNIST data set (which I got using the tensorflow interface that normalized it for me). I obtained the principal components given by sklearn and then made reconstructions as follow:

pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train)
# do manual PCA
U = pca.components_
my_reconstruct = np.dot(  U.T , np.dot(U, X_train.T) ).T

then I used the reconstruction interface given by sklearn to try to reconstruct as follow:

pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train)
X_reconstruct = pca.inverse_transform(X_pca)

and then checked the error as follow (since the rows are a data point and columns features):

print 'X_recon - X_my_reconstruct', (1.0/X_my_reconstruct.shape[0])*LA.norm(X_my_reconstruct - X_reconstruct)**2
#X_recon - X_my_reconstruct 1.47252586279

the error as you can see is non-zero and actually quite noticeable. Why is it? How is their reconstruction different from mine?

Never implemented PCA myself, but did you consider looking into the sources. Their operations is very different [link](https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/decomposition/pca.py) (which gives a hint about different algorithmics / internal data-structures). — sascha, Aug 06 '16 at 02:20

score 2 · Answer 1 · edited May 23 '17 at 12:10

I see a couple of issues:

The dot product should be X_pca.dot(pca.components_). PCA factorizes your X_train matrix using SVD:

X_train = U·S·Vᵀ.

Here, pca.components_ corresponds to Vᵀ (a (k, n_features) matrix), not U (an (n_datapoints, k) matrix).

The sklearn implementation of PCA is quite readable, and can be found here. I also wrote a pure numpy example in this previous answer.
Did you center X_train by subtracting the mean value for each column before doing the fitting?

The PCA class automatically centers your data and stores the original mean vector in its .mean_ attribute. If the mean vector for your input features was nonzero then you would need to add the mean to your reconstructions, i.e. my_reconstruct += pca.mean_.

Why are my manual PCA reconstructions not matching python's sklearn's reconstructions?

1 Answers1