3

I was trying to check my implementation of PCA to see if I understood it and I tried to do PCA with 12 components on the MNIST data set (which I got using the tensorflow interface that normalized it for me). I obtained the principal components given by sklearn and then made reconstructions as follow:

pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train)
# do manual PCA
U = pca.components_
my_reconstruct = np.dot(  U.T , np.dot(U, X_train.T) ).T

then I used the reconstruction interface given by sklearn to try to reconstruct as follow:

pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train)
X_reconstruct = pca.inverse_transform(X_pca)

and then checked the error as follow (since the rows are a data point and columns features):

print 'X_recon - X_my_reconstruct', (1.0/X_my_reconstruct.shape[0])*LA.norm(X_my_reconstruct - X_reconstruct)**2
#X_recon - X_my_reconstruct 1.47252586279

the error as you can see is non-zero and actually quite noticeable. Why is it? How is their reconstruction different from mine?

ali_m
  • 71,714
  • 23
  • 223
  • 298
Charlie Parker
  • 5,884
  • 57
  • 198
  • 323
  • Never implemented PCA myself, but did you consider looking into the sources. Their operations is very different [link](https://github.com/scikit-learn/scikit-learn/blob/51a765a/sklearn/decomposition/pca.py) (which gives a hint about different algorithmics / internal data-structures). – sascha Aug 06 '16 at 02:20

1 Answers1

2

I see a couple of issues:

  1. The dot product should be X_pca.dot(pca.components_). PCA factorizes your X_train matrix using SVD:

    Xtrain = U·S·Vᵀ.

    Here, pca.components_ corresponds to Vᵀ (a (k, n_features) matrix), not U (an (n_datapoints, k) matrix).

    The sklearn implementation of PCA is quite readable, and can be found here. I also wrote a pure numpy example in this previous answer.

  2. Did you center X_train by subtracting the mean value for each column before doing the fitting?

    The PCA class automatically centers your data and stores the original mean vector in its .mean_ attribute. If the mean vector for your input features was nonzero then you would need to add the mean to your reconstructions, i.e. my_reconstruct += pca.mean_.

Community
  • 1
  • 1
ali_m
  • 71,714
  • 23
  • 223
  • 298