
I can perform PCA in scikit-learn with the code below; X_train has 279180 rows and 104 columns.

from sklearn.decomposition import PCA
import numpy as np

pca = PCA(n_components=30)
X_train_pca = pca.fit_transform(X_train)

Now, when I want to project the eigenvectors onto feature space, I have to do the following:

""" Projection """
comp = pca.components_ #30x104
com_tr = np.transpose(pca.components_) #104x30
proj = np.dot(X_train, com_tr) #279180x104 * 104x30 = 279180x30

But I am hesitant about this step, because the scikit-learn documentation says:

components_: array, [n_components, n_features]

Principal axes in feature space, representing the directions of maximum variance in the data.

It seems to me that it is already projected, but when I checked the source code, it returns only the eigenvectors.

What is the right way to project it?

Ultimately, I am aiming to calculate the MSE of reconstruction.

""" Reconstruct """
recon = np.dot(proj,comp) #297180x30 * 30x104 = 279180x104

"""  MSE Error """
print "MSE = %.6G" %(np.mean((X_train - recon)**2))
– HonzaB

2 Answers


You can do

proj = pca.inverse_transform(X_train_pca)

That way you do not have to worry about how to do the multiplications.
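With that reconstruction in hand, the MSE you are ultimately after can be computed directly; a minimal sketch, assuming X_train and proj are the arrays from above:

import numpy as np

# proj is back in the original 104-dimensional feature space, so the
# reconstruction error is just the elementwise difference from X_train
print("MSE = %.6G" % np.mean((X_train - proj) ** 2))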

What you obtain after pca.fit_transform or pca.transform is what is usually called the "loadings" for each sample, i.e. how much of each component you need to describe it best using a linear combination of the components_ (the principal axes in feature space).

The projection you are aiming at is back in the original signal space. This means that you need to go back into signal space using the components and the loadings.

There are three steps to disambiguate here. Step by step, here is what you can do with the PCA object and how each result is actually calculated:

  1. pca.fit estimates the components (using an SVD of the centered X_train):

     from sklearn.decomposition import PCA
     import numpy as np
     from numpy.testing import assert_array_almost_equal
    
     # small example data set standing in for the question's X_train
     X_train = np.random.randn(100, 50)
    
     pca = PCA(n_components=30)
     pca.fit(X_train)
    
     U, S, VT = np.linalg.svd(X_train - X_train.mean(0))
    
     # components_ match the top right-singular vectors (up to sign flips;
     # see the comment about svd_flip below)
     assert_array_almost_equal(VT[:30], pca.components_)
    
  2. pca.transform calculates the loadings, as you describe:

     X_train_pca = pca.transform(X_train)
    
     X_train_pca2 = (X_train - pca.mean_).dot(pca.components_.T)
    
     assert_array_almost_equal(X_train_pca, X_train_pca2)
    
  3. pca.inverse_transform obtains the projection onto the components in signal space that you are interested in:

     X_projected = pca.inverse_transform(X_train_pca)
     X_projected2 = X_train_pca.dot(pca.components_) + pca.mean_
    
     assert_array_almost_equal(X_projected, X_projected2)
    

You can now evaluate the projection loss

loss = np.sum((X_train - X_projected) ** 2, axis=1).mean()
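Note that this loss sums the squared error over features for each sample and then averages over samples, while the MSE in the question averages over all entries; the two differ only by a factor of n_features. A quick sketch of that relationship, reusing the variables defined above:

mse_per_entry = ((X_train - X_projected) ** 2).mean()
loss_per_sample = np.sum((X_train - X_projected) ** 2, axis=1).mean()

# the per-sample loss is the per-entry MSE scaled by the number of features
assert_array_almost_equal(loss_per_sample, mse_per_entry * X_train.shape[1])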
– eickenberg
  • Ok, so I can call `pca.fit` to calculate the components, and the projection can then be calculated by `pca.fit_transform` (that is also what I want to work with further - feeding the reduced data to some model, since the dimensionality is reduced). And for reconstruction, I call `pca.inverse_transform` to calculate the MSE. Is that correct? – HonzaB Apr 12 '16 at 09:00
  • It depends on what you mean by projection. First, note that `pca.fit_transform(X)` gives the same result as `pca.fit(X).transform(X)` (it is an optimized shortcut). Second, a projection is generally something that goes from one space into the same space, so here it would be from signal space to signal space, with the property that applying it twice is like applying it once. Here it would be `f = lambda X: pca.inverse_transform(pca.transform(X))`. You can check that `f(f(X)) == f(X)` (see the sketch after these comments). So I would call that the projection. `pca.transform` is obtaining the loadings. In the end it's just terminology. – eickenberg Apr 12 '16 at 09:06
  • By projection I mean transforming the vectors onto the feature space. It is what I did in my question (the second step) and it is the same as what `pca.transform(X)` does - the result is an _M x k_ matrix, where _M_ is the number of rows and _k_ the number of selected components. I will use this as input to models (and I should expect better results than using the original dataset). – HonzaB Apr 12 '16 at 09:17
  • Then use the pipeline: `from sklearn.pipeline import make_pipeline`, then `pipeline = make_pipeline(PCA(n_components=30), your_classifier)`, and you can use it as your own classifier. However, note that a projection is a well-defined operation mathematically, and it may be better to use the term that way to avoid misunderstanding: https://en.wikipedia.org/wiki/Projection_%28linear_algebra%29 – eickenberg Apr 12 '16 at 09:55
  • You are right, I will get my terminology together for next time. And use the pipeline. Thanks for the help. – HonzaB Apr 12 '16 at 10:01
  • Forget what I said about the projection - both steps 2 and 3 are projections, just represented in different ways. – eickenberg Apr 12 '16 at 13:54
  • Super awesome explanatory answer – WestCoastProjects Feb 16 '18 at 20:48
  • Just wanted to say that `assert_array_almost_equal(VT[:30], pca.components_)` is not always true. In the implementation of PCA the signs are shuffled around between U and V. To mimic this shuffling, replace `U, S, VT = np.linalg.svd(Xtrain - Xtrain.mean(0))` by `U, S, VT = np.linalg.svd(Xtrain - Xtrain.mean(0), full_matrices=False)` and insert `from sklearn.utils.extmath import svd_flip` followed by `U, VT = svd_flip(U, VT)`. – Stan Jan 31 '19 at 16:19
  • Does `X_train` in `loss = ((X_train - X_projected) ** 2).mean()` replace the `Xtrain` variable defined earlier in the code? – Josmoor98 Feb 28 '19 at 17:50
  • Yeah that was a typo. Feel free to edit (probably Xtrain -> X_train would be best) – eickenberg Mar 01 '19 at 00:12
  • Thanks eickenberg. Great explanation as well, helped me out massively! – Josmoor98 Mar 01 '19 at 11:45
  • Model answer in every sense – jtlz2 Jun 09 '21 at 12:22
  • I think the last part of the answer is wrong: the squared loss should not be `((data_projected - data_candidate) ** 2).mean()`. Instead, one should first calculate the distance per data point, then take the square of each distance, take the sum of those and divide by the number of data points. This can be done, e.g., via `np.sum((data_projected - data_candidate) ** 2, axis=1).mean()`. The answer here is, in effect, additionally dividing by the dimension of the data as well. – Make42 Jun 29 '21 at 15:27
  • How do you get the inverse transform when `whiten=True`? I expect there must be some multiplication by the eigenvalues or so? – Isi Feb 08 '23 at 09:58
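As a follow-up to the projection property mentioned in the comments, here is a minimal sketch (reusing pca, X_train, and assert_array_almost_equal from the answer's code above) that checks the idempotence f(f(X)) == f(X):

# project into signal space: map to loadings and back
def project(X):
    return pca.inverse_transform(pca.transform(X))

X_once = project(X_train)
X_twice = project(X_once)

# applying the projection a second time changes nothing
assert_array_almost_equal(X_once, X_twice)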

Adding on to @eickenberg's post, here is how to do the PCA reconstruction of the digits images:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn import decomposition

n_components = 10
image_shape = (8, 8)

digits = load_digits()
digits = digits.data

n_samples, n_features = digits.shape
estimator = decomposition.PCA(n_components=n_components, svd_solver='randomized', whiten=True)
digits_recons = estimator.inverse_transform(estimator.fit_transform(digits))

# show 5 randomly chosen digits and their PCA reconstructions with 10 dominant eigenvectors
indices = np.random.choice(n_samples, 5, replace=False)
plt.figure(figsize=(5,2))
for i in range(len(indices)):
    plt.subplot(1,5,i+1), plt.imshow(np.reshape(digits[indices[i],:], image_shape)), plt.axis('off')
plt.suptitle('Original', size=25)
plt.show()
plt.figure(figsize=(5,2))
for i in range(len(indices)):
    plt.subplot(1,5,i+1), plt.imshow(np.reshape(digits_recons[indices[i],:], image_shape)), plt.axis('off')
plt.suptitle('PCA reconstructed with {} components'.format(n_components), size=25)
plt.show()

[Figure: the five original digits]

[Figure: the PCA reconstructions with 10 components]
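To tie this back to the reconstruction MSE asked about in the question, the same quantity can be computed for the digits, e.g. using digits and digits_recons from the code above:

# per-entry mean squared reconstruction error for the digits
mse = np.mean((digits - digits_recons) ** 2)
print("MSE = %.6G" % mse)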

– Sandipan Dey