
I am trying to reduce the feature dimensions using PCA. I have been able to apply PCA to my training data, but am struggling to understand why the reduced feature set (X_train_pca) shares no values with the original features (X_train).

print(X_train.shape) # (26215, 727)
pca = PCA(0.5)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
print(X_train_pca.shape) # (26215, 100)

most_important_features_indices = [np.abs(pca.components_[i]).argmax() for i in range(pca.n_components_)]
most_important_feature_index = most_important_features_indices[0]

Should the first feature vector in X_train_pca not be just a subset of the first feature vector in X_train? For example, why doesn't the following equal True?

print(X_train[0][most_important_feature_index] == X_train_pca[0][0]) # False

Furthermore, none of the features from the first feature vector of X_train are in the first feature vector of X_train_pca:

for i in X_train[0]:
    print(i in X_train_pca[0])
# False
# False
# False
# ...

2 Answers


PCA transforms your high-dimensional feature vectors into low-dimensional feature vectors. It does not simply determine the least important dimensions in the original space and drop them: every coordinate of a transformed vector is a weighted combination of all the original features.

  • So if I use PCA on my offline training data and train a model with the reduced feature set, will I then be unable to transform the input during online inference? – Espresso Sep 27 '19 at 15:50
  • @SoftwareStudent123 PCA computes a transformation matrix from the original space to the reduced space. You use the same matrix to transform inference input into the reduced space. – peer Sep 27 '19 at 16:04
  • Ahh, I understand now. My only remaining question, then, is how I get this transformation matrix so that I can transform my inference input. – Espresso Sep 27 '19 at 17:08
  • @SoftwareStudent123 I think you can just use `pca.transform` again; that should apply the matrix for you. But I highly recommend that you read up on the math behind PCA. – peer Sep 27 '19 at 17:44
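To make the comment thread above concrete, here is a minimal sketch (the synthetic X_train and x_new below are hypothetical stand-ins for the real data): the fitted PCA object stores the learned transformation, and pca.transform applies it to any new input by centering with pca.mean_ and projecting onto pca.components_.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the real 727-dimensional training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 727))

pca = PCA(0.5)  # keep enough components to explain 50% of the variance
pca.fit(X_train)

# At inference time, reuse the SAME fitted object on new input
x_new = rng.normal(size=(1, 727))   # hypothetical inference sample
x_new_pca = pca.transform(x_new)    # shape: (1, pca.n_components_)

# transform() is just centering followed by a linear projection
manual = (x_new - pca.mean_) @ pca.components_.T
print(np.allclose(x_new_pca, manual))  # True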

This is normal since the PCA algorithm applies a transformation to your data:

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. (https://en.wikipedia.org/wiki/Principal_component_analysis#Dimensionality_reduction)

Run the following code sample to see the effect of the PCA algorithm on a simple Gaussian data set.

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

pca = PCA(2)  # keep both components, so the data is only rotated, not reduced

# 1,000 samples from a correlated 2-D Gaussian
X = np.random.multivariate_normal(mean=np.array([0, 0]), cov=np.array([[1, 0.75], [0.75, 1]]), size=(1000,))

# fit_transform projects the samples onto their principal axes, decorrelating them
X_new = pca.fit_transform(X)

plt.scatter(X[:, 0], X[:, 1], s=5, label='Initial data')
plt.scatter(X_new[:, 0], X_new[:, 1], s=5, label='Transformed data')
plt.legend()
plt.show()
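In the resulting plot, the transformed cloud is rotated so that its first axis lies along the direction of greatest variance in the original data, and each transformed coordinate is a linear combination of both original features. That is why the values in X_train_pca do not match any of the values in X_train.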