
When performing PCA on a dataset in Python, the explained_variance_ratio_ will show us the different variances for each feature in our dataset.

How do we know which column corresponds to which of the resulting variances?

Context: I'm working on a project and I need to know which components give us 90% of the variance with PCA so that we can perform stepwise feature selection later on.

from sklearn.decomposition import PCA
pcaObj = PCA(n_components=None)
X_train = pcaObj.fit_transform(X_train)
X_test = pcaObj.transform(X_test)
components_variance = pcaObj.explained_variance_ratio_
print(sum(components_variance))
print(components_variance)
redwytnblak
  • There is **not** a 1-to-1 correspondence between the PCs and the original features; all features contribute to each and every principal component. – desertnaut May 03 '20 at 10:04

2 Answers


The pca.explained_variance_ratio_ attribute gives you an array holding the fraction of the total variance explained by each principal component. Therefore, pca.explained_variance_ratio_[i] gives you the fraction explained by the (i+1)th component.

I don't believe there is a way to match the variance directly with the 'name' of a column, but going through the variance array in a for loop and noting the index at which the cumulative variance reaches 90% tells you how many components to keep. Keep in mind that this index refers to a principal component rather than to a single original column.
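The loop over the variance array can be sketched with a cumulative sum. This is a minimal sketch, not a definitive implementation: the helper name components_for_variance and the example ratios are hypothetical, and in practice the array would come from pcaObj.explained_variance_ratio_.

```python
import numpy as np


def components_for_variance(ratios, threshold=0.90):
    """Return how many leading components are needed for the cumulative
    explained-variance ratio to reach `threshold` (hypothetical helper)."""
    cumulative = np.cumsum(ratios)
    # First index whose cumulative ratio is >= threshold.
    idx = int(np.searchsorted(cumulative, threshold))
    if idx >= len(cumulative):
        raise ValueError("threshold is never reached by the given ratios")
    return idx + 1  # convert 0-based index to a component count


# Hypothetical ratios standing in for pcaObj.explained_variance_ratio_
example_ratios = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
n_keep = components_for_variance(example_ratios)
print(n_keep)
```

As far as I know, scikit-learn can also do this selection itself if you pass a float between 0 and 1 as n_components, for example PCA(n_components=0.90, svd_solver='full').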

desertnaut
tersrth
  • Thank you for the explanation. Is there a way to know the actual name of the column(s) that correspond to each variance supplied? – redwytnblak May 03 '20 at 04:23
  • As a follow up to your edit: do you know how we do that? Knowing the column number works fine for our purposes. – redwytnblak May 03 '20 at 04:33

Edit: I found a similar question: Recovering features names of explained_variance_ratio_ in PCA with sklearn

The answers there are richer and give more detailed explanations. I have flagged this question as a duplicate but will leave this answer up for the time being.

I believe you can get the values with:

pd.DataFrame(pcaObj.components_.T, index=X_train.columns)

If X_train is a NumPy array rather than a DataFrame, pass in the feature names as a list, in the same order as the original columns.

pd.DataFrame(pcaObj.components_.T, index=['column_a','column_b','column_c'], columns =['PC-1', 'PC-2'])

# column_x are the names of the features

.components_ should return the values you need. We can place them in a pandas DataFrame with column names.
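Here is a self-contained sketch of the same idea, assuming a NumPy input. The random data and the feature names are made up for illustration; only PCA, components_ and the pandas calls come from the libraries themselves.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples, 3 features, stored as a plain NumPy array.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
feature_names = ['column_a', 'column_b', 'column_c']  # original column order

pca = PCA(n_components=2).fit(X)

# components_ has shape (n_components, n_features); transpose it so that
# rows are the original features and columns are the principal components.
loadings = pd.DataFrame(
    pca.components_.T,
    index=feature_names,
    columns=[f'PC-{i + 1}' for i in range(pca.n_components_)],
)
print(loadings)

# Feature with the largest absolute weight in each component.
top_feature = loadings.abs().idxmax()
print(top_feature)
```

Each column shows how much every original feature contributes to that component, so every feature generally appears in every component with some weight.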

Prayson W. Daniel
  • thank you very much. I ran the line you provided and I got: 'numpy.ndarray' object has no attribute 'columns' I'm assuming I'd need to know the attribute of numpy arrays that correspond to columns. Would you happen to know that? No worries if not. Thank you! – redwytnblak May 03 '20 at 04:44
  • That is because your X_train is a NumPy array and not a DataFrame. Pass in the columns as a list then. You could also add columns=['PC-1','PC-2'] – Prayson W. Daniel May 03 '20 at 04:46