
I have a dataset with 23 rows and 48 columns. I am applying PCA to reduce the number of column dimensions. I use the following code examples and see that only 23 components are retained:

#first
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(only_features)
plt.figure(figsize=(15, 8))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

#second
import pandas as pd

df_pca = pca.fit_transform(X=only_features)
df_pca = pd.DataFrame(df_pca)
print(df_pca.shape)

However, I would like to know which features are required. For example: if the original dataset had columns A–Z and was reduced by PCA, I would want to know which features were selected.

How to do that?

Thanks for the help.

Ken White
K C

1 Answer


Credit to this answer & answer: Sklearn's documentation states that when you don't specify the n_components parameter, the number of components retained is min(n_samples, n_features). So min(23, 48) = 23, which is why you get 23 components in your case.
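A minimal sketch of that default behavior, using random data as a stand-in for your 23×48 dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(23, 48))  # 23 rows, 48 columns, like your dataset

# n_components is left unspecified, so it defaults to min(n_samples, n_features)
pca = PCA().fit(X)
print(pca.n_components_)  # min(23, 48) = 23
```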

Solution 1: if you use the Sklearn library (credit to this answer)

  • check the variance of the PCs with: pca.explained_variance_ratio_
  • check the importance of the PCs with: print(abs(pca.components_))
  • use a customized function to extract more info about the PCs; see this answer.
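Putting those pieces together, here is a minimal sketch of mapping each PC back to the original column that loads most strongly on it. It assumes a DataFrame with named columns standing in for your only_features:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Stand-in for only_features: 23 rows, 48 named columns
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(23, 48)),
                 columns=[f'col{i}' for i in range(48)])

pca = PCA().fit(X)

# Each row of pca.components_ holds one PC's loadings on the 48 original columns.
# The column with the largest absolute loading contributes most to that PC.
top_feature_per_pc = X.columns[np.argmax(np.abs(pca.components_), axis=1)]
print(top_feature_per_pc[:5])
```

Note that PCA does not *select* original columns; every PC is a weighted mix of all of them, so this only tells you which column dominates each PC.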

Solution 2: if you use the PCA library (documentation)

# Requires the separate `pca` package: pip install pca
from pca import pca

# Initialize
model = pca()
# Fit transform
out = model.fit_transform(X)

# Print the top features. The results show that f1 is best, followed by f2, etc.
print(out['topfeat'])

#     PC feature
# 0  PC1      f1
# 1  PC2      f2
# 2  PC3      f3
# 3  PC4      f4
# 4  PC5      f5
# ...

You can even make a plot of the PCs with: model.plot()


Mario