
I was reading the post "Recovering features names of explained_variance_ratio_ in PCA with sklearn" and I wanted to understand the output of the following line of code:

pd.DataFrame(pca.components_, columns=subset.columns)

First, I thought that the PCA components from sklearn would be how much of the variance is explained by each feature (I guess this is the interpretation of PCA, right?). However, I think that this is actually wrong, and that the explained variance is given by pca.explained_variance_.

Also, the output of the dataframe constructed with the script above is very confusing to me, because it has several rows and there are also negative numbers.

Furthermore, how does the dataframe constructed above relate to the following plot:

plt.bar(range(len(pca.explained_variance_)), pca.explained_variance_)

I'm really confused about the PCA components and the variance.

If some example is needed, we might build PCA with iris dataset. This is what I've done so far:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# iris is a DataFrame of the iris data; columns 1-4 hold the four measurements
subset = iris.iloc[:, 1:5]
scaler = StandardScaler()
pca = PCA()

pipe = make_pipeline(scaler, pca)
pipe.fit(subset)

# Plot the explained variances
features = range(pca.n_components_)
_ = plt.bar(features, pca.explained_variance_)

# Dump components relations with features:
pd.DataFrame(pca.components_, columns=subset.columns)
1 Answer


In PCA, the components (components_ in sklearn) are linear combinations of the original features, chosen to maximize variance. So they are vectors that combine the input features in order to capture as much of the variance as possible.

In sklearn, as referenced here, the components_ are presented in order of their explained variance (explained_variance_), from the highest to the lowest value. So the i-th vector of components_ corresponds to the i-th value of explained_variance_.
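A short sketch on the iris data can make this concrete (loading it via sklearn's load_iris rather than the asker's DataFrame, which is an assumption here):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the four iris measurements, then fit PCA
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# components_ has shape (n_components, n_features):
# each row is a unit-length vector of weights over the original features,
# which is why individual entries can be negative
print(pca.components_.shape)                     # (4, 4)
print(np.linalg.norm(pca.components_, axis=1))   # all approximately 1.0

# Rows are ordered by explained_variance_, highest first
print(pca.explained_variance_)

# Projecting the (centered) data onto the i-th row gives the i-th principal
# component scores, whose variance equals explained_variance_[i]
scores = X @ pca.components_.T
print(scores.var(axis=0, ddof=1))
```

This also shows why the dataframe in the question has several rows: each row of pd.DataFrame(pca.components_, columns=subset.columns) is one component's weights over the original columns, not a per-feature variance.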

A useful link on PCA: https://online.stat.psu.edu/stat505/lesson/11

nunohpinheiro