How can I calculate Principal Components Analysis from data in a pandas dataframe?
68

benten

user3362813
-
I guess you too are trying to modify the w3schools example :) – Sridhar Sarnobat Jul 23 '23 at 01:07
2 Answers
105
Most sklearn objects work with pandas DataFrames just fine. Would something like this work for you?
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
df = pd.DataFrame(data=np.random.normal(0, 1, (20, 10)))
pca = PCA(n_components=5)
pca.fit(df)
You can access the components themselves with
pca.components_
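As a minimal sketch of what the fitted object exposes (the fixed seed and the variable names are illustrative assumptions, not part of the answer above), the shape of `pca.components_` and the share of variance per component can be inspected like this:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
df = pd.DataFrame(rng.normal(0, 1, (20, 10)))

pca = PCA(n_components=5)
pca.fit(df)

# One row per component, one column per original feature
components_shape = pca.components_.shape

# Fraction of the total variance captured by each component,
# sorted from largest to smallest
variance_ratio = pca.explained_variance_ratio_

print(components_shape)
print(variance_ratio)
```

With 10 input columns and `n_components=5`, `pca.components_` has 5 rows and 10 columns, so each row is one principal axis expressed in terms of the original features.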

Akavall
-
This works great. Just an addition that might be of interest: it's often convenient to end up with a DataFrame as well, as opposed to an array. To do that one would do something like: `pandas.DataFrame(pca.transform(df), columns=['PCA%i' % i for i in range(n_components)], index=df.index)`, where I've set `n_components=5`. Also, you have a typo in the text above the code, "panadas" should be "pandas". :) – Moot Aug 03 '17 at 01:56
-
In my case I wanted the components, not the transform, so taking @Moot's syntax I used `df = pandas.DataFrame(pca.components_)`. One last note: if you are going to use this new `df` in a dot product, make sure to check out this link: [https://stackoverflow.com/questions/16472729/matrix-multiplication-in-pandas/16473007] – rajan Oct 09 '19 at 03:43
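The two comments above can be sketched together. This is only an illustration under some assumptions: the column names `x0`..`x9`, the `PC1`..`PC5` labels, and the fixed seed are hypothetical. It wraps both the transformed data and the components in DataFrames, then reproduces the transform with a label-aligned dot product:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 1, (20, 10)),
                  columns=[f"x{i}" for i in range(10)])  # hypothetical column names

n_components = 5
pca = PCA(n_components=n_components).fit(df)
pc_names = [f"PC{i + 1}" for i in range(n_components)]  # hypothetical labels

# Projected data (scores): one row per original row, keeping the original index
scores = pd.DataFrame(pca.transform(df), columns=pc_names, index=df.index)

# Components: one row per component, one column per original feature
components = pd.DataFrame(pca.components_, index=pc_names, columns=df.columns)

# The scores can be reproduced by centering the data and taking a dot product.
# DataFrame.dot aligns on labels, so the columns of the left operand must
# match the index of the right operand (hence the transpose).
manual_scores = (df - pca.mean_).dot(components.T)

print(np.allclose(manual_scores, scores))
```

Labeling the rows and columns of `components` is what makes the dot product safe here: `DataFrame.dot` raises an error when the labels do not line up instead of silently multiplying mismatched axes.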
8
import pandas
from sklearn.decomposition import PCA
import numpy
import matplotlib.pyplot as plot
df = pandas.DataFrame(data=numpy.random.normal(0, 1, (20, 10)))
# Standardize each column (zero mean, unit variance) so that features on
# different scales contribute equally; PCA itself only centers the data
df_normalized = (df - df.mean()) / df.std()
pca = PCA(n_components=df.shape[1])
pca.fit(df_normalized)
# Reformat and view results
loadings = pandas.DataFrame(pca.components_.T,
                            columns=['PC%s' % i for i in range(len(df_normalized.columns))],
                            index=df.columns)
print(loadings)
plot.plot(pca.explained_variance_ratio_)
plot.ylabel('Explained Variance')
plot.xlabel('Components')
plot.show()

NL23codes
-
The whiten=True argument to PCA does the normalization for you, if you need it at all. – leitungswasser Jun 02 '23 at 15:05
-
When in doubt, normalize; otherwise columns measured on different scales will not contribute equally. For example, if you had age in one column and population in another, those are two different measurement scales, and the data would need to be normalized before running PCA. – NL23codes Jun 03 '23 at 19:20
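To make the scaling question concrete, here is a hedged sketch using made-up `age` and `population` columns (the data, seed, and column names are assumptions for illustration). In scikit-learn, `whiten=True` rescales the projected outputs to unit variance rather than the input columns, so it does not replace standardizing the features; `StandardScaler` in a pipeline is one way to do that step:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data on very different scales
df = pd.DataFrame({
    "age": rng.normal(40, 12, 200),
    "population": rng.normal(5e5, 2e5, 200),
})

# Without scaling, the large-scale column dominates the first component
raw = PCA(n_components=2).fit(df)

# whiten=True rescales the projected scores, not the input columns,
# so the explained-variance split is unchanged
whitened = PCA(n_components=2, whiten=True).fit(df)

# Standardizing first puts both columns on an equal footing
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(df)
scaled_ratio = scaled.named_steps["pca"].explained_variance_ratio_

print(raw.explained_variance_ratio_)
print(whitened.explained_variance_ratio_)
print(scaled_ratio)
```

`StandardScaler` divides by the population standard deviation while `df.std()` uses the sample one, but that is the same constant factor for every column, so the components and variance ratios come out the same either way.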