Principal components analysis using pandas dataframe

Question

How can I run a principal component analysis (PCA) on data in a pandas DataFrame?

I guess you too are trying to modify the w3schools example :) — Sridhar Sarnobat, Jul 23 '23 at 01:07

Akavall · Answer 1 · 2017-08-03T03:13:33.923

105

Most sklearn objects work with pandas DataFrames just fine. Would something like this work for you?

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

df = pd.DataFrame(data=np.random.normal(0, 1, (20, 10)))

pca = PCA(n_components=5)
pca.fit(df)

You can access the components themselves with:

pca.components_
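
As a rough sketch of what the fitted object exposes (the seed and the printed attributes are my own choices for illustration, not part of the answer above), `components_` holds one row per principal component and one column per original feature:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Seeded random data so the run is repeatable (the seed is an arbitrary choice)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 1, (20, 10)))

pca = PCA(n_components=5)
pca.fit(df)

# One row per principal component, one column per original feature
print(pca.components_.shape)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
```

With 10 input columns and `n_components=5`, the shape printed should be 5 rows by 10 columns.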

edited Aug 03 '17 at 03:13

answered Apr 25 '14 at 00:42

Akavall


28

This works great. Just an addition that might be of interest: it's often convenient to end up with a DataFrame as well, as opposed to an array. To do that, one would do something like `pandas.DataFrame(pca.transform(df), columns=['PCA%i' % i for i in range(n_components)], index=df.index)`, where I've set `n_components=5`. Also, you have a typo in the text above the code: "panadas" should be "pandas". :) – Moot Aug 03 '17 at 01:56
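
Spelled out as a runnable sketch, the suggestion in this comment might look like the following. The row labels and the seed are invented for illustration; the point is only that the original index survives the transform.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

n_components = 5
rng = np.random.default_rng(0)

# Hypothetical row labels, to show that the index is carried over
df = pd.DataFrame(rng.normal(0, 1, (20, 10)),
                  index=[f"row{i}" for i in range(20)])

pca = PCA(n_components=n_components)

# fit_transform returns a plain array; wrap it to restore labels
scores = pd.DataFrame(
    pca.fit_transform(df),
    columns=[f"PCA{i}" for i in range(n_components)],
    index=df.index,
)
print(scores.head())
```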
4

In my case I wanted the components, not the transform, so using @Moot's syntax I wrote `df = pandas.DataFrame(pca.components_)`. One last note: if you are going to use this new `df` in a dot product, make sure to check out this link: [https://stackoverflow.com/questions/16472729/matrix-multiplication-in-pandas/16473007] – rajan Oct 09 '19 at 03:43
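
A minimal sketch of the dot-product caveat, assuming the goal is to project the data onto the components by hand (the variable names here are hypothetical). `DataFrame.dot` aligns on labels, so the columns of the left operand have to match the index of the right operand, which is why the components frame is transposed:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 1, (20, 10)))

pca = PCA(n_components=5).fit(df)

# Rows are components, columns are the original features
components = pd.DataFrame(pca.components_, columns=df.columns)

# PCA centers the data internally, so subtract the fitted mean first
centered = df - pca.mean_

# Columns of `centered` line up with the index of `components.T`
projected = centered.dot(components.T)

# Should agree with pca.transform up to floating-point error
print(np.allclose(projected.to_numpy(), pca.transform(df)))
```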

NL23codes · Answer 2 · 2021-08-04T21:29:16.393

8

import pandas
from sklearn.decomposition import PCA
import numpy
import matplotlib.pyplot as plot

df = pandas.DataFrame(data=numpy.random.normal(0, 1, (20, 10)))

# Standardize each column (zero mean, unit variance) before fitting;
# PCA centers the data itself but does not rescale the features
df_normalized=(df - df.mean()) / df.std()
pca = PCA(n_components=df.shape[1])
pca.fit(df_normalized)

# Reformat and view results
loadings = pandas.DataFrame(
    pca.components_.T,
    columns=['PC%s' % i for i in range(len(df_normalized.columns))],
    index=df.columns,
)
print(loadings)

plot.plot(pca.explained_variance_ratio_)
plot.ylabel('Explained Variance')
plot.xlabel('Components')
plot.show()

edited Aug 04 '21 at 21:29

answered Aug 01 '21 at 21:36

NL23codes


The whiten=True argument to PCA does the normalization for you, if you need it at all. – leitungswasser Jun 02 '23 at 15:05
When in doubt, normalize; otherwise you could have two different scales in your data. For example, if you had age in one column and population in another, those are two different measurement scales and would need to be normalized before running PCA. – NL23codes Jun 03 '23 at 19:20
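
To make the scaling question concrete, here is a minimal sketch with two made-up columns on very different scales (the column names, distributions, and seed are invented for illustration). As I read the scikit-learn documentation, `whiten=True` rescales the projected components to unit variance and does not rescale the input columns, so treat the comparison below as a check on that reading and not as a definitive statement:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: two independent columns on very different scales
df = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "population": rng.normal(1e6, 2e5, 200),
})

# Without scaling, the large-scale column dominates the first component
raw = PCA(n_components=2).fit(df)
print(raw.explained_variance_ratio_)

# whiten=True affects the scale of the transformed output, not the fit
whitened = PCA(n_components=2, whiten=True).fit(df)
print(whitened.explained_variance_ratio_)

# Standardizing first (same formula as in the answer) evens things out
df_normalized = (df - df.mean()) / df.std()
std = PCA(n_components=2).fit(df_normalized)
print(std.explained_variance_ratio_)
```

If the first two printed ratios match and only the third is balanced, that suggests standardizing the columns is the step that addresses mismatched scales.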
