68

How can I calculate Principal Components Analysis from data in a pandas dataframe?

benten
user3362813

2 Answers

105

Most sklearn objects work with pandas DataFrames just fine. Would something like this work for you?

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

df = pd.DataFrame(data=np.random.normal(0, 1, (20, 10)))

pca = PCA(n_components=5)
pca.fit(df)

You can access the components themselves with

pca.components_ 
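
As a rough sketch of what the fitted object exposes (the fixed seed and the variable names `components_shape` and `variance_ratio` are my own additions, not part of the answer above): `components_` has one row per component and one column per original column, and `explained_variance_ratio_` gives the share of variance each component captures.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Fixed seed so the run is repeatable (an assumption, the answer above uses unseeded data)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(0, 1, (20, 10)))

pca = PCA(n_components=5)
pca.fit(df)

# One row per component, one column per original column
components_shape = pca.components_.shape

# One value per component, sorted from largest to smallest
variance_ratio = pca.explained_variance_ratio_

print(components_shape)
print(variance_ratio)
```

Since only 5 of the 10 possible components are kept here, the ratios will generally sum to less than 1.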
Akavall
  • This works great. Just an addition that might be of interest: it's often convenient to end up with a DataFrame as well, as opposed to an array. To do that one would do something like: pandas.DataFrame(pca.transform(df), columns=['PCA%i' % i for i in range(n_components)], index=df.index), where I've set n_components=5. Also, you have a typo in the text above the code, "panadas" should be "pandas". :) – Moot Aug 03 '17 at 01:56
  • In my case I wanted the components, not the transform, so taking @Moot's syntax I used `df = pandas.DataFrame(pca.components_)`. One last note: if you are going to use this new `df` in a dot product, make sure to check out this link: [https://stackoverflow.com/questions/16472729/matrix-multiplication-in-pandas/16473007] – rajan Oct 09 '19 at 03:43
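
The two comments above can be sketched together roughly as follows. The names `scores`, `components`, `pc_names` and `manual`, and the fixed seed, are my own choices for illustration. The last step assumes the default `whiten=False`, in which case the transform is the centered data multiplied by the transposed components.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)  # fixed seed, an assumption for repeatability
df = pd.DataFrame(rng.normal(0, 1, (20, 10)))

n_components = 5
pca = PCA(n_components=n_components).fit(df)
pc_names = ['PCA%i' % i for i in range(n_components)]

# Projected data: one row per original row, one column per component
scores = pd.DataFrame(pca.transform(df), columns=pc_names, index=df.index)

# Components: one row per component, one column per original column
components = pd.DataFrame(pca.components_, index=pc_names, columns=df.columns)

# Reproduce the transform by hand. Dropping to NumPy arrays avoids
# pandas trying to align row and column labels in the dot product.
manual = (df - pca.mean_).to_numpy() @ components.to_numpy().T

print(np.allclose(manual, scores.to_numpy()))
```

Converting to arrays before multiplying is one way around the label-alignment issue the linked question discusses; `DataFrame.dot` also works if the labels already line up.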
8
import pandas
from sklearn.decomposition import PCA
import numpy
import matplotlib.pyplot as plot

df = pandas.DataFrame(data=numpy.random.normal(0, 1, (20, 10)))

# Standardize the columns before fitting: PCA centers the data itself but
# does not rescale it, so columns on larger scales would otherwise dominate
df_normalized = (df - df.mean()) / df.std()
pca = PCA(n_components=df.shape[1])
pca.fit(df_normalized)

# Reformat and view results
loadings = pandas.DataFrame(
    pca.components_.T,
    columns=['PC%s' % i for i in range(len(df_normalized.columns))],
    index=df.columns,
)
print(loadings)

plot.plot(pca.explained_variance_ratio_)
plot.ylabel('Explained Variance Ratio')
plot.xlabel('Components')
plot.show()
NL23codes
  • The whiten=True argument to PCA does the normalization for you, if you need it at all. – leitungswasser Jun 02 '23 at 15:05
  • When in doubt, you normalize, otherwise you could have two different scales for your data. For example, if you had age in one column and population in another, those are two different measure scales and would need to be normalized in order to run PCA. – NL23codes Jun 03 '23 at 19:20
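
To make the scaling discussion concrete, here is a small sketch with made-up data: the `age` and `population` columns, their distributions and the seed are hypothetical, chosen only to mimic the example in the comment above. It uses `StandardScaler` from `sklearn.preprocessing` as one way to standardize. As far as I understand it, `whiten=True` rescales the projected outputs to unit variance rather than the input columns, so it is not a substitute for standardizing the inputs.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two independent columns on very different scales
df = pd.DataFrame({
    'age': rng.normal(40, 10, 200),
    'population': rng.normal(1e6, 2e5, 200),
})

# Without scaling, the large-variance column dominates the first component
raw_ratio = PCA(n_components=2).fit(df).explained_variance_ratio_

# Standardize the inputs first (StandardScaler divides by the ddof=0 std)
scaled = StandardScaler().fit_transform(df)
scaled_ratio = PCA(n_components=2).fit(scaled).explained_variance_ratio_

# whiten=True changes the scale of the transformed output only, so the
# explained variance ratio is the same as for the unscaled fit
whitened_ratio = PCA(n_components=2, whiten=True).fit(df).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
print(whitened_ratio)
```

On the raw data nearly all of the variance lands on the first component, simply because `population` has much larger numbers. After standardizing, the two independent columns share the variance roughly evenly.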