I have a large data frame (~1400 rows) with the following columns:

    protein  IHD       CM        ARR       VD        CHD       CCD       VOO
0   q9uku9   0.000000  0.039457  0.032901  0.014793  0.006614  0.006591  0.000000
1   o75461   0.000000  0.005832  0.027698  0.000000  0.000000  0.006634  0.000000

etc.

I want to perform a PCA analysis and plot with the vectors, but I'm not sure how to do so with such a large data set. Does anyone have any suggestions?

Marissa P

1 Answer

Actually a 1400 x 8 dataframe is not that big on modern computers. You can use scikit-learn to perform PCA on your dataset. It is relatively simple:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
cols = ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']
df = pd.DataFrame(np.random.random((1400, 7)), columns = cols)
pca = PCA(n_components=2)
pca.fit(df)
print(pca.components_)
print(pca.explained_variance_)

# [[-0.38406974  0.02775874 -0.59754361 -0.55464116 -0.03878488
#   -0.41944628 0.09795539]
#  [-0.03181143 -0.52699813  0.14325425  0.02742668 -0.48571934 
#   -0.33915335 0.590795  ]]
# [0.0913989  0.08975106]

You cannot plot the principal components themselves, since they are vectors in a 7-dimensional space. What you can do, as long as you keep no more than two components (or three, for a 3D plot), is plot the transformed dataset:

df2 = pd.DataFrame(pca.transform(df), columns = ['first', 'second'])
df2.plot.scatter(x = 'first', y = 'second')
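
If what you are after is the "plot with the vectors" from the question, one common option is a biplot: the scatter of the two scores, with one arrow per original column showing how that column projects onto the first two components. This is only a rough sketch, assuming matplotlib is available. The `biplot` helper name and the arrow scaling are my own choices for illustration, and the data is the same kind of random stand-in as above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


def biplot(df):
    """Scatter the first two PC scores and overlay one arrow per original column."""
    pca = PCA(n_components=2)
    scores = pca.fit_transform(df)   # shape (n_rows, 2)
    loadings = pca.components_.T     # shape (n_columns, 2)

    fig, ax = plt.subplots()
    ax.scatter(scores[:, 0], scores[:, 1], s=5, alpha=0.3)

    # Arbitrary scaling so the arrows are visible next to the point cloud.
    scale = np.abs(scores).max()
    for name, (x, y) in zip(df.columns, loadings * scale):
        ax.arrow(0, 0, x, y, color="red", head_width=0.02 * scale)
        ax.text(x * 1.1, y * 1.1, name, color="red")

    ax.set_xlabel("first")
    ax.set_ylabel("second")
    return fig, ax, loadings


# Random stand-in data, not the real protein table.
cols = ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']
df = pd.DataFrame(np.random.random((1400, 7)), columns=cols)
fig, ax, loadings = biplot(df)
```

Call `plt.show()` (or `fig.savefig(...)`) afterwards to see the figure. Each arrow is the 2-D shadow of a 7-dimensional direction, so arrow lengths are only comparable to each other, not to the axis units.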

As you may have noticed, I did not include the protein column when doing PCA. The reason is that PCA only works properly with numerical columns. See this discussion for some hints on handling categorical columns.
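
With the real table, one way to keep the protein identifiers attached to the result without feeding them to PCA is to move them into the index first. A small sketch under assumptions: the data below is random, and the `p0000`-style IDs are placeholders standing in for the real protein names.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cols = ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']

# Random stand-in for the real frame; the protein IDs are made-up placeholders.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1400, 7)), columns=cols)
df.insert(0, 'protein', [f'p{i:04d}' for i in range(len(df))])

# Move the identifier into the index so only numeric columns go into PCA.
X = df.set_index('protein')[cols]

pca = PCA(n_components=2)
scores = pd.DataFrame(pca.fit_transform(X), index=X.index,
                      columns=['first', 'second'])
print(scores.head())
```

Each row of `scores` is still labelled with its protein, so you can look up or annotate individual points later.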

aprospero
  • this makes a lot of sense, but how do I separate the variance for the individual categories? like if I wanted 8 different vectors in the plot? – Marissa P Mar 24 '21 at 20:58
  • Not sure I got your question. Do you want to plot the principal components? PCA is a dimensionality reduction technique, meaning that it reduces the number of dimensions of your dataset (i.e. the number of columns). To do that, it projects the data onto some carefully chosen vectors, which are called principal components. The principal components are derived by identifying the directions along which the variance of the data is largest. However, you cannot plot the principal components directly, since they live in the same space as your original data. – aprospero Mar 24 '21 at 21:10
  • https://www.askpython.com/python/examples/principal-component-analysis like in this example, it seems as though there are multiple components plotted? Maybe I'm interpreting it incorrectly – Marissa P Mar 24 '21 at 21:15
  • Yes, in the first plot the two principal components are plotted along with the data. However, note that this can be done because the points live in a 2-dimensional space, so the principal components live in a 2-dimensional space as well. In your dataset, the original numeric data lives in a 7-dimensional space (one dimension per numeric column), so the principal components also live in a 7-dimensional space, which cannot be plotted directly. – aprospero Mar 24 '21 at 21:18
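
To make the point from the comments concrete, here is a small check (again on random stand-in data, so the numbers mean nothing for the real proteins): each component is a unit vector with one entry per numeric column, and the spread of the data along the first component is at least as large as the spread along any single original column.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

cols = ['IHD', 'CM', 'ARR', 'VD', 'CHD', 'CCD', 'VOO']
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1400, 7)), columns=cols)  # stand-in data

pca = PCA(n_components=2).fit(df)
scores = pca.transform(df)

# One row per component, one entry per original column.
print(pca.components_.shape)  # (2, 7)

# Each component is a unit-length direction in the 7-dimensional space.
norms = np.linalg.norm(pca.components_, axis=1)

# Variance of the data projected on the first component (ddof=1, like pandas).
var_along_pc1 = scores[:, 0].var(ddof=1)
max_column_var = df.var().max()

print(np.isclose(var_along_pc1, pca.explained_variance_[0]))
print(var_along_pc1 >= max_column_var)
```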