
I am using the PCA() implementation from sklearn on a DataFrame that has 200 features. The DataFrame was created with this code:

df = data.pivot_table(index='customer', columns='purchase', values='amount', aggfunc=sum)
df = df.reset_index().rename_axis(None, axis=1)
df = df.fillna(value=0)

Then I applied PCA():

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
p = pca.fit(df)
sum(pca.explained_variance_ratio_)

In the end, I obtained the result below:

0.99999940944358268

Am I wrong, or is it implausible for a single component out of 200 to explain essentially all of the variance?


More Questions

  • In short, is my data actually leaning on just one feature?
  • What could be causing this?
  • Does summing the values of the features for each customer prior to running PCA affect this?
  • How should I restructure my data to overcome this seeming error?
  • https://stackoverflow.com/questions/22984335/recovering-features-names-of-explained-variance-ratio-in-pca-with-sklearn. I think this would be a great place to refer to; it will give you a good understanding of PCA. – i.n.n.m Aug 08 '17 at 22:14
  • Thank you! Definitely an informative question - I learned a ton. I also noticed I didn't normalize/scale the data first (a good idea) or transform after fitting the model (I'm assuming, a must?). – OverflowingTheGlass Aug 08 '17 at 22:21
  • Yes, you have to standardize the data using `preprocessing`, which is a requirement for many machine learning estimators in scikit-learn. I hate to copy and paste; hope you found what you need. – i.n.n.m Aug 08 '17 at 22:26

1 Answer


You should read more about Principal Component Analysis in these sources:


Is it implausible for a single component out of 200 to explain essentially all of the variance?

It is entirely possible, with a large number of features, for a single component to capture nearly all of the explained variance. For that to happen the features must be highly correlated with each other (see the sketch after this list). In your case, I can think of three scenarios:

  • there are a lot of missing values, and filling them with zeros (not a state-of-the-art approach) creates artificial correlation between the columns;
  • your data really is highly correlated, so PCA() aggregates the information of the 200 features into one new feature very well;
  • or there is simply a problem with your data.
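
As a quick illustration (a minimal sketch on synthetic data, not your DataFrame), a set of highly correlated columns is enough to reproduce a result like yours:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
signal = rng.normal(size=(1000, 1))                 # one underlying driver
X = signal + 0.01 * rng.normal(size=(1000, 200))    # 200 noisy copies of it

pca = PCA(n_components=1).fit(X)
print(pca.explained_variance_ratio_.sum())          # ~0.9999, much like your output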

In short, is my data actually leaning on just one feature?

What could be causing this?

As stated above, PCA does not pick one of your original features; it constructs new ones (linear combinations of the originals) that summarize as much of the variance as possible. So your data is not actually leaning on one particular original feature.
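
If you want to see how the new feature mixes the original columns, you can inspect the component loadings. A minimal sketch, assuming pca is the fitted object from your question and X is the frame of the 200 purchase columns it was fitted on:

import pandas as pd

# One row per component, one column per original feature; large absolute
# values show which original columns dominate the component.
loadings = pd.DataFrame(pca.components_, columns=X.columns)
print(loadings.abs().iloc[0].nlargest(10))   # top 10 contributors to component 1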

I would suggest performing some data preprocessing, as an explained variance ratio of ~99% with a single component looks terribly suspicious, and could well be caused by the issues listed above. Scaling the features is a good first step, because PCA is driven by variance and a few large-valued columns can dominate it.
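
For instance, a minimal scaling sketch (assuming df is your pivoted DataFrame and that the 'customer' column still needs to be dropped before fitting):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = df.drop('customer', axis=1)               # keep only the numeric purchase columns
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per column

pca = PCA(n_components=1).fit(X_scaled)
print(pca.explained_variance_ratio_.sum())    # compare with the unscaled result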

Does summing the values of the features for each customer prior to running PCA affect this?

Almost any data manipulation affects the decomposition; the exceptions are operations like adding the same constant to every value of a feature, which change nothing because PCA centers the data first. You should apply PCA to your data before and after the aggregation step to observe the effect.
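
A minimal sketch of such a comparison, reusing the `data` frame and column names from your question (the helper name is just for illustration):

from sklearn.decomposition import PCA

def first_component_ratio(aggfunc):
    # Build the customer x purchase table with the given aggregation and fit PCA on it.
    table = data.pivot_table(index='customer', columns='purchase',
                             values='amount', aggfunc=aggfunc).fillna(0)
    return PCA(n_components=1).fit(table).explained_variance_ratio_.sum()

print(first_component_ratio('sum'))   # your current aggregation (total spend)
print(first_component_ratio('mean'))  # average spend per purchase instead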

How should I restructure my data to overcome this seeming error?

First of all, I would suggest a different approach to filling in the data: you could impute the missing values column by column using the mean or the median. Secondly, you should understand what the features actually mean and whether some of them can be dropped before the decomposition. You could also apply scaling and/or normalization techniques, but these should be tested before and after model fitting, as they also affect the model metrics.
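
A minimal sketch of such a pipeline, assuming you keep the NaNs from the pivot (i.e. skip the fillna(0) step) and that your scikit-learn version provides sklearn.impute.SimpleImputer:

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Pivot without filling, so missing purchases stay as NaN for the imputer.
table = data.pivot_table(index='customer', columns='purchase',
                         values='amount', aggfunc='sum')

pipe = make_pipeline(
    SimpleImputer(strategy='median'),   # or strategy='mean'
    StandardScaler(),                   # put all purchase columns on the same scale
    PCA(n_components=1),
)
pipe.fit(table)
print(pipe.named_steps['pca'].explained_variance_ratio_.sum())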

  • Thank you very much for the exhaustive answer. I definitely need to read/learn a LOT more. One quick question about the missing values - do you mean just imputing the mean or median for each column? I felt like 0 was the most accurate filler, because the data is amounts spent on particular items. So if a customer bought bread but didn't buy milk, I was thinking her milk value should be zero. There are a lot of zeros in my data because of this, because each customer only buys a few things out of the 200 possible options. – OverflowingTheGlass Aug 09 '17 at 01:52
  • Hm, okay. That could work then. However, if a value is missing for every customer, the corresponding column should be dropped entirely. Imagine that no one has bought milk; the milk column should then be removed from your `DataFrame`. You could also aggregate products into food categories, e.g. a dairy category and so on. Basically, it is okay to reduce the dimensionality of your data before applying the decomposition, provided you do it logically. – E.Z Aug 09 '17 at 18:48
  • Makes sense - thank you! In this specific case, features won't appear unless at least one of the samples has a corresponding value. Grouping is something I'll look into. Is removing features based on % null (e.g. 90%) a valid approach as well? – OverflowingTheGlass Aug 09 '17 at 18:50
  • It depends on the number of customers you have. 90% may be alright if you still have lots of data to train on. You should calibrate the threshold based on this fact. – E.Z Aug 09 '17 at 19:05
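
Something along these lines could work as a starting point (a sketch only; df and the 'customer' column come from the question, and the 0.9 threshold is just an example):

from sklearn.decomposition import PCA

features = df.drop('customer', axis=1)
zero_share = (features == 0).mean()            # fraction of zeros per purchase column
reduced = features.loc[:, zero_share <= 0.9]   # drop columns that are >90% zeros

pca = PCA(n_components=1).fit(reduced)
print(pca.explained_variance_ratio_.sum())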