
I have data with a mix of continuous and categorical variables. I plan to one-hot encode the categorical variables, scale the dataset (mean=0, std=1), and then perform PCA to reduce the number of dimensions. Should I scale the one-hot encoded variables as well before doing PCA, or only the continuous ones? I will be using the Python scikit-learn package for this.
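
For concreteness, here is a rough sketch of the kind of pipeline I have in mind (column names and `n_components` are placeholders for my real data; this variant scales only the continuous columns, which is exactly the part I am unsure about):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column names; the real data has many more of each type.
continuous_cols = ["age", "income"]
categorical_cols = ["city", "occupation"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), continuous_cols),  # mean=0, std=1
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=5)),
])

# X is a DataFrame containing the columns above:
# X_reduced = pipe.fit_transform(X)
```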

user828647
  • When you say scale the dataset, do you mean the complete dataset, or only the columns which are not one-hot encoded? – Vivek Kumar May 22 '18 at 06:57
  • That is the question actually: should I scale only the continuous variables or the entire dataset (including the one-hot encoded variables)? – user828647 May 22 '18 at 09:09
  • Why do you want to scale the features **before** PCA? Is PCA sensitive to differences in the scale of the input features? I would imagine you would want to do scaling **after** PCA, to make the inputs digestible by ML models that rely on a distance measure. – Mischa Lisovyi May 22 '18 at 09:14
  • 1
    Maybe this can help:https://www.kaggle.com/general/21449 – Vivek Kumar May 22 '18 at 09:41
  • Thanks @VivekKumar. Sorry for the extra confusion. – Mischa Lisovyi May 22 '18 at 12:41

1 Answer


I think this answer to a similar question on SO is relevant. There is also a general discussion on StackExchange: https://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont. However, it only introduces an R package.

The only Python package I was able to find is this one: https://github.com/MaxHalford/prince. Note that it is a personal project, so one should not expect extensive support beyond the maintainer's free time. Within this package, FAMD is the relevant tool, which is at the moment under construction/debugging.
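
As a rough illustration of what that might look like once it stabilizes (the exact FAMD API may differ between versions of prince, so treat this as a sketch rather than a tested recipe; the data frame below is a toy example):

```python
import pandas as pd
import prince  # https://github.com/MaxHalford/prince

# Toy mixed-type frame; FAMD works on the raw columns directly,
# so no manual one-hot encoding or scaling is done beforehand.
df = pd.DataFrame({
    "age":    [23, 45, 31, 52, 38],
    "income": [40_000, 85_000, 62_000, 91_000, 58_000],
    "city":   ["NY", "LA", "NY", "SF", "LA"],
})

famd = prince.FAMD(n_components=2, random_state=42)
famd = famd.fit(df)
# Coordinates of the rows in the reduced space:
# reduced = famd.transform(df)
```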

Mischa Lisovyi