
I have data with a mix of continuous and categorical variables. I plan to one-hot encode the categorical variables, scale the dataset (mean=0, std=1), and then perform PCA to reduce the number of dimensions. Should I scale the one-hot encoded variables as well before doing PCA, or only the continuous ones? I will be using the Python scikit-learn package for this.
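
For concreteness, here is a rough sketch of the kind of pipeline I have in mind (column names and `n_components` are placeholders for my real data; this variant scales only the continuous columns, which is exactly the part I am unsure about):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column names; the real data has many more of each type.
continuous_cols = ["age", "income"]
categorical_cols = ["city", "occupation"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), continuous_cols),  # mean=0, std=1
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=5)),
])

# X is a DataFrame containing the columns above:
# X_reduced = pipe.fit_transform(X)
```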

user828647
  • When you say scale the dataset, do you mean the complete dataset, or only the columns which are not one-hot encoded? – Vivek Kumar May 22 '18 at 06:57
  • That is the question actually: should I scale only the continuous variables or the entire dataset (including the one-hot encoded variables)? – user828647 May 22 '18 at 09:09
  • Why do you want to scale the features **before** PCA? Is PCA sensitive to differences in the scale of the input features? I would imagine you would want to do scaling **after** PCA, to make the inputs digestible by ML models that rely on a distance measure. – Mischa Lisovyi May 22 '18 at 09:14
  • 1
    Maybe this can help:https://www.kaggle.com/general/21449 – Vivek Kumar May 22 '18 at 09:41
  • Thanks @VivekKumar. Sorry for the extra confusion. – Mischa Lisovyi May 22 '18 at 12:41

1 Answer


I think this answer to a similar question on SO is relevant. There is also a general discussion on StackExchange: https://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont. However, it only introduces an R package.

The only Python package I was able to find is this one: https://github.com/MaxHalford/prince. Note that it is a personal project, so one should not expect extensive support beyond the maintainer's free time. Within this package, FAMD is the relevant tool, which is at the moment under construction/debugging.
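
As a rough illustration of what that might look like once it stabilizes (the exact FAMD API may differ between versions of prince, so treat this as a sketch rather than a tested recipe; the data frame below is a toy example):

```python
import pandas as pd
import prince  # https://github.com/MaxHalford/prince

# Toy mixed-type frame; FAMD works on the raw columns directly,
# so no manual one-hot encoding or scaling is done beforehand.
df = pd.DataFrame({
    "age":    [23, 45, 31, 52, 38],
    "income": [40_000, 85_000, 62_000, 91_000, 58_000],
    "city":   ["NY", "LA", "NY", "SF", "LA"],
})

famd = prince.FAMD(n_components=2, random_state=42)
famd = famd.fit(df)
# Coordinates of the rows in the reduced space:
# reduced = famd.transform(df)
```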

Mischa Lisovyi