I have data with a mix of continuous and categorical variables. I plan to one-hot encode the categorical variables, scale the dataset (mean=0, std=1), and then perform PCA to reduce the number of dimensions. Should I also scale the one-hot encoded variables in the same way before doing PCA? I will be using the Python scikit-learn package for this.
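To make the setup concrete, here is a minimal sketch of the pipeline I have in mind (column names and data are made up); the commented scaling step is exactly the part I am unsure about:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with made-up column names standing in for my real data.
df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "income": [40_000, 85_000, 52_000, 91_000],
    "city": ["london", "paris", "london", "berlin"],
})

# Step 1: one-hot encode the categorical column(s); numeric columns pass through.
# (Older scikit-learn versions use sparse=False instead of sparse_output=False.)
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(sparse_output=False), ["city"])],
    remainder="passthrough",
)

# Steps 2-3: scale and then reduce dimensionality with PCA.
pipe = Pipeline([
    ("encode", encode),
    ("scale", StandardScaler()),  # <-- should this also cover the one-hot columns?
    ("pca", PCA(n_components=2)),
])

reduced = pipe.fit_transform(df)
print(reduced.shape)  # (4, 2)
```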
- When you say scale the dataset, do you mean the complete dataset, or only the columns which are not one-hot encoded? – Vivek Kumar May 22 '18 at 06:57
- That is the question actually: should I scale only the continuous variables or the entire dataset (including the one-hot encoded variables)? – user828647 May 22 '18 at 09:09
- Why do you want to scale the features **before** PCA? Is PCA sensitive to differences in the scale of the input features? I would imagine you would want to do scaling **after** PCA, to make the inputs digestible by ML models that rely on a distance measure. – Mischa Lisovyi May 22 '18 at 09:14
- Maybe this can help: https://www.kaggle.com/general/21449 – Vivek Kumar May 22 '18 at 09:41
- Thanks @VivekKumar. Sorry for the extra confusion. – Mischa Lisovyi May 22 '18 at 12:41
1 Answer
I think this answer to a similar question on SO is relevant. There is also a general discussion on Stats StackExchange: https://stats.stackexchange.com/questions/5774/can-principal-component-analysis-be-applied-to-datasets-containing-a-mix-of-cont. However, it only introduces a package for R.
The only Python package that I was able to find is this one: https://github.com/MaxHalford/prince. Note that it is a private package, so one should not expect extensive support beyond the maintainer's free time. Within this package, FAMD is the relevant tool, which is at the moment under construction/debugging.
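For illustration, a minimal usage sketch of FAMD from prince on a mixed-type DataFrame; the constructor arguments and method names shown here are my best recollection of the prince API and may differ between versions of the package:

```python
import pandas as pd
import prince  # https://github.com/MaxHalford/prince

# Mixed-type toy data; FAMD expects categorical columns as strings/categories.
df = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "income": [40_000, 85_000, 52_000, 91_000],
    "city": ["london", "paris", "london", "berlin"],
})

# FAMD standardizes the numeric columns and dummy-encodes/weights the
# categorical ones internally, so no manual one-hot encoding or scaling
# is needed beforehand. (API may vary across prince versions.)
famd = prince.FAMD(n_components=2, random_state=42)
famd = famd.fit(df)

coords = famd.row_coordinates(df)  # rows projected into the reduced space
print(coords)
```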

Mischa Lisovyi