
My data set contains continuous variables (car price, odometer), ordinal variables (car age, sale year), and categorical variables (manufacturer, country, color). My goal is to build a model that predicts cars' sale prices on the second-hand market.

I encoded all categorical variables into dummy variables, so my data set now includes many (100+) 0-1 variables. I plan to reduce the dimensionality to speed things up. My problem is: which method should I use?
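For reference, this is roughly the encoding step described above, using pandas (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical columns mirroring the description: continuous, ordinal, categorical
df = pd.DataFrame({
    "price":        [5000, 7200, 3100, 8900],
    "odometer":     [120000, 60000, 150000, 30000],
    "age":          [10, 5, 12, 3],
    "manufacturer": ["ford", "toyota", "ford", "bmw"],
    "color":        ["red", "blue", "blue", "black"],
})

# One-hot encode the categorical columns; each level becomes its own 0-1 column,
# which is how a data set with a few categoricals ends up with 100+ dummies
encoded = pd.get_dummies(df, columns=["manufacturer", "color"])
print(encoded.shape)  # (4, 9): 3 original columns + 3 manufacturer + 3 color dummies
```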

My first choice is PCA. However, it is a method for continuous variables, so I am not sure it is appropriate when my set contains many dummy variables.

My second choice is CATPCA. To be honest, I know little about this method; it seems applicable to my data set, but I don't know how to implement it in Python.

My third idea is to split my data set and apply a different method to each part. For example, I could split it into two sets, a continuous-variable set and a dummy-variable set, use PCA on the continuous set and CATPCA on the dummy set, and then combine the two reduced sets. But I have no theory to support this idea.
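A sketch of that split-and-recombine idea with scikit-learn's `ColumnTransformer`. Note that CATPCA has no standard Python implementation, so `TruncatedSVD` stands in for it here on the dummy block; the data and component counts are invented:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_cont = rng.normal(size=(200, 3))                          # e.g. price, odometer, age
X_dumm = rng.integers(0, 2, size=(200, 20)).astype(float)   # 0-1 dummy columns
X = np.hstack([X_cont, X_dumm])

cont_idx = list(range(3))       # columns holding the continuous variables
dumm_idx = list(range(3, 23))   # columns holding the dummies

reducer = ColumnTransformer([
    # Standardize, then run PCA on the continuous block only
    ("cont", Pipeline([("scale", StandardScaler()),
                       ("pca", PCA(n_components=2))]), cont_idx),
    # Reduce the 0-1 block separately (TruncatedSVD copes with sparse binary data)
    ("dumm", TruncatedSVD(n_components=5), dumm_idx),
])

Z = reducer.fit_transform(X)
print(Z.shape)  # (200, 7): 2 continuous components + 5 dummy components
```

The transformer concatenates the two reduced blocks for you, which is exactly the "combine two sets" step.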

The final idea is to try all the methods above and choose the one that performs best on the validation set. However, according to this question, even if the PCA approach gets the best result, it is less meaningful, so validation performance may not be a good indicator for my problem.
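If you do go the try-everything route, comparing full pipelines under cross-validation is straightforward. A sketch on synthetic data (again with `TruncatedSVD` standing in for CATPCA, and Ridge as a placeholder regressor):

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=300)  # toy target

# Each candidate couples a reduction step with the same downstream model,
# so the comparison measures the reduction, not the regressor
candidates = {
    "pca":  make_pipeline(PCA(n_components=10), Ridge()),
    "svd":  make_pipeline(TruncatedSVD(n_components=10), Ridge()),
    "none": make_pipeline(Ridge()),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```

Putting the reduction inside the pipeline matters: it is refit on each training fold, so the validation folds stay untouched.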

So, what is the dimensionality reduction method I should use on my dataset?

Carlos
  • Don't split the datasets, it will affect the weights of the predictions. You can use an advanced technique called FAMD. – Baraa Zaid May 17 '22 at 03:46
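The FAMD mentioned in the comment essentially puts both variable types on a comparable scale before a single PCA: standardize the continuous columns, and rescale each dummy column by the square root of its frequency. A rough hand-rolled sketch of that preprocessing (synthetic data; the `prince` package offers a ready-made FAMD if you prefer a library):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_cont = rng.normal(size=(100, 2))                          # continuous block
X_dumm = rng.integers(0, 2, size=(100, 6)).astype(float)    # 0-1 dummy block

# Continuous block: classic standardization
Zc = (X_cont - X_cont.mean(axis=0)) / X_cont.std(axis=0)

# Dummy block: divide each column by the square root of its frequency, then
# center, so rare and common levels contribute on a comparable scale to the
# continuous variables (this is the FAMD-style weighting)
freq = X_dumm.mean(axis=0)
Zd = X_dumm / np.sqrt(freq)
Zd = Zd - Zd.mean(axis=0)

# One ordinary PCA on the jointly weighted matrix
Z = PCA(n_components=4).fit_transform(np.hstack([Zc, Zd]))
print(Z.shape)  # (100, 4)
```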

1 Answer


While you can use PCA on binary or categorical data (e.g. one-hot encoded data), that does not mean it is a good idea or that it will work very well.

PCA is designed for continuous variables. It tries to minimize variance (i.e., squared deviations), and the concept of squared deviations breaks down when you have binary variables.

So yes, you can use PCA, and yes, you get an output. It is even a least-squares output: it's not as if PCA would segfault on such data. It works, but the result is just much less meaningful than you'd want it to be, and presumably less meaningful than, e.g., frequent pattern mining.