
I’m working on a classification model that has 7 predictors and about 100,000 observations.

My problem is that 5 of the predictors are factor variables that have hundreds of levels each.

I know that some algorithms, such as random forest, have limitations on the number of factor levels. When I tried to fit a random forest through the caret library I got an error message:

> Cannot handle categorical predictors with more than 53 categories.

I've tried some methods to bypass this limitation, such as one-hot encoding and `sparse.model.matrix`, but they didn't work, usually because my machine ran out of memory when expanding the 7 predictors into roughly 2,000 columns.
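For reference, here is a minimal sketch of the `sparse.model.matrix` approach on synthetic data (the data frame, column names `x_num`/`x_fac`, and the level count of 500 are made up for illustration; my real data has 5 such factors). The point is that the sparse encoding itself fits comfortably in memory, since only non-zero entries are stored:

```r
library(Matrix)  # ships with R; provides sparse.model.matrix()

# Hypothetical small analogue of the real data: one numeric predictor
# and one factor with many levels.
set.seed(1)
n  <- 1000
df <- data.frame(
  x_num = rnorm(n),
  x_fac = factor(sample(paste0("lvl", 1:500), n, replace = TRUE),
                 levels = paste0("lvl", 1:500))
)

# One-hot encode without ever allocating a dense n x 500 matrix.
X <- sparse.model.matrix(~ x_num + x_fac, data = df)
dim(X)  # 1000 x 501: intercept + x_num + 499 treatment dummies
```

The resulting `dgCMatrix` can then, as I understand it, be passed to a learner that accepts sparse input (e.g. glmnet or xgboost), which is the step I haven't gotten to work with caret's random forest.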

So my question is: is it possible to use these high-cardinality factors successfully in predictive algorithms? I don't want to collapse the levels down to 53, as that would lose too much information.

Any advice would be much appreciated.

guybrush
  • Yes it's possible. `sparse.model.matrix` is the way to go. I'm surprised you ran into a memory issue. It's worth exploring that further in this thread... for the "Can I do this / how / when?" type questions, please refer to some similar threads on Cross-Validated SE – MichaelChirico Feb 17 '19 at 13:09
  • @MichaelChirico Actually you are right. `sparse.model.matrix` ran fine. I just didn't succeed in feeding it to the model fit function. Can you please specify more on how to use `sparse.model.matrix` to solve the problem? – guybrush Feb 17 '19 at 13:24
  • 2
    See here https://stackoverflow.com/questions/3169371/large-scale-regression-in-r-with-a-sparse-feature-matrix – MichaelChirico Feb 17 '19 at 13:37

0 Answers