
I'm trying to fit a random forest with the following call:

movies.rf <- randomForest(Infl.Adj.Dom.BoxOffice~. -Genre -Source -ProductionMethod -CreativeType, data=Movies, subset=train)

I get

Error in randomForest.default(m, y, ...) : Can not handle categorical predictors with more than 53 categories.

After reading this I tried to check the values of my variables and got this

> length(unique(Movies$Genre))
[1] 12
> length(unique(Movies$Source))
[1] 16
> length(unique(Movies$ProductionMethod))
[1] 5
> length(unique(Movies$CreativeType))
[1] 9

Individually, none of them is greater than 53, and added together, they are less than 53. So why do I still get the error?

divibisan
Person
  • Maybe there's another variable that you think is numeric but was inadvertently imported as a factor. – joran May 03 '18 at 18:41
  • Yes you provided too little info. Examples generally have to be reproducible (see the MCVE requirement) but even that aside we can't even see the variables in the model since you used a `.` on the right hand side – Hack-R May 03 '18 at 18:59
  • OK I'll tell you what to do. Re-run it with ONLY the variables you listed in this question as predictors. I will bet 1000 to 1 that the error goes away. So just inspect your data to find the bad column you've thrown in. – Hack-R May 03 '18 at 19:01

1 Answer


If, as it seems from the context of your question, you intend to use only these four features (Genre, Source, ProductionMethod, CreativeType) to predict Infl.Adj.Dom.BoxOffice, then you are using the R formula incorrectly. Your formula

Infl.Adj.Dom.BoxOffice~. -Genre -Source -ProductionMethod -CreativeType

in fact says "predict Infl.Adj.Dom.BoxOffice using all features (.) except Genre, Source, ProductionMethod, CreativeType" (the - symbol is used for excluding variables).
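A quick way to see what a formula expands to is to ask R for its term labels. The sketch below uses a small made-up data frame (`df` and its columns `y`, `a`, `b`, `c` are purely illustrative, not part of the Movies data) and only base R's `terms()`:

```r
# Hypothetical toy data: y is the response, a/b/c are predictors
df <- data.frame(y = 1:3, a = 1:3, b = 1:3, c = 1:3)

# "." expands to every column except the response; "- a" then removes a
labels_exclude <- attr(terms(y ~ . - a, data = df), "term.labels")

# Naming a predictor explicitly uses only that predictor
labels_include <- attr(terms(y ~ a, data = df), "term.labels")

print(labels_exclude)  # the predictors left after the exclusion
print(labels_include)  # only the explicitly named predictor
```

Running the same `terms()` call on your own formula with `data = Movies` should list the predictors that are actually passed to randomForest.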

So what actually happens here is that at least one of your other features is categorical with more than 53 levels.

The correct usage, if indeed you want to use only these four features you mention, should be:

movies.rf <- randomForest(Infl.Adj.Dom.BoxOffice ~ Genre + Source + ProductionMethod + CreativeType, data=Movies, subset=train)
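
If you do want to keep the other columns, it may help to first find which one exceeds the limit. Here is a minimal sketch using base R only. The data frame is a made-up stand-in for Movies (the `Title` and `Budget` columns are assumptions for illustration); replace it with your real data:

```r
# Hypothetical stand-in for the real Movies data frame
set.seed(1)
Movies <- data.frame(
  Infl.Adj.Dom.BoxOffice = runif(100),
  Genre  = factor(sample(letters[1:12], 100, replace = TRUE)),
  Title  = factor(paste0("movie_", 1:100)),  # 100 distinct levels
  Budget = runif(100)
)

# Flag factor columns (character columns are included as a precaution)
is_cat <- sapply(Movies, function(col) is.factor(col) || is.character(col))

# Count the distinct values in each flagged column
n_levels <- sapply(Movies[is_cat], function(col) length(unique(col)))

# Columns above the 53-category limit reported in the error message
too_many <- names(n_levels)[n_levels > 53]
print(too_many)
```

Any column reported this way is a candidate for being dropped from the formula, converted to numeric if it was imported as a factor by mistake, or recoded into fewer levels.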
desertnaut