I'm trying to train a random forest (RF) model in R, but when I try to define the model:

rf <- randomForest(labs ~ .,data=as.matrix(dd.train))

It gives me the error:

Error in randomForest.default(m, y, ...) :
  Can not handle categorical predictors with more than 53 categories.

Any idea what could be causing this?

And before you say "You have a categorical variable with more than 53 categories": no, all variables except labs are numeric.

Tim Biegeleisen: Read the last line of my question and you will see why it is not the same as the one you are linking!

Ghost
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Sounds like your data might not be properly encoded. – MrFlick Jan 20 '20 at 16:43
  • use `forcats::fct_lump()` to reduce number of factors – jyjek Jan 20 '20 at 16:44
  • @jyjek The factor variable only has 26 categories. – Ghost Jan 20 '20 at 16:46
  • @MrFlick Data is properly encoded: 256 `numeric` matrix columns and one categorical column with 26 levels (labs, the response in the formula). Everything has been triple-checked for stray class misassignments. I can't (and won't) provide a "working example", since I'm using a very large data input (which is confidential, by the way) and the question is not of the form "I'm at A and I don't know how to reach B". I suspect the problem may be that there are too many numeric columns in the formula and it is erroneously reported as a too-many-categories error. – Ghost Jan 20 '20 at 16:54

1 Answer

Edited to address followup from OP

I believe the as.matrix call is the culprit. Calling as.matrix on a data frame that contains a factor column produces a character matrix, so the numeric columns become strings and, I believe, end up being treated as factors. The coercion is also not necessary for this package. You can keep the data as a data frame, but you will need to make sure that any unused factor levels are dropped with droplevels (or something similar). There are many reasons an unused factor level may be in your data set, but a common one is a dropped observation.

Below is a quick example that reproduces your error:

library('randomForest')

#making a toy data frame
x <- data.frame('one' = c(1,1,1,1,1,seq(50) ),
       'two' = c(seq(54),NA),
       'three' = seq(55),
       'four' = seq(55) )

x$one <- as.factor(x$one)

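# Drop the row containing the NA. Note this removes the whole row,
# which leaves level "50" of x$one without any observations.
x <- na.omit(x)

# your first error: "Can not handle categorical predictors with more than 53 categories."
randomForest(one ~ ., data = as.matrix(x))
# your second error: "Can't have empty classes in y."
randomForest(one ~ ., data = x)

# Remove the unused factor levels
x <- droplevels(x)

randomForest(one ~ ., data = x) # OK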
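
To see what the matrix coercion does to the column types, here is a minimal base-R sketch. The data frame and its column names (`a`, `b`, `labs`) are made up for illustration and are not the OP's data. It only shows that `as.matrix` turns every column into strings once a factor column is present. How those strings are then handled depends on the R version and on the package internals, so treat the link to the error message as my assumption.

```r
# Toy data frame (hypothetical names): two numeric columns plus one factor column
df <- data.frame(a = c(1.5, 2.5, 3.5),
                 b = c(10, 20, 30),
                 labs = factor(c("x", "y", "x")))

# A matrix can hold only one type, so the factor column forces
# every cell, including the numeric ones, to become a string
m <- as.matrix(df)
typeof(m)          # "character"
class(m[, "a"])    # "character", no longer numeric

# Keeping the data frame preserves the per-column types
sapply(df, class)
```

With many distinct numeric values, each former numeric column can then look like a categorical predictor with one category per distinct value.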
Peter_Evan
  • Without the matrix coercion for the `data` argument, it gives this error: `Error in randomForest.default(m, y, ...) : Can't have empty classes in y.` – Ghost Jan 20 '20 at 17:18
  • Already tried it, and it's not working either. But I just realized something: when I split into the train/test subsets, the train data does not contain cases of every category of the y variable in the formula. Do you think that could be the problem? – Ghost Jan 20 '20 at 17:40
  • I found the problem! Some category names in the factor column contained whitespace, and those somehow turned into NULL values (no idea why; I found out by getting a `table` of the labs column). I fixed it with `gsub(" ",".",labs)`. – Ghost Jan 20 '20 at 17:46
  • I'm going to mark your answer as accepted though, since it can be a good guide for other people dealing with the same issues in rf. – Ghost Jan 20 '20 at 19:11
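
The checks discussed in the comments above can be sketched roughly as follows, using base R only. The label vector here is invented for illustration and stands in for the real labs column. The `trimws` step and the filter that imitates a train/test split are my additions, not part of the OP's fix, which was just the `gsub` call.

```r
# Hypothetical label vector with stray whitespace (not the OP's data)
labs <- factor(c("a b", "c", " c", "d ", "a b"))

# Inspect the raw categories first; near-duplicates such as "c" and " c" show up here
table(labs, useNA = "ifany")

# Trim leading/trailing whitespace (assumption: it is not meaningful),
# replace inner spaces with dots as in the OP's gsub fix, then rebuild the factor
clean <- factor(gsub(" ", ".", trimws(as.character(labs)), fixed = TRUE))
levels(clean)

# Imitate a split in which one class never reaches the training rows,
# then drop the levels that no longer occur
train_labs <- droplevels(clean[clean != "d"])
levels(train_labs)
```

Running `table` on the response before and after the split is a cheap way to spot both malformed category names and classes that ended up with zero training rows.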