I am trying to create a classification tree in R using the package tree.

This is an excerpt of the dataset I am using (header included):

CENTRO_EXAMEN,NOMBRE_AUTOESCUELA,MES,TIPO_EXAMEN,NOMBRE_PERMISO,PROB
Alcalá de Henares,17APTOV,5,PRUEBA DESTREZA,A2 ,0
Alcalá de Henares,17APTOV,5,PRUEBA CONDUCCION Y CIRCULACION,B  ,0.8
Alcalá de Henares,17APTOV,5,PRUEBA TEORICA,B  ,0.333333333
Alcalá de Henares,2000,5,PRUEBA TEORICA,B  ,0

and these are the commands I am issuing to R:

madrid=read.csv("madrid.csv",header=T,na.strings="?")
#madrid=na.omit(madrid)
names(madrid)
dim(madrid)
fix(madrid)
library(tree)
attach(madrid)

# build the tree
High=ifelse(PROB<=0.5,"No","Yes")
madrid=data.frame(madrid,High)
tree.madrid=tree(High~CENTRO_EXAMEN+NOMBRE_AUTOESCUELA+MES+TIPO_EXAMEN+NOMBRE_PERMISO,madrid)
summary(tree.madrid)
plot(tree.madrid)
text(tree.madrid,pretty=0)
tree.madrid

R returns the following error at the tree() call:

Error in tree(High ~ CENTRO_EXAMEN + NOMBRE_AUTOESCUELA + MES + TIPO_EXAMEN +  : 
  factor predictors must have at most 32 levels

Any idea why?

user3161330

1 Answer

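Basically, searching for the best split on a factor gets computationally expensive as the number of levels grows: a factor with k levels admits 2^(k-1) - 1 distinct binary splits, which is about 2^31 (roughly 2 billion) at 32 levels. The tree package caps factor predictors at 32 levels, which is the limit your error message reports.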

If you are able to use a random forest, Ben's comment here suggests that the randomForest package can now handle factors with up to 53 levels. If you cannot use a random forest for whatever reason, consider collapsing the levels of the offending categorical predictor.
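As a rough sketch of what collapsing could look like, the snippet below counts the levels of each factor column and then lumps the rarest levels of one column into a single catch-all level. The toy data, the helper name `collapse_levels`, and the "Other" label are all made up for illustration; only the column names come from the question. It also assumes the columns are already factors (in recent R versions `read.csv` needs `stringsAsFactors = TRUE` for that).

```r
# Toy data standing in for madrid.csv; only the column names come from the question.
set.seed(1)
madrid <- data.frame(
  NOMBRE_AUTOESCUELA = factor(sample(sprintf("school_%02d", 1:60), 500,
                                     replace = TRUE,
                                     prob = c(rep(5, 10), rep(1, 50)))),
  TIPO_EXAMEN = factor(sample(c("PRUEBA TEORICA", "PRUEBA DESTREZA"), 500,
                              replace = TRUE))
)

# Number of levels per factor column; any count above 32 triggers the tree() error.
level_counts <- sapply(Filter(is.factor, madrid), nlevels)
print(level_counts)

# Hypothetical helper: keep the (max_levels - 1) most frequent levels and
# lump every other non-missing value into a single `other` level.
collapse_levels <- function(x, max_levels = 32, other = "Other") {
  x <- factor(x)
  if (nlevels(x) <= max_levels) return(x)
  keep <- names(sort(table(x), decreasing = TRUE))[seq_len(max_levels - 1)]
  out <- as.character(x)
  out[!is.na(out) & !(out %in% keep)] <- other
  factor(out)
}

madrid$NOMBRE_AUTOESCUELA <- collapse_levels(madrid$NOMBRE_AUTOESCUELA)
print(nlevels(madrid$NOMBRE_AUTOESCUELA))
```

Lumping by frequency is only one option. Grouping by meaning (for example by region or by school size) may keep more signal if you know the data.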

ZachTurn
  • I can use randomForest and I tried, but apparently my dataset produces more than 53 categorical predictors. Why is that, in your opinion? Is it because of the number of different values that each variable can have? – user3161330 Jun 07 '16 at 19:27
  • @user3161330 Precisely. When you have a categorical variable, a level is a unique value. If you do `length(levels(data$factor_variable))` it will return how many levels are in your variable. This only works for factors, though. If you wanted the number of distinct character values you could do `length(unique(data$character_variable))` – ZachTurn Jun 07 '16 at 20:39
  • Now I see... Any idea what I could do? I know you have no insight into the dataset, but is there any common technique to... reduce the number of values? Or should I just get rid of the variable with too many values? – user3161330 Jun 08 '16 at 08:51
  • It's tricky. One way would be to apply some knowledge of the variable to reduce the number of levels (e.g. if you have geographic data, group it into larger geographic zones). You could also screen the variable to see if it has much predictive power (perhaps fit a univariate logistic regression and gauge whether the variable appears to be useful, and if not, perhaps remove it). – ZachTurn Jun 08 '16 at 13:48
  • You could also attempt to convert the categorical data to numeric data by doing something called Weight of Evidence coding. You can find more information about it [here](http://support.sas.com/resources/papers/proceedings13/095-2013.pdf). It isn't a perfect method but it can be a useful option – ZachTurn Jun 08 '16 at 13:51
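
For reference, Weight of Evidence coding as mentioned in the last comment might be sketched roughly as follows. This is not taken from the linked paper: the helper name `woe_encode`, the smoothing constant `adj`, and the toy data are assumptions made for illustration, and the sign convention (events over non-events) varies between sources.

```r
# Toy data: a 40-level factor and a binary response like High in the question.
set.seed(2)
school <- factor(sample(sprintf("school_%02d", 1:40), 400, replace = TRUE))
high   <- factor(sample(c("No", "Yes"), 400, replace = TRUE))

# Hypothetical helper. For each level:
#   WoE = log( share of all "Yes" rows in the level / share of all "No" rows in the level )
# `adj` is a small smoothing count (an assumption) that avoids log(0).
woe_encode <- function(x, y, event = "Yes", adj = 0.5) {
  x <- factor(x)
  n_event    <- tapply(y == event, x, sum) + adj
  n_nonevent <- tapply(y != event, x, sum) + adj
  woe <- log((n_event / sum(n_event)) / (n_nonevent / sum(n_nonevent)))
  # Map each observation to the WoE value of its level.
  as.numeric(woe[as.character(x)])
}

school_woe <- woe_encode(school, high)
print(summary(school_woe))
```

The result is a numeric column, so the 32-level limit no longer applies to it. Because the encoding uses the response, it should be computed on the training data only and then applied to any hold-out data.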