I have a large data set, which I reduced by applying gsub multiple times, basically in this form:

levels(Orders$Im) <- gsub("Osp", "OsProf", levels(Orders$Im))

I also used rbind:

DI_Reduced <- rbind(CX, OsP)

I need to apply the function "tree" to the resulting data.frame, but I get an error:

tree.model <- tree(line ~ CountryCode + OrderType + Support, data=train.set)

The error is:

Error in tree(line ~ CountryCode + OrderType + Support,  : 
  factor predictors must have at most 32 levels

Strange thing: if I export train.set with write.csv and then re-import it with read.csv, the error disappears and the tree is built. I investigated the structure of train.set, and this is the difference before and after exporting/importing it:

$ CustomerNumber: Factor w/ 4616 levels "0","101959","210285",..: 3070 3069 4539 3732 2573 3086 2973 3817 4056 2956 ...
 $ CountryCode        : Factor w/ 4 levels "OtherCountry",..: 3 3 4 4 3 3 3 4 4 3 ...
 $ OrderType          : Factor w/ 5 levels "Order","NewOrder",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ Support   : Factor w/ 5 levels "#N/A","BN",..: 4 4 4 4 2 4 4 4 4 4 ...
 $ Manuf      : Factor w/ 163 levels "<Generic>","6gi",..: 52 52 52 52 52 52 52 52 52 52 ...
 $ line       : Factor w/ 623 levels "\"Generic\" Skews",..: 400 35 400 400 400 400 400 400 400 400 ...
 ________________________________________________________________
 
  $ CustomerNumber: Factor w/ 692 levels "201500","20202",..: 361 360 680 499 138 367 315 523 592 304 ...
 $ CountryCode        : Factor w/ 2 levels "JP","US": 1 1 2 2 1 1 1 2 2 1 ...
 $ OrderType          : Factor w/ 1 level "Online": 1 1 1 1 1 1 1 1 1 1 ...
 $ Support   : Factor w/ 4 levels "BN","MC",..: 3 3 3 3 1 3 3 3 3 3 ...
 $ Manuf      : Factor w/ 1 level "DY": 1 1 1 1 1 1 1 1 1 1 ...
 $ line       : Factor w/ 2 levels "CX","OTX": 2 1 2 2 2 2 2 2 2 2 ...

It seems to me that gsub does not really subset the original data.frame, and the hidden values stay in train.set until I export/import it. Is there another way to do this operation and obtain a tree?

gmt
  • There is no `gsub` in the code shown. It is not clear. Perhaps you need `droplevels(train.set)` – akrun Dec 10 '17 at 11:47
  • `gsub` does not subtract, it just finds a pattern according to the regex applied. Please show how you subset the data through a [minimal working example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Roman Luštrik Dec 10 '17 at 11:54
  • I updated my post; if that is not enough I will write it out more extensively, but it is quite long code. BTW I understood that gsub does not subtract, but I would like to find an alternative to exporting the file as a .csv and re-importing it to obtain the same result, if possible. – gmt Dec 10 '17 at 11:57

1 Answer

As the error says, your dependent variable line has more than 32 levels: the structure of your train.set shows line as a Factor w/ 623 levels.

Try using other tree libraries like rpart.
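A minimal sketch of what an rpart call could look like. The data frame toy is a hypothetical stand-in for train.set: the column names follow the question, but the values (including the level "PX") are made up for illustration.

```r
library(rpart)  # recommended package, shipped with standard R distributions

# Hypothetical stand-in for train.set; values are invented for illustration
set.seed(1)
toy <- data.frame(
  line        = factor(sample(c("CX", "OTX"), 200, replace = TRUE)),
  CountryCode = factor(sample(c("JP", "US"), 200, replace = TRUE)),
  Support     = factor(sample(c("BN", "MC", "PX"), 200, replace = TRUE))
)

# method = "class" requests a classification tree for a factor response
fit <- rpart(line ~ CountryCode + Support, data = toy, method = "class")
print(fit)
```

With your own data you would pass train.set and the original formula instead of the toy inputs.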

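Re-creating the factors after subsetting might help: a subset keeps all the levels of the original factor, and re-creating the factor drops the unused ones, which is what the write.csv/read.csv round trip did for you.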

train.set[] <- lapply(train.set, function(x) if (is.factor(x)) factor(x) else x)
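To see the effect in isolation, here is a small sketch on toy data (the data frame df and its values are hypothetical, not taken from the question). It uses droplevels(), as suggested in the comments, which rebuilds every factor column of a data frame in one call.

```r
# Toy data (hypothetical): a factor column with three levels
df <- data.frame(line = factor(c("CX", "OTX", "ZZ", "CX")),
                 n    = 1:4)

# Subsetting rows keeps every original level, even those no longer present
sub <- df[df$line != "ZZ", ]
nlevels(sub$line)   # 3

# droplevels() rebuilds each factor column with only the levels still in use
sub <- droplevels(sub)
nlevels(sub$line)   # 2
levels(sub$line)    # "CX" "OTX"
```

Running str(train.set) after the same call on your data should show whether the level counts have shrunk to the values you saw after re-importing the CSV.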

Also, gsub is not usually used for subsetting; it is a global substitution function. You should also share the pre-processing steps you followed so that others can help you better.

Deepak Sadulla