
I am building a decision tree classification model. All of my feature variables and the label variable are factors. When I split my data set into training and testing sets, the two subsets contain unused levels. If I drop those levels on the two subsets, the predictive results are very different and the accuracy is much lower.

I am wondering what the proper way is to deal with this level issue, both in the context of predictive modeling and in other situations. Any suggestions?
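To illustrate what I mean by unused levels (a toy sketch, not the actual data): subsetting a factor keeps all of the original levels, and `droplevels()` removes the ones that no longer occur.

f <- factor(c("a", "b", "c"))
f_sub <- f[f != "c"]          # subset no longer contains "c"
levels(f_sub)                 # "a" "b" "c" -- "c" is kept as an unused level
levels(droplevels(f_sub))     # "a" "b"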

Here is a reproducible example using the sample data set `solder` from the `rpart` package. I chose `Solder` as my label variable; it is a balanced data set.

library(rpart)          # provides the solder data set
solder_data <- solder

## split into training and test sets
set.seed(11)
g <- runif(nrow(solder_data))   # random order for the rows
solder_data <- solder_data[order(g), ]
ss <- sample(1:nrow(solder_data), size = 0.7 * nrow(solder_data))
solder.Train <- solder_data[ss, ]
solder.Test <- subset(solder_data[-ss, ], Opening == 'S')
dl_solder.Test <- droplevels(solder.Test)  # drop unused levels in the test set

str(solder.Test)              # Opening has 3 levels
str(droplevels(solder.Test))  # Opening has 1 level
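As a sanity check (a sketch I added; plain base R, nothing RevoScaleR-specific), one can confirm that every factor level actually occurring in the test set also occurs in the training set, i.e. there are no truly unseen categories:

# For each factor column, check that the values observed in the test
# set are a subset of the values observed in the training set.
fac_vars <- names(Filter(is.factor, solder.Test))
sapply(fac_vars, function(v)
  all(unique(as.character(solder.Test[[v]])) %in%
      unique(as.character(solder.Train[[v]]))))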

# build the model
library(RevoScaleR)
rxfit <- rxDTree(Solder ~ Opening + skips + Mask + PadType + Panel,
                 data = solder.Train)

# test the model on the test set before dropping levels
rxpred <- rxPredict(rxfit, data = solder.Test, extraVarsToWrite = "Solder")
rxpred$Predicted <- ifelse(rxpred$Thick_prob <= rxpred$Thin_prob,
                           "Thin", "Thick")
mean(rxpred$Predicted != rxpred$Solder)  # misclassification rate is 0.1428571

# test the model on the test set after dropping levels
rxpred_dl <- rxPredict(rxfit, data = dl_solder.Test,
                       extraVarsToWrite = "Solder")
rxpred_dl$Predicted <- ifelse(rxpred_dl$Thick_prob <= rxpred_dl$Thin_prob,
                              "Thin", "Thick")
mean(rxpred_dl$Predicted != rxpred_dl$Solder)  # misclassification rate is 0.3714286

Why does dropping the unused levels in the test data set lead to different predicted results? Which is the right way to do prediction?
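One possible mechanism, which is only my guess and not verified against the RevoScaleR internals: `droplevels()` renumbers the underlying integer codes of a factor, so if the prediction step matches levels by position rather than by name, the "S" rows in the test set no longer line up with the "S" level the model was trained on. A minimal base-R sketch:

x <- factor(c("L", "M", "S"))   # levels "L" "M" "S"
s <- x[x == "S"]                # subset still carries all 3 levels
as.integer(s)                   # 3 -- "S" is the third level
as.integer(droplevels(s))       # 1 -- after dropping, "S" is the only level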

Y.Li
  • please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Adam Quek May 02 '17 at 03:45
  • The training set should contain a representative sample. Hence, every level of the predictive covariates should have an adequate number of cases. If one or more of these categories is empty in the training set, you cannot get a prediction for cases that fall into those categories in the testing set. – Marco Sandri May 02 '17 at 13:47
  • @MarcoSandri I believe my training set contains all the levels that are in the testing set. – Y.Li May 08 '17 at 05:41
  • @AdamQuek Here is an example using the sample data **solder** in the `rpart` package. Due to the character limit in comments, I will post it in my question. – Y.Li May 08 '17 at 05:44
  • After reading the answers to more related questions, I think I should simplify my question to: if I split my training set and testing set from a single data set, should I drop unused levels before training and validating my model? – Y.Li May 11 '17 at 11:23
