I am building a decision tree classification model. All of my feature variables and my label variable are factors. When I split my data set into training and testing sets, the two subsets contain unused factor levels. If I drop the unused levels on those subsets, the predictions become very different and the accuracy is much lower.
I am wondering what the proper way is to deal with this level issue, both for predictive modeling and in other situations. Any suggestions?
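To make the level issue concrete, here is a small toy illustration, separate from the modeling code below: subsetting a factor keeps all of its original levels, and droplevels() removes the ones that no longer occur.

f <- factor(c("a", "b", "c"))
levels(f[1])              # the one-row subset still lists all three levels
levels(droplevels(f[1]))  # only the level that actually occurs remains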
Here is a reproducible example using the solder sample data from the rpart package. I chose Solder as my label variable; it is a balanced data set.
library(rpart)                # the solder data set comes from the rpart package
solder_data <- solder
## split into a training set and a test set
set.seed(11)
g <- runif(nrow(solder_data))          # random order for shuffling the data set
solder_data <- solder_data[order(g), ]
ss <- sample(1:nrow(solder_data), size = 0.7 * nrow(solder_data))
solder.Train <- solder_data[ss, ]
solder.Test <- subset(solder_data[-ss, ], Opening == 'S')
dl_solder.Test <- droplevels(solder.Test)  # drop unused levels in the test set
str(solder.Test)              # Opening has 3 levels
str(droplevels(solder.Test))  # Opening has 1 level
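Comparing the factor levels directly shows the same thing (just a base-R sanity check):

levels(solder.Test$Opening)     # still lists all three levels of Opening
levels(dl_solder.Test$Opening)  # only 'S' remains after droplevels()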
# build the model
library(RevoScaleR)
rxfit <- rxDTree(Solder ~ Opening + skips + Mask + PadType + Panel,
                 data = solder.Train)
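As a reference point, the model is fit on solder.Train, which still carries the full level set for every factor (a plain base-R check, nothing RevoScaleR specific):

sapply(Filter(is.factor, solder.Train), nlevels)  # number of levels for each factor column used in training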
# score the test set before dropping levels
rxpred <- rxPredict(rxfit, data = solder.Test, extraVarsToWrite = "Solder")
rxpred$Predicted <- ifelse(rxpred$Thick_prob <= rxpred$Thin_prob,
                           "Thin", "Thick")
mean(rxpred$Predicted != rxpred$Solder)  # misclassification rate is 0.1428571
# score the test set after dropping levels
rxpred_dl <- rxPredict(rxfit, data = dl_solder.Test,
                       extraVarsToWrite = "Solder")
rxpred_dl$Predicted <- ifelse(rxpred_dl$Thick_prob <= rxpred_dl$Thin_prob,
                              "Thin", "Thick")
mean(rxpred_dl$Predicted != rxpred_dl$Solder)
# misclassification rate is 0.3714286
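For completeness, the two sets of predictions can also be compared as confusion matrices (plain table() calls):

table(Predicted = rxpred$Predicted,    Actual = rxpred$Solder)     # before droplevels()
table(Predicted = rxpred_dl$Predicted, Actual = rxpred_dl$Solder)  # after droplevels()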
Why does dropping the unused levels in the test data set lead to different predicted results? Which is the right way to do the prediction?