Confusion matrix creation issue

Question

fit <- rpart(unacc~., data = carTrain, method = 'class')

I have fitted a decision tree on carTrain

and made predictions with

predict_unseen <- predict(fit,carTest, type = 'class')

Here carTest is the unseen data to predict on.

Now I am creating a confusion matrix:

confusionMatrix(carTest$unacc,predict_unseen)

but I am getting this error:

confusionMatrix(carTest$unacc,predict_unseen)

Error in confusionMatrix.default(carTest$unacc, predict_unseen) : the data cannot have more levels than the reference

This is a methodology issue and is not off topic on SO. The error message is pretty clear: you have levels in your test set that are not included in your training set. Your model cannot account for outcomes that it has not seen. You should use stratified sampling to select your training sample to assure that all outcome levels are included. — lmo, May 04 '19 at 11:46
`set.seed(3456); trainIndex <- createDataPartition(car_data$unacc, p = .7, list = FALSE, times = 1); carTrain <- car_data[trainIndex, ]; carTest <- car_data[-trainIndex, ]` — CrazyPanda, May 04 '19 at 12:14
You need to stratify your partitions. Not sure if this is available in `caret`, but you can check the documentation of that function to see if there is a stratify argument. — lmo, May 04 '19 at 12:17
Please provide a fully reproducible example. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — bbiasi, May 04 '19 at 18:39

score 0 · Answer 1 · answered May 04 '19 at 19:11

library(rpart)
library(imptree)
data(carEvaluation)

table(carEvaluation$acceptance)

> table(carEvaluation$acceptance)

  acc  good unacc vgood 
  384    69  1210    65

Note that in this data set unacc is just one of the levels of the acceptance attribute, so acceptance is the response to model.

So you can do something like this:

set.seed(3456)
# Split 80% / 20% on the outcome column
train <- caret::createDataPartition(carEvaluation$acceptance, p = .8,
                                    list = FALSE)
carTrain <- carEvaluation[train, ]
carTest  <- carEvaluation[-train, ]
fit <- rpart::rpart(acceptance ~ ., data = carTrain, method = 'class')

# Keep observed and predicted classes side by side
df <- data.frame(obs  = carTest$acceptance,
                 pred = predict(fit, newdata = carTest, type = "class"))
# data = predicted classes, reference = observed classes
cfm <- caret::confusionMatrix(data = df$pred, reference = df$obs)
cfm

> cfm
Confusion Matrix and Statistics

          Reference
Prediction acc good unacc vgood
     acc    70    0    10     2
     good    5   12     1     0
     unacc   1    0   231     0
     vgood   0    1     0    11

Overall Statistics

               Accuracy : 0.9419          
                 95% CI : (0.9116, 0.9641)
    No Information Rate : 0.7035          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.8762          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: acc Class: good Class: unacc Class: vgood
Sensitivity              0.9211     0.92308       0.9545      0.84615
Specificity              0.9552     0.98187       0.9902      0.99698
Pos Pred Value           0.8537     0.66667       0.9957      0.91667
Neg Pred Value           0.9771     0.99693       0.9018      0.99398
Prevalence               0.2209     0.03779       0.7035      0.03779
Detection Rate           0.2035     0.03488       0.6715      0.03198
Detection Prevalence     0.2384     0.05233       0.6744      0.03488
Balanced Accuracy        0.9381     0.95248       0.9724      0.92157
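
If the error from the question still appears, it may help to compare the factor levels of the two vectors before calling confusionMatrix, since the message says the first argument (data) has more levels than the second (reference). The following is only a base-R sketch on made-up toy factors; the names obs, pred, all_levels and conf_tab are hypothetical and not part of the car data.

```r
# Toy factors (hypothetical values, not taken from the car data set)
obs  <- factor(c("acc", "unacc", "good", "unacc", "vgood"))
pred <- factor(c("acc", "unacc", "unacc", "unacc", "acc"))

# Levels that appear in one factor but not in the other
missing_in_pred <- setdiff(levels(obs), levels(pred))
print(missing_in_pred)

# Re-create both factors on one shared set of levels
all_levels   <- union(levels(obs), levels(pred))
obs_aligned  <- factor(obs,  levels = all_levels)
pred_aligned <- factor(pred, levels = all_levels)

# Base-R cross-tabulation: rows are predictions, columns are observations
conf_tab <- table(Prediction = pred_aligned, Reference = obs_aligned)
print(conf_tab)
```

Once both factors share the same levels, they can be passed to caret::confusionMatrix in the same way as df$pred and df$obs above.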

You do not need to follow this code exactly. I suggest consulting the documentation of the caret and rpart packages for ways to improve it. Alternatively, you can provide a fully reproducible example.
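
On the stratification point raised in the comments: according to the caret documentation, createDataPartition samples within the levels of a factor outcome, so it should already keep every class in both partitions. To make the idea concrete, here is a minimal base-R sketch of a stratified 80/20 split. The outcome vector y and its class counts are invented for illustration, and the helper names are hypothetical.

```r
set.seed(3456)

# Hypothetical imbalanced outcome (counts are made up for this sketch)
y <- factor(rep(c("acc", "good", "unacc", "vgood"),
                times = c(40, 10, 120, 10)))

# Group row indices by class, then draw 80% of the indices within each class
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.8 * length(i)))]
}), use.names = FALSE)

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Every class should be present in both parts
print(table(train_y))
print(table(test_y))
```

Sampling within each class, rather than over all rows at once, is what keeps rare classes such as good and vgood from disappearing from either partition.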
