
I am using naive Bayes to classify my observations into 3 classes, S1, S2 and S3, depending on the value of the variable SC_3ans. However, it never classifies anything into S2, even though it should. As you can see in the confusion matrix, 0 observations were classified as S2. I tried changing the size of the testing set, but that changed nothing. How can I fix this?

set.seed(2)

# 70/30 train/test split
id <- sample(2, nrow(Data), prob = c(0.7, 0.3), replace = TRUE)
Datatrain <- Data[id == 1, ]
Datatest  <- Data[id == 2, ]

library(e1071)
library(caret)

# Response and predictors
y <- Datatrain$SC_3ans_segment
x <- Datatrain[, names(Datatrain) %in% c("TYPE_CONTRACTUALISATION","WEBSERVICE_MANUEL","REGION","RENFORT","ECO","TRANCHE_ANC_2021","GAR_PRODUIT","TRANCHE_AGE","SOU_GRP_SITUATION_FAMILLE","REGIME","Type_Distribution","PTF_2022","GAR_FORMULE_GROUPE")]

# 10-fold cross-validated naive Bayes, then evaluation on the test set
Data_nb_model <- caret::train(x, y, 'nb', trControl = trainControl(method = 'cv', number = 10))
Test_model <- predict(object = Data_nb_model, newdata = Datatest)
confusionMatrix(table(Test_model, Datatest$SC_3ans_segment))

This is the output:

Confusion Matrix and Statistics

    Test_model    S1    S2    S3
            S1 10349  1023  4913
            S2     0     0     0
            S3  1637   231  1492

 Overall Statistics
                                      
           Accuracy : 0.6027          
             95% CI : (0.5959, 0.6096)
No Information Rate : 0.6101          
P-Value [Acc > NIR] : 0.9833          
                                      
              Kappa : 0.094           
                                      
Mcnemar's Test P-Value : <2e-16          

Statistics by Class:

                 Class: S1 Class: S2 Class: S3
Sensitivity             0.8634   0.00000   0.23294
Specificity             0.2250   1.00000   0.85891
Pos Pred Value          0.6355       NaN   0.44405
Neg Pred Value          0.5128   0.93617   0.69831
Prevalence              0.6101   0.06383   0.32604
Detection Rate          0.5268   0.00000   0.07595
Detection Prevalence    0.8290   0.00000   0.17104
Balanced Accuracy       0.5442   0.50000   0.54593
Jia Hannah
  • Looks like the model is highly overfitted to class S1? How was the data used for training? Are the classes unbalanced, so that the number of S1 >>> number of S2 & S3? – RobertoT Aug 10 '22 at 08:48
  • @RobertoT yes, in the database there are 39475 observations that should belong to S1, 4154 in S2 and 21325 in S3. – Jia Hannah Aug 10 '22 at 08:55
  • This is not a duplicate but I recently found [this](https://stackoverflow.com/questions/41488279/neural-network-always-predicts-the-same-class) helpful for similar problems with a model in Python. Although the code is in Python and it's mostly about neural networks, some of the debugging methods apply to any language/model (e.g. create a dataset of only one data point of class i and see what the model predicts). – SamR Aug 10 '22 at 08:55
  • @SamR When I test the model with a set of only S2 observations: `Data_S2 <- subset(Data, Data$SC_3ans_segment=="S2"); Data_nb_model2 <- caret::train(y=Data_S2$SC_3ans_segment, x=Data_S2[, names(Data_S2) %in% c("TYPE_CONTRACTUALISATION","WEBSERVICE_MANUEL","REGION","RENFORT","ECO","TRANCHE_ANC_2021","GAR_PRODUIT","TRANCHE_AGE","SOU_GRP_SITUATION_FAMILLE","REGIME","Type_Distribution","PTF_2022","GAR_FORMULE_GROUPE")], 'nb', trControl=trainControl(method='cv',number=10)); Data_nb_model2`, I get this error: `Something is wrong; all the Accuracy metric values are missing: Error: Stopping` – Jia Hannah Aug 10 '22 at 09:05
  • @JiaHannah I'm not familiar with naive bayes in R. Do you get that error if you try to train it on one class of a built-in dataset like `iris`? If not it might be a clue... – SamR Aug 10 '22 at 09:08
  • This is probably an issue around applying default decision thresholds. When using `predict` try specifying that you want probabilities. The syntax varies by model, but it's typically done using something like `type="response"`. You can then apply a more suitable decision threshold to your predicted probabilities to get classifications that suit your data. – rw2 Aug 10 '22 at 09:25
  • @rw2 I tried it, nothing changed – Jia Hannah Aug 10 '22 at 09:46

1 Answer


There are two problems at work here that you need to think about.

1.) Is this really a hard classification problem, or am I interested in the likelihood of any given item being in class 1-3?

2.) How do I deal with highly imbalanced data?

1.) In general, the raw class output of a classification model is not what you should be looking at. Most models do not predict something to be in class x or y; they estimate how likely it is to be in class x and how likely it is to be in class y, and then apply a cutoff value to decide which class the item lands in.

If you have three classes, most models will default to a likelihood of >30% as the cutoff. However, this doesn't make sense for classes that naturally occur at a rate below 30%. If a model predicts a 40% likelihood for something being class 2, and class 2 only occurs naturally at a 20% rate, the "uplift" of its prediction is 2x; that is more remarkable than predicting a 60% likelihood for a class that occurs at a 70% rate.
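As a quick sanity check of that arithmetic in R (numbers taken from the example above):

# Uplift = predicted probability / base rate of the class
p_pred <- 0.40  # model says 40% chance of class 2
p_base <- 0.20  # class 2 makes up 20% of the data
p_pred / p_base # uplift of 2x, versus 0.60 / 0.70 ~ 0.86x for the other case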

Choose your cutoff values depending on your domain, and your confusion matrix will soon start to look very different.
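With caret, `predict(..., type = "prob")` returns the per-class probabilities, and you can then apply your own decision rule. Below is a minimal sketch using the objects from your code; the prevalence-based uplift rule is just one reasonable choice, not the only one:

# Class probabilities instead of hard labels
probs <- predict(Data_nb_model, newdata = Datatest, type = "prob")

# Training-set prevalence of each class, in the same column order as probs
prev <- prop.table(table(Datatrain$SC_3ans_segment))[colnames(probs)]

# Uplift: predicted probability relative to each class's base rate
uplift <- sweep(as.matrix(probs), 2, as.numeric(prev), "/")

# Assign each observation to the class with the highest uplift
pred <- factor(colnames(uplift)[max.col(uplift)], levels = colnames(uplift))
confusionMatrix(table(pred, Datatest$SC_3ans_segment))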

2.) In some cases the imbalance may be so severe that it prevents any model from performing well, even after inspecting the probability output. In that case, try resampling approaches such as SMOTE.
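For example, caret can resample inside each cross-validation fold via the `sampling` argument of `trainControl`: `"down"` and `"up"` work out of the box, while `"smote"` additionally needs the DMwR package installed. A minimal sketch with up-sampling, reusing `x` and `y` from the question:

# Re-train with minority classes up-sampled inside each CV fold;
# swap in sampling = 'smote' if the DMwR package is available
ctrl <- trainControl(method = 'cv', number = 10, sampling = 'up')
Data_nb_balanced <- caret::train(x, y, 'nb', trControl = ctrl)
confusionMatrix(table(predict(Data_nb_balanced, newdata = Datatest),
                      Datatest$SC_3ans_segment))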

Fnguyen