I have built a Random Forest model to predict whether a customer's operations are fraudulent or not. The sample is large and quite unbalanced, with about 3% fraud cases, and I want to predict the minority class (fraud).
I balance the data (50% each class) and build the RF. So far, the model looks good, with an overall accuracy of ~80% and more than 70% of fraud cases predicted correctly. But when I try the model on unseen data (the test set), although the overall accuracy is still good, the negative predictive value (fraud) is really low compared to the training data (only 13% vs. more than 70%).
I have tried increasing the sample size, increasing the number of cases in the balanced classes, and tuning the RF parameters, among other things, but none of these has helped; the results stay about the same. Am I overfitting somehow? What can I do to improve fraud detection (negative predictive value) on unseen data?
Here is the code and results:
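library(ROSE)          # ovun.sample()
library(randomForest)  # randomForest()
library(caret)         # confusionMatrix()
set.seed(1234)
# Train and test sets (70% / 30% split by row index)
train_idx <- sample(nrow(dataset), 0.7 * nrow(dataset))
train <- dataset[train_idx, ]
test <- dataset[-train_idx, ]
# Balance the training data only
balanced <- ovun.sample(custom21_type ~ ., data = train, method = "over", p = 0.5, seed = 1)$data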
table(balanced$custom21_type)
   0    1
5813 5861
# Build the RF
rf5 <- randomForest(custom21_type ~ ., data = balanced, ntree = 100, mtry = 3, importance = TRUE, keep.inbag = TRUE)
rf5
Call:
randomForest(formula = custom21_type ~ ., data = balanced, ntree = 100, importance = TRUE, mtry = 3, keep.inbag = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 3
OOB estimate of error rate: 21.47%
Confusion matrix:
     0    1 class.error
0 4713 1100   0.1892310
1 1406 4455   0.2398908
# Test on unseen data
predicted <- predict(rf5, newdata = test)
confusionMatrix(predicted, test$custom21_type)
Confusion Matrix and Statistics
          Reference
Prediction     0     1
         0 59722   559
         1 13188  1938
Accuracy : 0.8177
95% CI : (0.8149, 0.8204)
No Information Rate : 0.9669
P-Value [Acc > NIR] : 1
Kappa : 0.1729
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.8191
Specificity : 0.7761
Pos Pred Value : 0.9907
Neg Pred Value : 0.1281
Prevalence : 0.9669
Detection Rate : 0.7920
Detection Prevalence : 0.7994
Balanced Accuracy : 0.7976
'Positive' Class : 0
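To check that I am reading the output correctly, the test-set figures above can be recomputed by hand from the confusion matrix counts. This is only a minimal sketch in base R: the counts are copied from the output above, and the variable names (`cm`, `fraud_recall`, `fraud_precision`, `fraud_prevalence`) are my own labels, not anything returned by caret.

```r
# Test-set confusion matrix, copied from the caret output above.
# Rows are predictions, columns are the reference (true) classes.
cm <- matrix(
  c(59722, 13188,   # true class 0: predicted 0, predicted 1
      559,  1938),  # true class 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Overall accuracy: correct predictions over all test cases
accuracy <- sum(diag(cm)) / sum(cm)

# Share of true fraud cases (class 1) that were predicted as fraud
fraud_recall <- cm["1", "1"] / sum(cm[, "1"])

# Share of predicted fraud cases that really are fraud
fraud_precision <- cm["1", "1"] / sum(cm["1", ])

# Share of fraud cases in the (unbalanced) test set
fraud_prevalence <- sum(cm[, "1"]) / sum(cm)

print(round(c(accuracy = accuracy,
              fraud_recall = fraud_recall,
              fraud_precision = fraud_precision,
              fraud_prevalence = fraud_prevalence), 4))
```

Because caret treats class 0 as the 'Positive' class here, `fraud_recall` reproduces the reported Specificity (0.7761) and `fraud_precision` reproduces the reported Neg Pred Value (0.1281). If that reading is right, the 13% is measured over all cases predicted as fraud in a test set that is still unbalanced, while the 70%+ from training is measured over true fraud cases in balanced data.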