
I have built a Random Forest model to predict whether a customer is performing fraudulent operations or not. It is a large and quite unbalanced sample, with 3% fraud cases, and I want to predict the minority class (fraud).

I balance the data (50% each class) and build the RF. So far I have a good model with an overall accuracy of ~80% and over 70% of fraud cases predicted correctly. But when I try the model on unseen data (test), although the overall accuracy is still good, the negative predicted value (fraud) is really low compared to the training data (only 13% vs over 70%).

I have tried increasing the sample size, changing the balancing proportions, tuning the RF parameters, and so on, but none of it has worked well; the results stay similar. Am I overfitting somehow? What can I do to improve fraud detection (the negative predicted value) on unseen data?

Here is the code and results:

library(randomForest)
library(ROSE)   #for ovun.sample()
library(caret)  #for confusionMatrix()

set.seed(1234)

#train and test sets
train_idx <- sample(nrow(dataset), 0.7 * nrow(dataset))
train <- dataset[train_idx, ]
test <- dataset[-train_idx, ]

#balance the training data (50/50 by oversampling the minority class)
balanced <- ovun.sample(custom21_type ~ ., data = train, method = "over", p = 0.5, seed = 1)$data

table(balanced$custom21_type)

   0    1 
5813 5861

#build the RF
rf5 <- randomForest(custom21_type ~ ., data = balanced, ntree = 100, mtry = 3, importance = TRUE, keep.inbag = TRUE)
rf5

Call:
 randomForest(formula = custom21_type ~ ., data = balanced, ntree = 100,      importance = TRUE, mtry = 3, keep.inbag = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 3

        OOB estimate of  error rate: 21.47%
Confusion matrix:
     0    1 class.error
0 4713 1100   0.1892310
1 1406 4455   0.2398908

#test on unseen data
predicted <- predict(rf5, newdata=test)
confusionMatrix(predicted,test$custom21_type)
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 59722   559
         1 13188  1938

               Accuracy : 0.8177          
                 95% CI : (0.8149, 0.8204)
    No Information Rate : 0.9669          
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.1729          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.8191          
            Specificity : 0.7761          
         Pos Pred Value : 0.9907          
         Neg Pred Value : 0.1281          
             Prevalence : 0.9669          
         Detection Rate : 0.7920          
   Detection Prevalence : 0.7994          
      Balanced Accuracy : 0.7976          

       'Positive' Class : 0     
ecp
  • Perhaps have a validation set before running on test data and train the model using only the train set? It seems you're training on both the train and test sets, and using only 30% to finally "test". – NelsonGon Jan 23 '19 at 11:57
  • Thank you for the comment. I have a dataset of 251,356 rows, and created a train set with 70% and another one with 30% for test. I have changed the test set to 50% and the results remain similar. I do not understand why you say "you're training on both train and test set". Could you please be more specific? – ecp Jan 23 '19 at 12:12
  • It seems you combined the train and test set. Unless of course you split the trainset into train and validation(named test). Combination of train and test(at least in my opinion) is done to make data "pre processing" easier. If you indeed combined train and test, "unbind" them. Use train alone and test later on test. – NelsonGon Jan 23 '19 at 12:14
  • Sorry for my ignorance, I am quite new to this. Could you provide an example of how to proceed? Thanks again. – ecp Jan 23 '19 at 13:20
  • It's hard to. I could use iris but then that might not help. What is contained in `dataset`? – NelsonGon Jan 23 '19 at 13:22
  • "dataset" contains all the data: 251.356 rows, more than 20 columns (categorical variables). – ecp Jan 23 '19 at 13:28
  • Btw have you tried tuning mtry and number of trees? The discussion is getting long and I can unfortunately not help(not sure about the data). Try tuning the different parameters and/or feature engineering. – NelsonGon Jan 23 '19 at 13:29
  • Yes, I have tuned mtry, ntrees and so long, but the problem persists. – ecp Jan 23 '19 at 13:33
  • Good luck. I hope someone with more experience can quickly figure out and help. – NelsonGon Jan 23 '19 at 13:34
  • You have `ovun.sample(..., data = train2...)` but you do not define `train2` anywhere – dww Jan 25 '19 at 23:04
  • It was a typo here, does not affect the results, changed, thanks. – ecp Jan 28 '19 at 07:53
  • [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making a reproducible example that is easier to help with. You can post a sample of data that represents the issue you're having--otherwise we're just guessing – camille Jan 28 '19 at 13:08
  • Also this seems to be a "model-tuning" problem, not a coding one. Probably should be moved over to CrossValidated. https://stats.stackexchange.com/ – RLave Jan 29 '19 at 08:04
  • "but none of them have worked well, with similar results"..what have you done regarding feature engineering? Have you tried with different models (SVM, NN, XGB..) or better ensemble learning? Have you tried rebalancing with SMOTE? Have you done something to measure feature importance? Removing not useful variables helps with the overfit problem. I suggest to start learning what all of this means. Head over to kaggle.com and find a similar problem, they have kernels where they show you with code how it's done :) – RLave Jan 29 '19 at 08:07
  • I have tried rebalancing with SMOTE, rebalancing "under" and "over", and SVM approach. I removed some variables but it is a good idea to remove more of them. I will try NN, XGB and other approaches and remove variables. Also I am moving this to [stats.stackexchange.com](http://stats.stackexchange.com) – ecp Jan 29 '19 at 08:10
  • I'd start by reading here https://www.kaggle.com/mlg-ulb/creditcardfraud/kernels. But know that there simply isn't a solution that works well on two different problems, but it might get you somewhere. – RLave Jan 29 '19 at 08:13

1 Answer


First, I notice that you are not using any cross-validation. Including it will add variation to the data used for training and help reduce overfitting. Additionally, we are going to use C5.0 in place of randomForest because it is more robust and penalizes type 1 errors more heavily.

One thing you may consider is not using a 50-50 balance in the training data, but something closer to 80-20, so that the minority class is not oversampled quite so aggressively. I suspect this is contributing to the overfitting and to your model's failure to classify novel examples as fraud (the 'negative' class in your output).
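For reference, a minimal sketch of that re-balancing step, simply reusing the ovun.sample() call from the question with p = 0.2 instead of 0.5 (ROSE package, same custom21_type outcome):

library(ROSE)
balanced <- ovun.sample(custom21_type ~ ., data = train,
                        method = "over", p = 0.2, seed = 1)$data
table(balanced$custom21_type)  #the minority class should now be around 20%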

RUN THIS AFTER YOU CREATE THE RE-BALANCED DATA (p=.2)

library(caret)

#set up the cross-validation
Control <- trainControl(
  summaryFunction = twoClassSummary, #reports ROC/Sens/Spec instead of accuracy
  classProbs = TRUE,                 #required by twoClassSummary
  verboseIter = TRUE,                #prints training progress
  savePredictions = TRUE,
  method = "repeatedcv",             #repeated cross-validation, 10 folds, 3 times
  repeats = 3,
  number = 10,
  allowParallel = TRUE
)

Now, I read in the comments that all your variables are categorical. This is well suited to Naive Bayes algorithms. However, if you have any numerical data you will need to preprocess it (scale, normalize, and impute NAs) as is standard procedure. We are also going to implement a grid search over the tuning parameters.
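If some predictors do turn out to be numeric, here is a minimal preprocessing sketch; it assumes plain centering/scaling plus median imputation is enough and uses caret's standard preProcess options (the same strings can alternatively be passed to train() via its preProcess argument):

pre <- preProcess(balanced[, setdiff(names(balanced), "custom21_type")],
                  method = c("center", "scale", "medianImpute")) #non-numeric columns are left untouched
balanced_pp <- predict(pre, balanced)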

IF YOUR DATA IS ALL CATEGORICAL

model_nb <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "nb",
  trControl = Control,
  tuneGrid = data.frame(fL = c(0, 0.5, 1.0),
                        usekernel = TRUE,
                        adjust = c(0, 0.5, 1.0)))

IF YOU WOULD LIKE THE C5.0 (TREE-BASED) APPROACH (make sure to preprocess if the data is numeric)

model_C5 <- train(
  x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
  y = balanced$custom21_type,
  metric = "ROC",
  method = "C5.0",
  trControl = Control,
  tuneGrid = expand.grid(.model = "tree",
                         .trials = c(1, 5, 10),
                         .winnow = FALSE))

Now we predict

C5_predict <- predict(model_C5, test, type = "raw")
NB_predict <- predict(model_nb, test, type = "raw")
confusionMatrix(C5_predict, test$custom21_type)
confusionMatrix(NB_predict, test$custom21_type)

EDIT:

Try adjusting the cost matrix below. What this one does is penalize type 2 errors twice as heavily as type 1 errors.

library(C50)

cost_mat <- matrix(c(0, 2, 1, 0), nrow = 2)
#the row/column names must match the outcome's factor levels
rownames(cost_mat) <- colnames(cost_mat) <- levels(balanced$custom21_type)
cost_mod <- C5.0(x = balanced[, -which(colnames(balanced) %in% "custom21_type")],
                 y = balanced$custom21_type,
                 costs = cost_mat)
summary(cost_mod)
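To see whether the cost matrix actually helps on the held-out data, a small usage sketch (it assumes the same test set as above; positive = "1" just makes caret report the fraud-class metrics directly):

cost_predict <- predict(cost_mod, newdata = test)
confusionMatrix(cost_predict, test$custom21_type, positive = "1")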

EDIT 2:

predicted <- predict(rf5, newdata=test, type="prob")

will give you the predicted probability of each class for every observation. The default cut-off is .5, i.e. every observation whose probability of class 0 is above .5 gets classified as 0 and everything below as 1. You can adjust this cut-off to help with the unbalanced classes.

ifelse(predicted[, 1] < .4, 1, 0)
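To turn that cut-off into something confusionMatrix() accepts (the follow-up comments below run into exactly this), a short sketch that converts the labels back to a factor with the same levels as the outcome; the .4 threshold and the indexing of the class-"0" probability column are taken from the code above:

predicted <- predict(rf5, newdata = test, type = "prob")
rf_cutoff <- factor(ifelse(predicted[, "0"] < .4, "1", "0"),
                    levels = levels(test$custom21_type))
confusionMatrix(rf_cutoff, test$custom21_type, positive = "1")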
Maxwell Chandler
  • C5 gives more accuracy, 94.5% overall, still not enough: 25% of positive predicted values. NB does not work, gives warnings (+50) and cannot run the confusion matrix. `Error: 50: In FUN(X[[i]], ...) : Numerical 0 probability for all classes with observation 94` – ecp Jan 29 '19 at 07:38
  • Ok see edit, that might do something for you. Notice that it does not use the train objects we defined before. – Maxwell Chandler Jan 29 '19 at 19:28
  • Your proposal works, but results look more or less the same. I will take further look at the data and try other methods. – ecp Jan 30 '19 at 14:21
  • Ok. See my next edit, this should help a little bit more – Maxwell Chandler Jan 30 '19 at 17:18
  • And how can I apply it to the model? `predictedrf <- predict(rf_fit, newdata=test, type="prob") RF_predict_cutoff<-ifelse(predictedrf[,1] < .4, 1, predictedrf[,1]) confusionMatrix(RF_predict_cutoff,test$custom21_type)` just fails: "Error: `data` and `reference` should be factors with the same levels." – ecp Jan 31 '19 at 08:02
  • You are trying to do a confusion matrix on a numeric variable and a factor variable. Make sure the `ifelse` makes a categorical variable like `"1"` instead of the number 1 so it matches what's actually in `custom21_type`. I can only provide general code because I don't have your data. But what comes out of the `ifelse` has to be the same as what's in `custom21_type` – Maxwell Chandler Jan 31 '19 at 17:52
  • So if `custom21_type` is a factor with levels `0` and `1` then the `ifelse` has to produce the same type of class. I set it to produce a `numeric` 1, but only because I can't see your data. – Maxwell Chandler Jan 31 '19 at 18:00
  • if `custom21_type` is a factor with levels `correct` and `false` then the ifelse has to produce the `correct` and `false` too. Make sense? – Maxwell Chandler Jan 31 '19 at 18:05
  • Yes, I have checked and applied it, works good with slightly better results. – ecp Feb 01 '19 at 10:34