My data source: https://www.kaggle.com/mlg-ulb/creditcardfraud. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

I am using the PRROC package to get the AUC of the ROC curve. Here is my random forest code:

rf.model <- randomForest(Class ~ ., data = training, ntree = 2000, nodesize = 20)
rf_pred <- predict(rf.model, test, type = "prob")

So, as expected, rf_pred should return the probability of each class (image omitted). Then I used the following code:

fg_rf <- rf_pred[test$Class==1]
bg_rf <- rf_pred[test$Class==0]
roc_rf <- roc.curve(scores.class0 = fg_rf,scores.class1 = bg_rf,curve = T)

However, the ROC curve did not turn out as I expected (image omitted). The same problem occurred for the PR curve. Is it because of the high class imbalance? And assuming rf_pred returns the probabilities of classes 0 and 1, how can I make fg_rf hold the probability of class = 1? Is my code fg_rf <- rf_pred[test$Class==1] correct?

kyle chan
  • We could check it out if you made a reproducible example :-) What is the data? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – RobertMyles Feb 20 '18 at 18:44
  • Hi, my data has a high class imbalance, 0: 199021; 1: 345. I should balance the classes in the training data before using it to train the model, right? – kyle chan Feb 21 '18 at 00:02

1 Answer

Looking at your head(rf_pred) results, it is obvious that your predict function returns (hard) classes (i.e. 0/1), and not probability scores, probably due to your type="pro" typo (it should be type="prob").

The scores.class0 & scores.class1 arguments of the roc.curve method should be probability scores, and not hard class predictions.
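
A quick way to tell the two kinds of output apart is to inspect the shape and range of what predict returned. The following is a minimal base R sketch; hard_pred, prob_pred and the helper looks_like_probs are hypothetical names made up for illustration, standing in for real predict output:

```r
# Hypothetical stand-ins for the two kinds of predict() output
hard_pred <- factor(c(0, 0, 1, 0), levels = c(0, 1))      # hard class labels
prob_pred <- matrix(c(0.9, 0.8, 0.3, 0.6,                 # column "0"
                      0.1, 0.2, 0.7, 0.4),                # column "1"
                    ncol = 2, dimnames = list(NULL, c("0", "1")))

# Hypothetical helper: TRUE only for a numeric matrix whose entries lie
# in [0, 1] and whose rows sum to 1 (one column per class)
looks_like_probs <- function(pred) {
  is.matrix(pred) && is.numeric(pred) &&
    all(pred >= 0 & pred <= 1) &&
    all(abs(rowSums(pred) - 1) < 1e-8)
}

looks_like_probs(hard_pred)   # factor of labels, not a probability matrix
looks_like_probs(prob_pred)   # one probability column per class
```

If the check fails on the real rf_pred, head(rf_pred) and str(rf_pred) should show whether it is a factor of labels or a matrix of scores.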

Correct the typo in predict and you should be fine, but most probably you need to also switch the scores - as they are now you are assigning your class 1 points to scores.class0:

rf_pred <- predict(rf.model, test,type="prob")
fg_rf <- rf_pred[test$Class==1]
bg_rf <- rf_pred[test$Class==0]
roc_rf <- roc.curve(scores.class0 = bg_rf, scores.class1 = fg_rf, curve = T)
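
One more thing worth checking, sketched here under stated assumptions: with type="prob" the result is a matrix with one column per class, and indexing a matrix with a single logical vector recycles that vector over both columns. Each selected row then contributes both p and 1 - p, which by symmetry pushes the AUC to 0.5 whichever way the two arguments are ordered. The labels y, the scores p1 and the rank-based auc helper below are hypothetical; the helper stands in for roc.curve only so the sketch runs without any packages.

```r
set.seed(1)
n  <- 1000
y  <- rbinom(n, 1, 0.1)                                     # hypothetical 0/1 labels
p1 <- plogis(rnorm(n, mean = ifelse(y == 1, 1.5, -1.5)))    # hypothetical P(class 1)
rf_pred <- cbind("0" = 1 - p1, "1" = p1)                    # same shape as type = "prob" output

# Rank-based (Mann-Whitney) AUC: P(positive score > negative score),
# with ties counted as 1/2
auc <- function(pos, neg) {
  r <- rank(c(pos, neg))
  (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
    (length(pos) * length(neg))
}

# A single logical index on a matrix is recycled over both columns,
# so each vector mixes P(class 0) and P(class 1)
mixed_fg <- rf_pred[y == 1]
mixed_bg <- rf_pred[y == 0]
length(mixed_fg) == 2 * sum(y == 1)
auc(mixed_fg, mixed_bg)

# Selecting the class-1 column first gives one score per observation
fg <- rf_pred[y == 1, "1"]
bg <- rf_pred[y == 0, "1"]
auc(fg, bg)
```

With the real data this would correspond to fg_rf <- rf_pred[test$Class == 1, "1"] and bg_rf <- rf_pred[test$Class == 0, "1"], assuming the columns are named after the class levels. On argument order, the PRROC documentation describes scores.class0 as the scores of the positive (foreground) class, so it is worth checking the swap above against it.
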
desertnaut
  • Thank you for pointing that out, but I actually used type="prob" and still get the same result. Could you please help me check it? – kyle chan Feb 20 '18 at 23:34
  • @kylechan Your `rf_pred` certainly does not seem to contain probability scores; and did you switch `scores.class0` & `scores.class1` as I show in my snippet? Do you actually use the example from the package vignette? – desertnaut Feb 20 '18 at 23:42
  • I also tried switching bg_rf and fg_rf; the AUC is still 0.5 – kyle chan Feb 20 '18 at 23:48
  • I have a theory: my data has a high class imbalance, 0: 199021; 1: 345, so I used the SMOTE function to re-balance the classes in the training data. I should train the random forest on this balanced training data instead of the original imbalanced one, right? – kyle chan Feb 21 '18 at 00:00
  • Hi, just updated my question, maybe it can help you better understand it – kyle chan Feb 21 '18 at 00:24
  • @kylechan Class imbalance changes everything (you should have included this info in the question). Given that, your AUC is not strange - and yes, you should use the SMOTE-balanced dataset to train your classifier – desertnaut Feb 21 '18 at 09:54
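
As a rough illustration of balancing only the training data, here is a base R sketch. It uses plain random oversampling of the minority class, not SMOTE (the SMOTE call used in the thread is not shown), and the data frame df and its column x are hypothetical stand-ins for the credit card data. The test set is split off first and left at its original class ratio.

```r
set.seed(42)
# Hypothetical imbalanced data: 1960 negatives, 40 positives
df <- data.frame(x = rnorm(2000),
                 Class = factor(rep(c(0, 1), times = c(1960, 40))))

# Split first, so the test set keeps the original class ratio
test_idx <- sample(nrow(df), size = round(0.3 * nrow(df)))
training <- df[-test_idx, ]
test     <- df[test_idx, ]

# Random oversampling (not SMOTE): resample minority rows with replacement
# until both classes have the same count, in the training set only
minority <- which(training$Class == 1)
majority <- which(training$Class == 0)
extra    <- sample(minority, size = length(majority) - length(minority),
                   replace = TRUE)
training_bal <- training[c(majority, minority, extra), ]

table(training_bal$Class)   # balanced classes for fitting
table(test$Class)           # untouched, still imbalanced, for evaluation
```

The model would then be fitted on training_bal and the ROC and PR curves computed on test.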