
I am trying to build a classifier with LightGBM on a very imbalanced dataset. The imbalance is in the ratio 97:3, i.e.:

Class

0    0.970691
1    0.029309

The parameters I used and the code for training are shown below.

lgb_params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric':'auc',
        'learning_rate': 0.1,
        'is_unbalance': 'true',  # because the training data is unbalanced (replaced with scale_pos_weight)
        'num_leaves': 31,  # we should let it be smaller than 2^(max_depth)
        'max_depth': 6, # -1 means no limit
        'subsample' : 0.78
    }

# Cross-validate
cv_results = lgb.cv(lgb_params, dtrain, num_boost_round=1500, nfold=10, 
                    verbose_eval=10, early_stopping_rounds=40)

nround = cv_results['auc-mean'].index(np.max(cv_results['auc-mean']))
print(nround)

model = lgb.train(lgb_params, dtrain, num_boost_round=nround)


preds = model.predict(test_feats)

preds = [1 if x >= 0.5 else 0 for x in preds]

I ran CV to get the best model and the best round. I got 0.994 AUC on CV and a similar score on the validation set.

But when I predict on the test set I am getting very bad results. I am sure that the train set is sampled properly.

Which parameters need to be tuned? What is the reason for the problem? Should I resample the dataset so that the majority class is reduced?


1 Answer


The issue is that, despite the extreme class imbalance in your dataset, you are still using the "default" threshold of 0.5 when deciding the final hard classification in

preds = [1 if x >= 0.5 else 0 for x in preds]

This should not be the case here.

This is a rather big topic, and I strongly suggest you do your own research (try googling for terms like threshold or cut-off probability for imbalanced data), but here are some pointers to get you started...

From a relevant answer at Cross Validated (emphasis added):

Don't forget that you should be thresholding intelligently to make predictions. It is not always best to predict 1 when the model probability is greater than 0.5. Another threshold may be better. To this end you should look into the Receiver Operating Characteristic (ROC) curves of your classifier, not just its predictive success with a default probability threshold.
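
For illustration, here is a minimal sketch (not from the linked answer) of picking a threshold from the ROC curve with scikit-learn; y_val and val_preds are assumed names for held-out validation labels and the corresponding predicted probabilities from your model:

import numpy as np
from sklearn.metrics import roc_curve

# y_val:     true 0/1 labels of a held-out validation set (assumed to exist)
# val_preds: predicted probabilities for that set, e.g. model.predict(val_feats)
fpr, tpr, thresholds = roc_curve(y_val, val_preds)

# One common heuristic: maximise Youden's J statistic (TPR - FPR)
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]
print(best_threshold, tpr[best_idx], fpr[best_idx])

# Use that threshold instead of 0.5 for the hard classification
preds = [1 if x >= best_threshold else 0 for x in model.predict(test_feats)]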

From a relevant academic paper, Finding the Best Classification Threshold in Imbalanced Classification:

2.2. How to set the classification threshold for the testing set

Prediction results are ultimately determined according to prediction probabilities. The threshold is typically set to 0.5. If the prediction probability exceeds 0.5, the sample is predicted to be positive; otherwise, negative. However, 0.5 is not ideal for some cases, particularly for imbalanced datasets.
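
In the same spirit, a common practical recipe (a rough sketch, not taken from the paper) is to scan candidate thresholds on a validation set, keep the one that maximises the metric you actually care about (F1 here), and then apply that fixed threshold to the test set; again, y_val and val_preds are assumed validation labels and probabilities:

import numpy as np
from sklearn.metrics import f1_score

# val_preds is assumed to be a NumPy array of predicted probabilities
candidates = np.linspace(0.01, 0.99, 99)
scores = [f1_score(y_val, (val_preds >= t).astype(int)) for t in candidates]

best_t = candidates[int(np.argmax(scores))]
print(best_t)

# The threshold is chosen on validation data only, then reused on the test set
test_preds = [1 if x >= best_t else 0 for x in model.predict(test_feats)]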

The post Optimizing Probability Thresholds for Class Imbalances from the (highly recommended) Applied Predictive Modeling blog is also relevant.

Take-home lesson from all the above: AUC is seldom enough, but the ROC curve itself is often your best friend...


On a more general level regarding the role of the threshold itself in the classification process (which, according to my experience at least, many practitioners get wrong), check also the Classification probability threshold thread (and the provided links) at Cross Validated; key point:

the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
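
To make that decision component concrete, here is a toy sketch (with made-up costs, purely illustrative) of deriving a threshold from misclassification costs instead of defaulting to 0.5:

# Hypothetical costs of the two kinds of errors (assumed values)
cost_fp = 1.0    # cost of flagging a negative case as positive
cost_fn = 20.0   # cost of missing a true positive

# Predict 1 when the expected cost of saying "0" exceeds that of saying "1":
# p * cost_fn > (1 - p) * cost_fp  =>  p > cost_fp / (cost_fp + cost_fn)
threshold = cost_fp / (cost_fp + cost_fn)
print(threshold)   # 1/21, roughly 0.048, far below the default 0.5

decisions = [1 if p >= threshold else 0 for p in model.predict(test_feats)]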

  • Thanks for the detailed answer. I will check those resources. I have also tried different thresholds before but the predictions were not that good. – Sreeram TP Jul 06 '18 at 04:24
  • @SreeramTP Sure, a 97:3 imbalance is almost by definition a tough problem, not amenable to easy or straightforward solutions. But arguably you have now learned something new, i.e. that the choice of the threshold itself is an issue... – desertnaut Jul 06 '18 at 09:26
  • Yeah, I have. I was exploring the possibilities of XGBoost on this data. It produces similar AUC on CV and on the same validation set. The test predictions are much better than the `lgb` predictions even when I keep the threshold at `0.5` – Sreeram TP Jul 06 '18 at 09:53
  • @SreeramTP yes, the classifier performance itself is an issue orthogonal to the threshold choice (i.e. it's always better to start from more accurate predictions) – desertnaut Jul 06 '18 at 10:06