dealing with imbalanced data after encoding for classification

Question

I have data of dimension (13961, 48) initially; after one-hot encoding and some basic clean-up, the dimension is around (13961, 862). The data is imbalanced, with two categories: 'Retained' at around 6% and 'Not Retained' at around 94%.

When I run any algorithm such as logistic regression, KNN, decision tree, or random forest, I get very high accuracy, mostly above 94%, even without any feature selection. The only exception is the naive Bayes classifier.

This seems odd: even using any two randomly chosen features gives accuracy above 94%, which seems unrealistic.

Applying SMOTE also gives more than 94% accuracy, even for baseline models of any of the algorithms mentioned above (logistic regression, KNN, decision tree, random forest).

Even after removing the top 20 features, accuracy stays above 94% (I checked this to test whether the results were genuine).

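import pandas as pd

# `data` is the full DataFrame; `Target_col_Y_name` holds the target column name
g = data[Target_col_Y_name]
# Absolute counts and percentages per class, side by side
df = pd.concat([g.value_counts(),
                g.value_counts(normalize=True).mul(100)],
               axis=1, keys=('counts', 'percentage'))

print('The % distribution between the retention and non-retention flag\n')
print(df)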

The code output showing the imbalance is:

 The % distribution between the retention and non-retention flag

              counts  percentage
Non Retained   13105   93.868634
Retained         856    6.131366

My data has 7 numerical variables, such as month, amount, and interest rate; all the others (around 855) are one-hot-encoded categorical variables.

Is there any methodology for handling this kind of data, in terms of baselines, feature selection, or imbalance optimization techniques? Please advise, taking into account the dimensionality and the count for each class.

Accuracy is not an appropriate evaluation metric for (highly) imbalanced datasets because the results are determined by the majority class while the minority class is ignored. In your case, think about the constant classifier that predicts the `Non Retained` class for any input. What is its accuracy? – Georgios Douzas, Aug 21 '19 at 12:55

deonardo_licaprio · Accepted Answer · 2022-06-24T07:23:47.363

I would like to add something to Elias's answer.

Firstly, you have to understand that even if you created a "dumb classifier" that always predicts "not retained", you'd still be correct 94% of the time. So accuracy is clearly a weak metric in this case.
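The baseline above can be checked directly from the class counts in the question. This is a minimal sketch using scikit-learn's `DummyClassifier`; the all-zero feature matrix is a placeholder assumption, since a constant classifier never looks at the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Class counts from the question: 13105 "Non Retained" (0), 856 "Retained" (1).
y = np.array([0] * 13105 + [1] * 856)
# Placeholder features (assumption): the constant classifier ignores them.
X = np.zeros((len(y), 1))

# Always predict the most frequent class, i.e. "Non Retained".
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
minority_recall = recall_score(y, y_pred, pos_label=1)

print(f"accuracy: {baseline_accuracy:.4f}")
print(f"recall on 'Retained': {minority_recall:.4f}")
```

Any real model should be compared against this baseline rather than against 0%: the accuracy is already about 94% while not a single 'Retained' case is found.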

You should definitely learn about the confusion matrix and the metrics built on it (precision and recall, and, across all decision thresholds, AUC).

One of these metrics is the F1 score, which is the harmonic mean of precision and recall. It is better than accuracy in an imbalanced class setting, but it isn't necessarily the best choice. F1 favors classifiers that have similar precision and recall, and that is not necessarily what matters to you.
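To make the relation between these metrics concrete, here is a small sketch that computes them from the four cells of a confusion matrix. The cell counts are hypothetical, chosen only so that the class totals match the question (856 'Retained', 13105 'Non Retained'); they are not results from any real model.

```python
# Hypothetical confusion-matrix cells (assumption), with "Retained" as the positive class.
tp, fn = 300, 556      # 856 actual positives in total
fp, tn = 200, 12905    # 13105 actual negatives in total

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of everything flagged "Retained", how much was right
recall = tp / (tp + fn)      # of all real "Retained" cases, how many were found

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these made-up counts the accuracy still looks excellent, while recall and F1 show that most 'Retained' cases are missed.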

For instance, if you were building an SFW content filter, you would be OK with labeling some SFW content as NSFW (the negative class), which would increase the false negative rate (and decrease recall), but you would want to be sure that you kept only safe content (high precision).

In your case, you can reason about which is worse, retaining something bad or abandoning something good, and pick the metric accordingly.

As far as strategy is concerned, there are plenty of ways to handle class imbalance. Try sampling techniques (down-sampling and up-sampling, besides SMOTE or ROSE) and check whether your validation score improves (training metrics alone are almost useless). Just remember to apply sampling/augmentation techniques after the train-validation split, and only to the training part.
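The order of operations matters, so here is a rough sketch of "split first, then resample". It uses plain random up-sampling from scikit-learn rather than SMOTE or ROSE, and the data is a synthetic stand-in (an assumption), not the dataset from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data (assumption): 1000 rows, 6% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 60 + [0] * 940)

# 1) Split first, stratified so both parts keep the original class ratio.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2) Up-sample the minority class in the training part only.
X_maj, X_min = X_train[y_train == 0], X_train[y_train == 1]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_train_bal = np.vstack([X_maj, X_min_up])
y_train_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                              np.ones(len(X_min_up), dtype=int)])

# The validation set is left untouched, so its score reflects the real imbalance.
print("train class counts:", np.bincount(y_train_bal))
print("validation class counts:", np.bincount(y_val))
```

Resampling before the split would let copies of the same minority rows land in both parts, which inflates the validation score.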

Moreover, some models have special hyperparameters to focus more on the rare class (for instance, xgboost has the scale_pos_weight parameter). In my experience, tuning this hyperparameter is far more effective than SMOTE.
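A common starting value for scale_pos_weight is the ratio of negative to positive examples, which can then be tuned on the validation set. The sketch below only computes that ratio from the class counts in the question, and shows the analogous "balanced" class weights in scikit-learn; it does not import xgboost, so passing the value to a model is left as a comment.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts from the question: 13105 "Non Retained" (0), 856 "Retained" (1).
y = np.array([0] * 13105 + [1] * 856)

n_neg, n_pos = np.bincount(y)
# Starting point for xgboost's scale_pos_weight: negatives / positives.
scale_pos_weight = n_neg / n_pos
# With xgboost installed, this could be passed as
# xgboost.XGBClassifier(scale_pos_weight=scale_pos_weight).

# scikit-learn analogue: per-class weights used by class_weight="balanced".
class_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

print(f"scale_pos_weight starting point: {scale_pos_weight:.2f}")
print("balanced class weights:", class_weights)
```

The two views agree: the ratio of the balanced weights equals the negatives-to-positives ratio.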

Good luck

deonardo_licaprio, thanks for the detailed explanation. – Ayyasamy, Aug 26 '19 at 03:17

score 2 · Answer 2 · answered Aug 24 '19 at 00:02

Accuracy is not a very good measure in general, particularly for imbalanced classes. I would recommend this other Stack Overflow answer, which explains when to use the F1 score and when to use AUROC; both are far better measures than accuracy, and in this case F1 is better.

A few points, just to clear things up:

For models such as random forest, you should not have to remove features to improve accuracy, as the model will simply treat them as insignificant. I recommend random forests, as they tend to be very accurate (except in some cases) and can show which features are significant via clf.feature_importances_ (if using the scikit-learn random forest).
Decision trees will almost always perform worse than random forests, as random forests aggregate many decision trees.
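As a rough illustration of reading feature importances from a fitted forest: in scikit-learn the attribute is `feature_importances_` on `RandomForestClassifier`. The data here is synthetic (an assumption), built so that only the first column carries any signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 4 features, only column 0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 1.0).astype(int)   # minority class is roughly 16% of rows

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
).fit(X, y)

# Impurity-based importances, one value per feature, summing to 1.
importances = clf.feature_importances_
ranking = np.argsort(importances)[::-1]

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

With 862 mostly one-hot columns, sorting the importances this way gives a quick view of which columns the forest actually relies on.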
