XGBoost for multiclassification and imbalanced data

Question

I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.

I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight can help me?

I tried 1) computing class weights using sklearn compute_class_weight; 2) setting weights according to the relative frequency of the classes; 3) and also manually adjusting classes with extreme values to see if any change happens at all, such as {0:0.5,1:100,2:200}. But in any case, it does not help the classifier to take the minority classes into account.

Observations:

I can handle the problem in the binary case: If I make the problem a binary classification by identifying classes [1,2], then I can get the classifier work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help). But scale_pos_weight, as far as I know, works for binary classification. Is there an analogue of this parameter for the multi-classification problems?
Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tunning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.

Remark: I know about balancing techniques, such as over/undersampling, or SMOTE. But I want to avoid them as much as possible, and prefer a solutions using hyperparameter tunning of the model if possible. My observation above shows that this can work for the binary case.

First, try to balance data with "Oversampling and Undersampling" techniques, then you can classification with normal distribution on balanced data. — Pooya Chavoshi, Jun 07 '21 at 09:23
@PooyaChavoshi Thanks for the comment. I should have added that I prefer to avoid using over/undersampling techniques and approaches like SMOTE as much as possible. I've tried them however — Pooyan Moradifar, Jun 07 '21 at 09:59

Prakash Dahal · Accepted Answer · 2021-06-07T13:59:53.923

12

sample_weight parameter is useful for handling imbalanced data while using XGBoost for training the data. You can compute sample weights by using compute_sample_weight() of sklearn library.

This code should work for multiclass data:

from sklearn.utils.class_weight import compute_sample_weight
sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=train_df['class'] #provide your own target name
)

xgb_classifier.fit(X, y, sample_weight=sample_weights)

edited Jun 07 '21 at 13:59

answered Jun 07 '21 at 10:19

Prakash Dahal

4,388
2
11
25

1

Thanks for the suggestion. It did work! The key point I was missing was that the parameter `sample_weight` should be passed while fitting. – Pooyan Moradifar Jun 07 '21 at 12:10
2

What's the difference between scale_pos_weight and sample_weight? – skan Jul 15 '22 at 23:30

score 1 · Answer 2 · answered Aug 15 '22 at 12:58

You can use sample_weight as @Prakash Dahal suggested, but compute your own weights. I found that different weights made a dramatic difference (I have 12 classes and very imbalanced data). If you compute your own weights, you need to assign the relevant weight to each entry and pass the param to the classifier in the same way: xgb_class.fit(X_train, y_train, sample_weight=weights)

XGBoost for multiclassification and imbalanced data

2 Answers2