
I'm trying to build a multiclass classification model in Python using XGBoost with OvR (OneVsRest), like below:

from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(abt.drop("TARGET", axis=1),
                                                    abt["TARGET"],
                                                    train_size=0.70,
                                                    test_size=0.30,
                                                    random_state=123,
                                                    stratify=abt["TARGET"])

model_1 = OneVsRestClassifier(XGBClassifier())
model_1.fit(X_train, y_train)

When I use the above code I get HUGE overfitting: AUC_TRAIN: 0.9988, AUC_TEST: 0.7650.

So, I decided to use class_weight.compute_class_weight:

import numpy as np
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight('balanced',
                                                  classes=np.unique(y_train),
                                                  y=y_train)

model_1.fit(X_train, y_train, class_weight=class_weights)  # raises the TypeError below

roc_auc_score(y_train, model_1.predict_proba(X_train), multi_class='ovr')

roc_auc_score(y_test, model_1.predict_proba(X_test), multi_class='ovr')

Nevertheless, when I try to use class_weight.compute_class_weight as above, I get the following error: TypeError: fit() got an unexpected keyword argument 'class_weight'

How can I fix that? Or maybe you have some other idea of how to avoid such HUGE overfitting in my multiclass classification model in Python?

dingaro

1 Answer


The issue in your case seems to be that OneVsRestClassifier's fit() does not support a class_weight keyword argument (see the doc).

A way around this would be to use the scale_pos_weight parameter (a float) in the XGBClassifier definition; it adjusts the balance between positive and negative weights in each binary sub-problem that OneVsRest fits.

model_1 = OneVsRestClassifier(XGBClassifier(scale_pos_weight=1))

This will force the balancing of positive and negative weights.

scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.
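
Note that scale_pos_weight is a single float, so in the OvR setting every per-class binary model gets the same value. As a rough sketch (not tested on your data), the heuristic from the XGBoost docs is sum(negative instances) / sum(positive instances); below it is computed for the rarest class purely as an illustration, assuming integer-encoded labels:

import numpy as np

# Heuristic from the XGBoost docs: sum(negative) / sum(positive).
# OvR trains one binary model per class but shares this single float
# across all of them, so any one ratio is only a compromise; the
# rarest class is used here purely as an illustration.
counts = np.bincount(y_train)  # assumes integer-encoded labels
ratio = (len(y_train) - counts.min()) / counts.min()
model_balanced = OneVsRestClassifier(XGBClassifier(scale_pos_weight=ratio))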


See also the doc: https://xgboost.readthedocs.io/en/stable/python/python_api.html
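
Alternatively, if you really want "balanced"-style weights, here is a sketch that sidesteps OneVsRestClassifier entirely: XGBClassifier handles multiclass natively, and its fit() does accept per-sample weights, which scikit-learn can derive from the class distribution:

from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# One weight per training row, inversely proportional to class frequency.
sample_weights = compute_sample_weight('balanced', y_train)

# XGBClassifier picks the multi:softprob objective for multiclass on its own.
model_direct = XGBClassifier()
model_direct.fit(X_train, y_train, sample_weight=sample_weights)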

seralouk
  • @seralouk I used your code, but I get an error like the following: XGBoostError: Invalid Parameter format for scale_pos_weight expect float but value='balanced'. What can I do? – dingaro Feb 14 '23 at 10:03
  • See my edit; scale_pos_weight needs a float, not a string. – seralouk Feb 14 '23 at 10:06
  • But I still have a really overfitted model, 0.992 on train and 0.76 on test. Do you have any ideas? – dingaro Feb 14 '23 at 10:08
  • @dingaro I provided a solution to your technical problem; this is a very broad question. You can first try to reduce the `n_estimators`, and then maybe also the `max_depth` and `max_leaves` (see the sketch after this thread). 0.92 in training and 0.76 in test is not such a strong indicator of overfitting; you need to look at the training/test curves to determine this. Also, do not use accuracy as the metric; balanced accuracy or the F1 score may be better. – seralouk Feb 14 '23 at 10:19
  • OK, you are right, seralouk. I gave you the best answer, and I will fight the overfitting on my own. Thank you very much! :) – dingaro Feb 14 '23 at 10:25
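
For completeness, a sketch of the kind of capacity reduction suggested in the comment thread above (the values are illustrative starting points, not tuned for this data):

# Illustrative, untuned values: shallower trees, a smaller learning rate
# and row/column subsampling all limit how much the model can memorize;
# shrink n_estimators as well if the train/test gap persists.
model_reg = OneVsRestClassifier(XGBClassifier(max_depth=3,
                                              learning_rate=0.05,
                                              subsample=0.8,
                                              colsample_bytree=0.8))
model_reg.fit(X_train, y_train)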