
Is it possible to use XGBoost for multi-label classification? Right now I use OneVsRestClassifier over GradientBoostingClassifier from sklearn. It works, but it uses only one core of my CPU. My data has ~45 features, and the task is to predict about 20 columns of binary (boolean) data. The metric is mean average precision (MAP@7). If you have a short code example to share, that would be great.

user3318023

3 Answers


One possible approach, instead of using OneVsRestClassifier, which is intended for multi-class tasks, is to use MultiOutputClassifier from the sklearn.multioutput module.

Below is a small reproducible example with the number of input features and target outputs requested by the OP:

import xgboost as xgb
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score

# create sample dataset
X, y = make_multilabel_classification(n_samples=3000, n_features=45, n_classes=20, n_labels=1,
                                      allow_unlabeled=False, random_state=42)

# split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# create XGBoost instance with default hyper-parameters
xgb_estimator = xgb.XGBClassifier(objective='binary:logistic')

# create MultiOutputClassifier instance with XGBoost model inside
multilabel_model = MultiOutputClassifier(xgb_estimator)

# fit the model
multilabel_model.fit(X_train, y_train)

# evaluate on test data (for multilabel targets, accuracy_score computes subset accuracy,
# i.e. the fraction of samples whose full 20-label vector is predicted exactly)
print('Accuracy on test data: {:.1f}%'.format(accuracy_score(y_test, multilabel_model.predict(X_test))*100))
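Since the question mentions CPU usage: one possible tweak, not in the snippet above, is to parallelize training. MultiOutputClassifier accepts an n_jobs argument to fit the per-target models in parallel, and XGBClassifier has its own n_jobs for multi-threaded tree building; how to split cores between the two is a judgment call:

# assumption beyond the original snippet: parallelize across the 20 targets
# and/or within each XGBoost model
xgb_estimator = xgb.XGBClassifier(objective='binary:logistic', n_jobs=2)  # threads per model
multilabel_model = MultiOutputClassifier(xgb_estimator, n_jobs=-1)        # parallel fits across targets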
Ric S
  • Sir, what if the data fed to the XGBoost model contains missing values? I know XGBoost handles missing values well in binary classification, but a preliminary try shows that it does not work for multilabel classification when the data contains missing values. – user3737702 Aug 19 '21 at 13:55
  • I'm afraid I cannot help you specifically. However, I suggest imputing missing values before training the model. – Ric S Aug 19 '21 at 17:27
  • Wow, this is so cool. I had no idea sklearn.multioutput.MultiOutputClassifier exists and can wrap an arbitrary classifier like XGBoost. That's so helpful. – Max Power Jun 15 '22 at 18:41

There are a couple of ways to do that, one of which is the one you already suggested:

1.

from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier
# If you want to avoid the OneVsRestClassifier magic switch
# from sklearn.multioutput import MultiOutputClassifier

clf_multilabel = OneVsRestClassifier(XGBClassifier(**params))

clf_multilabel will fit one binary classifier per class, and it will use as many cores as you specify in params (FYI, you can also set n_jobs in OneVsRestClassifier, but that uses more memory).

2. If you first massage your data a little by making k copies of every data point that has k correct labels, you can hack your way to a simpler multiclass problem (a sketch of this duplication step follows after the code). At that point, just

clf = XGBClassifier(**params)
clf.fit(train_data, train_labels)   # one integer class id per duplicated row
pred_proba = clf.predict_proba(test_data)

to get classification margins/probabilities for each class and decide what threshold you want for predicting a label. Note that this solution is not exact: if a product has tags (1, 2, 3), you artificially introduce two negative samples for each class.
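A minimal sketch of the duplication step, assuming X_train is your feature matrix and y_train a binary indicator matrix as in the other answer (expand_to_multiclass is a hypothetical helper, not part of any library):

import numpy as np

# hypothetical helper: make one copy of each sample per positive label
def expand_to_multiclass(X, Y):
    rows, classes = np.nonzero(Y)    # one (sample, class) pair per positive label
    return X[rows], classes          # duplicated feature rows, one integer class id each

train_data, train_labels = expand_to_multiclass(X_train, y_train)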

marco_ccc
  • It should be noted that this can be very expensive to train if you have many labels, as it trains one model per label. I'm using `xgboost.XGBClassifier` as the underlying classifier, and training with 5-fold cross-validation takes hours. @marco_ccc Thanks for this, it was very helpful. – rjurney Jul 06 '19 at 03:30

You can add an extra input column that tells the model which output it should predict. For example, if this is your data:

X1 X2 X3 X4  Y1 Y2 Y3
 1  3  4  6   7  8  9
 2  5  5  5   5  3  2

You can simply reshape your data by adding a label column to the input, according to the output, and XGBoost should learn how to treat it accordingly, like so:

X1 X2 X3 X4 X_label Y
 1  3  4  6   1     7
 2  5  5  5   1     5
 1  3  4  6   2     8
 2  5  5  5   2     3
 1  3  4  6   3     9
 2  5  5  5   3     2

This way you will have a one-dimensional Y, but you can still predict many labels; a short sketch of this reshape follows below.
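A short sketch of this reshape with pandas, using the toy data above (melt does the stacking; the X_label values are derived from the Y column names):

import pandas as pd

# the toy data from the tables above
df = pd.DataFrame({'X1': [1, 2], 'X2': [3, 5], 'X3': [4, 5], 'X4': [6, 5],
                   'Y1': [7, 5], 'Y2': [8, 3], 'Y3': [9, 2]})

# stack the Y columns into rows, keeping the features and adding an X_label column
long = df.melt(id_vars=['X1', 'X2', 'X3', 'X4'], var_name='X_label', value_name='Y')
long['X_label'] = long['X_label'].str.lstrip('Y').astype(int)  # 'Y1' -> 1, 'Y2' -> 2, ...
print(long)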

Binyamin Even
  • Is the idea here that at test time you would create a duplicate of each input row for every X_label value, and take each prediction as the label corresponding to that X_label? – B_Miner Apr 19 '19 at 16:07