
Is it possible to use XGBoost for multi-label classification? Right now I use OneVsRestClassifier over GradientBoostingClassifier from sklearn. It works, but it uses only one core of my CPU. My data has ~45 features, and the task is to predict about 20 columns of binary (boolean) data. The metric is mean average precision (MAP@7). If you have a short code example to share, that would be great.

user3318023

3 Answers


One possible approach, instead of using OneVsRestClassifier, which is intended for multi-class tasks, is to use MultiOutputClassifier from the sklearn.multioutput module.

Below is a small reproducible example with the number of input features and target outputs requested by the OP:

import xgboost as xgb
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score

# create sample dataset
X, y = make_multilabel_classification(n_samples=3000, n_features=45, n_classes=20, n_labels=1,
                                      allow_unlabeled=False, random_state=42)

# split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# create XGBoost instance with default hyper-parameters
xgb_estimator = xgb.XGBClassifier(objective='binary:logistic')

# create MultiOutputClassifier instance with XGBoost model inside
multilabel_model = MultiOutputClassifier(xgb_estimator)

# fit the model
multilabel_model.fit(X_train, y_train)

# evaluate on test data (for multilabel targets, accuracy_score computes subset accuracy,
# i.e. the fraction of samples whose full 20-label vector is predicted exactly)
print('Accuracy on test data: {:.1f}%'.format(accuracy_score(y_test, multilabel_model.predict(X_test))*100))
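Since the question mentions CPU usage: one possible tweak, not in the snippet above, is to parallelize training. MultiOutputClassifier accepts an n_jobs argument to fit the per-target models in parallel, and XGBClassifier has its own n_jobs for multi-threaded tree building; how to split cores between the two is a judgment call:

# assumption beyond the original snippet: parallelize across the 20 targets
# and/or within each XGBoost model
xgb_estimator = xgb.XGBClassifier(objective='binary:logistic', n_jobs=2)  # threads per model
multilabel_model = MultiOutputClassifier(xgb_estimator, n_jobs=-1)        # parallel fits across targets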
Ric S
  • Sir, what if the data fed to the XGBoost model contains missing values? I know XGBoost handles missing values well in binary classification, but a preliminary try shows that it does not work for multilabel classification when the data contains missing values. – user3737702 Aug 19 '21 at 13:55
  • I'm afraid I cannot help you specifically. However, I suggest imputing missing values before training the model. – Ric S Aug 19 '21 at 17:27
  • Wow, this is so cool. I had no idea sklearn.multioutput.MultiOutputClassifier exists and can wrap an arbitrary classifier like XGBoost. That's so helpful. – Max Power Jun 15 '22 at 18:41

There are a couple of ways to do that, one of which is the one you already suggested:

1.

from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier
# If you want to avoid the OneVsRestClassifier magic switch
# from sklearn.multioutput import MultiOutputClassifier

clf_multilabel = OneVsRestClassifier(XGBClassifier(**params))

clf_multilabel will fit one binary classifier per class, and it will use as many cores as you specify in params (FYI, you can also set n_jobs in OneVsRestClassifier, but that uses more memory).

2. If you first massage your data a little by making k copies of every data point that has k correct labels, you can hack your way to a simpler multiclass problem (a sketch of this duplication step follows after the code). At that point, just

clf = XGBClassifier(**params)
clf.fit(train_data, train_labels)   # one integer class id per duplicated row
pred_proba = clf.predict_proba(test_data)

to get classification margins/probabilities for each class and decide what threshold you want for predicting a label. Note that this solution is not exact: if a product has tags (1, 2, 3), you artificially introduce two negative samples for each class.
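A minimal sketch of the duplication step, assuming X_train is your feature matrix and y_train a binary indicator matrix as in the other answer (expand_to_multiclass is a hypothetical helper, not part of any library):

import numpy as np

# hypothetical helper: make one copy of each sample per positive label
def expand_to_multiclass(X, Y):
    rows, classes = np.nonzero(Y)    # one (sample, class) pair per positive label
    return X[rows], classes          # duplicated feature rows, one integer class id each

train_data, train_labels = expand_to_multiclass(X_train, y_train)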

marco_ccc
  • It should be noted that this can be very expensive to train if you have many labels, as it trains one model per label. I'm using `xgboost.XGBClassifier` as the underlying classifier, and training with 5-fold cross-validation takes hours. @marco_ccc Thanks for this, it was very helpful. – rjurney Jul 06 '19 at 03:30

You can add an extra input column that tells the model which output it should predict. For example, if this is your data:

X1 X2 X3 X4  Y1 Y2 Y3
 1  3  4  6   7  8  9
 2  5  5  5   5  3  2

You can simply reshape your data by adding a label column to the input, according to the output, and XGBoost should learn how to treat it accordingly, like so:

X1 X2 X3 X4 X_label Y
 1  3  4  6   1     7
 2  5  5  5   1     5
 1  3  4  6   2     8
 2  5  5  5   2     3
 1  3  4  6   3     9
 2  5  5  5   3     2

This way you will have a one-dimensional Y, but you can still predict many labels; a short sketch of this reshape follows below.
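A short sketch of this reshape with pandas, using the toy data above (melt does the stacking; the X_label values are derived from the Y column names):

import pandas as pd

# the toy data from the tables above
df = pd.DataFrame({'X1': [1, 2], 'X2': [3, 5], 'X3': [4, 5], 'X4': [6, 5],
                   'Y1': [7, 5], 'Y2': [8, 3], 'Y3': [9, 2]})

# stack the Y columns into rows, keeping the features and adding an X_label column
long = df.melt(id_vars=['X1', 'X2', 'X3', 'X4'], var_name='X_label', value_name='Y')
long['X_label'] = long['X_label'].str.lstrip('Y').astype(int)  # 'Y1' -> 1, 'Y2' -> 2, ...
print(long)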

Binyamin Even
  • Is the idea here that at test time you would create a duplicate of each input row for every X_label value, and take each prediction as the label corresponding to that X_label? – B_Miner Apr 19 '19 at 16:07