
Problem

I would like to use a validation dataset for early stopping while doing multi-label classification, but it seems that sklearn's MultiOutputClassifier doesn't support that. Do you have any suggestions for a solution?

What I have done

import numpy
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier

# Creating some multi-label data
X_train = numpy.array([[1,2,3],[4,5,6],[7,8,9]])
X_valid = numpy.array([[2,3,7],[3,4,9],[7,8,7]])
Y_train = numpy.array([[1,0],[0,1],[1,1]])
Y_valid = numpy.array([[0,1],[1,1],[0,0]])

# Creating a multi-label xgboost
xgb = XGBClassifier(n_estimators=500, random_state=0, learning_rate=0.05, eval_metric='logloss')
xgb_ml = MultiOutputClassifier(xgb)

# Training the model
xgb_ml.fit(X_train, Y_train)

Everything works as expected up to here!

Now I would like to use a validation set to do some early stopping. I use the same parameters one would use for a normal single-label xgboost model.

# Training model using an evaluation dataset
xgb_ml.fit(X_train, Y_train, eval_set=[(X_train, Y_train), (X_valid, Y_valid)], early_stopping_rounds=5)
>ValueError: y should be a 1d array, got an array of shape (3, 2) instead.

It seems that the eval_set parameter does not pick up that the model now needs to be evaluated during training on a multi-label dataset. Is this not supported? Or am I doing something wrong?

Esben Eickhardt
  • It's complaining about the shape of `y`. What do `Y_train` and `Y_valid` look like here? – Harpal Jun 08 '21 at 13:35
  • The approach is erroneous. `XGBClassifier` does not support multi-label classification out of the box, which I believe is why you wrapped it in a `MultiOutputClassifier` to begin with. However, you are passing the multi-label targets down to the `XGBClassifier` with the `eval_set` parameter. It won't work that way. – afsharov Jun 08 '21 at 13:54
  • @afsharov in the documentation it says: "Multilabel classification support can be added to any classifier with MultiOutputClassifier". This is because the approach simply makes one copy of the chosen classifier per label and trains an independent model for each. That is also why I expected it to work with sub-parameters, but apparently the support is limited, even though xgboost is one of the most popular algorithms. Here is a link to the documentation: https://scikit-learn.org/stable/modules/multiclass.html#multilabel-classification – Esben Eickhardt Jun 08 '21 at 14:13
  • This is exactly why it does not work. `MultiOutputClassifier` simply trains copies of the underlying estimator for each label separately, so each copy of `XGBClassifier` is of course still unable to handle multi-label outputs by itself. However, you are passing the target arrays directly to them as a fit parameter in `eval_set`. – afsharov Jun 08 '21 at 15:33

1 Answer


@afsharov identified the issue in a comment: sklearn doesn't know anything about `fit_params`; it just passes them along unchanged to the individual single-output models.

MultiOutputClassifier doesn't do very much, so it wouldn't be a big deal to simply loop through the targets, fit an xgboost model per label, and save the models in a list. The main cost would seem to be the loss of parallelization, but you could handle that yourself as well.

If you really want everything wrapped up in a class, I think deriving from MultiOutputClassifier and overriding the fit method should be enough. You'd copy most of the original fit method (the `classes_` attribute setting and most of the parent class `_MultiOutputEstimator`'s fit method), but split the second element of each `eval_set` pair into its columns so that each parallel fit receives only its own label. Something along the lines of changing:

# current code
        fit_params_validated = _check_fit_params(X, fit_params)

        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_estimator)(
                self.estimator, X, y[:, i], sample_weight,
                **fit_params_validated)
            for i in range(y.shape[1]))

(source) to

        fit_params_validated = _check_fit_params(X, fit_params)
        eval_set = fit_params_validated.pop("eval_set", [(X, y)])

        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_estimator)(
                self.estimator, X, y[:, i], sample_weight,
                # slice label column i out of every evaluation pair
                eval_set=[(X_e, Y_e[:, i]) for X_e, Y_e in eval_set],
                **fit_params_validated)
            for i in range(y.shape[1]))
Ben Reiniger
  • Thanks Ben, it seems the MultiOutputClassifier is not really worthwhile, and that I am better off just building my own. When I read the documentation, I thought it would be a convenient wrapper, but it turned out to be more of a hassle with the more complex algorithms. It was fine for models that do not rely on early stopping. – Esben Eickhardt Jun 22 '21 at 08:20
  • @EsbenEickhardt did you find any drop-in solution to this? I'm also considering building one myself, but was wondering if you have any advice. – Johan Dettmar Mar 18 '22 at 17:14
  • @EsbenEickhardt anything modular/generic enough to share as an answer? – Johan Dettmar Mar 21 '22 at 14:51
  • I didn't write a library or anything generic. I just did a one-vs-all classification with classic early stopping for each class, resulting in one model per class. – Esben Eickhardt Mar 22 '22 at 08:59