
Is it possible to run a GridSearchCV (to find the best C for an SVM) and still specify sample_weight with scikit-learn?

Here's my code and the error I run into:

gs = GridSearchCV(
    svm.SVC(C=1),
    [{
        'kernel': ['linear'],
        'C': [.1, 1, 10],
        'probability': [True],
        'sample_weight': sw_train,
    }]
)

gs.fit(Xtrain, ytrain)

>> ValueError: Invalid parameter sample_weight for estimator SVC


Edit: I solved the issue by upgrading to the latest scikit-learn version (at the time) and using the following:

gs.fit(Xtrain, ytrain, fit_params={'sample_weight': sw_train})
– user1771485
  • If you have the answer, please post it as an answer and accept it. Otherwise the question will lie around as unanswered. – joergl Oct 24 '12 at 15:07
  • I confirm the `fit_params` trick is the right answer. Please post it as an answer to your own question and accept it. – ogrisel Feb 16 '13 at 00:06
  • @ogrisel won't this cause `fit` to be called with the entire list of weights for each fold, rather than the weights of just the datapoints in the fold? – akxlr Aug 18 '15 at 09:11
  • That's a good remark but this case is actually handled properly by the internal cross-validation routines: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L1093 – ogrisel Aug 18 '15 at 09:30
  • @ogrisel Good find, thanks. – akxlr Aug 18 '15 at 11:13

7 Answers


Just trying to close out this long-open question...

The ValueError happens because everything in param_grid is applied to the estimator via set_params(), and sample_weight is an argument of SVC's fit(), not a hyperparameter, so it has to be supplied separately.

You need to get the latest version of scikit-learn and use the following:

gs.fit(Xtrain, ytrain, fit_params={'sample_weight': sw_train})

However, it is more in line with the documentation to pass fit_params to the constructor:

gs = GridSearchCV(
    svm.SVC(C=1),
    [{'kernel': ['linear'], 'C': [.1, 1, 10], 'probability': [True]}],  # no 'sample_weight' in the grid
    fit_params={'sample_weight': sw_train},
)

gs.fit(Xtrain, ytrain)
– AN6U5
  • According to the docs `fit_params` should be passed to the constructor, not the `fit` method – akxlr Aug 18 '15 at 11:26
  • What is the shape of `sw_train`? I tried `sw_train=[{1:1}]` and it doesn't work. Any help is appreciated. – azizj Nov 27 '17 at 21:54

The previous answers are now obsolete: as of version 0.19, fit parameters should be passed directly to the fit method.

From the documentation for GridSearchCV:

fit_params : dict, optional

Parameters to pass to the fit method.

Deprecated since version 0.19: fit_params as a constructor argument was deprecated in version 0.19 and will be removed in version 0.21. Pass fit parameters to the fit method instead.
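
Concretely, in 0.19 and later the weights are passed as a keyword argument to fit itself, which forwards them (sliced per fold) to the underlying estimator. A minimal sketch, reusing Xtrain, ytrain and sw_train from the question:

from sklearn import svm
from sklearn.model_selection import GridSearchCV

gs = GridSearchCV(svm.SVC(), [{'kernel': ['linear'], 'C': [.1, 1, 10]}])
gs.fit(Xtrain, ytrain, sample_weight=sw_train)  # forwarded to SVC.fit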

– Sycorax
  • Has anyone confirmed that this works correctly with `sample_weight`? `GridSearchCV` calls the estimator's `fit()` method repeatedly with different subsets of `Xtrain` and `ytrain`. Does it use the corresponding subset of the sample weights each time? I would guess that it doesn't; it just calls `fit(..., **fit_params)` each time. This works fine for params such as `verbose` that aren't tied to particular samples. – David Wasserman Jan 23 '20 at 19:10
  • I've had a similar question about how `sample_weight` works in the past. You can see the question and answer here; I think this question is motivated by the same concern that you express here, but it's a little old. I don't know whether newer versions of `sklearn` work in a more intuitive way. https://stackoverflow.com/questions/49581104/sklearn-gridsearchcv-not-using-sample-weight-in-score-function – Sycorax Jan 23 '20 at 19:34
  • Thanks. I did some more searching, and found https://github.com/scikit-learn/scikit-learn/issues/2879, which shows that `GridSearchCV` is now programmed to pass the correct subset of `sample_weight` in each call to the estimator's `fit()` method. People are still dissatisfied because there's no simple way to use `sample_weight` for scoring in `GridSearchCV`. – David Wasserman Jan 23 '20 at 19:59
  • Ah! Interesting. It's good to know that at least there's been some progress on that front. The solutions in the thread that I linked are functional but do leave something to be desired in terms of simplicity. – Sycorax Jan 23 '20 at 20:02
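
The per-fold slicing discussed in these comments is easy to verify with a toy estimator (a minimal sketch; WeightEcho is a made-up name, and the behavior shown is that of recent scikit-learn versions):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

class WeightEcho(BaseEstimator, ClassifierMixin):
    # Dummy classifier that only reports how many weights it receives.
    def fit(self, X, y, sample_weight=None):
        print("fold size:", len(X), "- weights received:", len(sample_weight))
        return self

    def predict(self, X):
        return np.zeros(len(X), dtype=int)

X = np.random.rand(10, 2)
y = np.array([0, 1] * 5)
GridSearchCV(WeightEcho(), param_grid={}, cv=2).fit(X, y, sample_weight=np.arange(10.0))
# prints "fold size: 5 - weights received: 5" for each CV fit,
# and 10/10 once more for the final refit on the full training set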

In version 0.16.1, if you use a Pipeline, you need to pass the fit param to the GridSearchCV constructor, prefixed with the step name:

from sklearn import grid_search, pipeline

clf = pipeline.Pipeline([('svm', svm_model)])
model = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid,
                                 fit_params={'svm__sample_weight': sw_train})  # routed to the 'svm' step
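
In later releases (0.19+), where fit parameters moved to the fit call, the same step-prefixed routing would look like this (a sketch under that assumption, reusing the names from the answer above):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

clf = Pipeline([('svm', svm_model)])
model = GridSearchCV(estimator=clf, param_grid=param_grid)
model.fit(Xtrain, ytrain, svm__sample_weight=sw_train)  # routed to the 'svm' step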
– Artur Nowak
  • Here is the discussion, and I have noticed that this behavior is still there in 0.17: http://sourceforge.net/p/scikit-learn/mailman/message/34191031/ – Diego Dec 23 '15 at 04:05
  • What is the shape of `sw_train`? I tried `sw_train=[{1:1}]` and it doesn't work. Any help is appreciated. – azizj Nov 27 '17 at 21:54
  • @AzizJaved the parameter can be found in the documentation for the particular [classifier](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.fit). In this case, it is a one-dimensional array with one weight per sample. – Artur Nowak Dec 04 '17 at 10:54

The following works in scikit-learn 0.23.1:

grid_cv = GridSearchCV(clf, param_grid=param_grid,
                       scoring='recall', n_jobs=-1, cv=10)

grid_cv.fit(x_train_orig, y=y_train_orig,
            sample_weight=my_sample_weights)
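
If you don't already have a weight vector, one way to build my_sample_weights is from class frequencies (a sketch; the 'balanced' option is just one choice):

from sklearn.utils.class_weight import compute_sample_weight

my_sample_weights = compute_sample_weight(class_weight='balanced', y=y_train_orig)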
– vjp

OP's edit and the other answers are not entirely correct. While fit_params={'sample_weight': weights} works for fitting, those weights will not be used to compute the validation loss! (github issue).

Consequently, cross-validation will report an unweighted loss, so the hyper-parameter tuning might get steered in the wrong direction.

Here is my work-around for cross-validation with class weights, using accuracy as the metric. It should also work with other metrics.

from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.utils import compute_sample_weight


def weighted_accuracy_eval(y_true, y_pred, **kwargs):
    # make_scorer calls score_func(y_true, y_pred), so y_true must come first
    balanced_class_weights_eval = compute_sample_weight(
        class_weight='balanced',
        y=y_true
    )
    return accuracy_score(y_true, y_pred, sample_weight=balanced_class_weights_eval, **kwargs)


weighted_accuracy_eval_skl = make_scorer(weighted_accuracy_eval)

gridsearch = GridSearchCV(
    estimator=model,
    scoring=weighted_accuracy_eval_skl,  # the make_scorer wrapper, not the raw metric
    param_grid=paramGrid,
)

cv_result = gridsearch.fit(
    X_train,
    y_train,
    **fit_params
)
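
Here fit_params is just the dictionary of keyword arguments forwarded to the estimator's fit; for per-sample weights it could be built like this (a hypothetical construction, reusing the compute_sample_weight import above):

fit_params = {'sample_weight': compute_sample_weight(class_weight='balanced', y=y_train)}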
– Ufos
  • Could you elaborate a bit further? What are the `fit_params` supposed to be in this case? Is it `{class_weight: weights_array}`? Does this fit and evaluate with a specified array of weights? It's not clear to me that's what this code is supposed to do. – Luis Fernando Cantu Nov 10 '21 at 02:15
  • As you can see this is a highly advanced and complicated example. OP already provides an answer along with other answers saying "use `fit_params`, so my answer here is really another spin on the situation. If I were to write a proper tutorial on fitting with weights and then to explain my workaround, it would become a medium article. – Ufos Nov 10 '21 at 13:00
  • Nonetheless, `fit_params` is a dictionary of parameters that will be passed to the `.fit` method of your estimator. When using sklearn pipelines vs writing a custom loop you suddenly can't control deep behavior, e.g. the API does not provide a way to specify `class_weights` for imbalanced training. But you can work around it using `fit_params` – Ufos Nov 10 '21 at 13:02
  • So, `class_weights` in `fit_params` should comply with `estimator.fit(..., class_weights=weights_array)` -- for which see this thread https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work – Ufos Nov 10 '21 at 13:05

Great question and great answers! (Thanks @Sycorax, @AN6U5, and @user1771485.) All of them helped me a lot in finding an answer for the specific case where I needed to use sample_weight during GridSearchCV but my estimator was a Pipeline. The issue differs from the previous solutions because Pipeline does not support fit_params; indeed, if you try to pass fit_params={...} during the fit step (of GridSearchCV), you get:

Pipeline.fit does not accept the fit_param parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight)

The pipeline I was using was

pipe = Pipeline(steps=[('normalizer', norm), ('estimator', svr)])

where norm was a normalization step, svr = SVR(), and the parameter grid was

parameters_svr = dict(estimator=[svr], estimator__kernel=['rbf', 'sigmoid'], ...)

Then, as advised by @user1771485

grid = GridSearchCV(estimator=pipe, param_grid=parameters_svr, cv=3,
                    scoring='neg_mean_squared_error',
                    return_train_score=True, refit=True, n_jobs=-1)

and finally (the part that truly matters), the keyword prefix has to match the pipeline step name, 'estimator' here:

grid.fit(X, y, estimator__sample_weight=weights)

In scikit-learn version 1.1.1 you can pass sample_weight directly to the fit() of GridSearchCV.

For example:

import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

def get_weights(cls):
    # map each class label in the dataset to its weight
    class_weights = {
        0: 1,
        1: 4,
        2: 1,
    }
    return [class_weights[cl] for cl in cls]

grid = {
    "max_depth": [3, 4, 5, 6],
    "n_estimators": range(20, 70, 10),
    "learning_rate": np.arange(0.25, 0.50, 0.05),
}

xgb_clf = XGBClassifier(random_state=42, n_jobs=-1)
xgb_cvm = GridSearchCV(estimator=xgb_clf, param_grid=grid, n_jobs=-1, cv=5)
xgb_cvm.fit(X, y, sample_weight=get_weights(y))
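
As a side note, scikit-learn can build the same weight vector for you; this is equivalent to get_weights above with the same class-weight mapping:

from sklearn.utils.class_weight import compute_sample_weight

xgb_cvm.fit(X, y, sample_weight=compute_sample_weight(class_weight={0: 1, 1: 4, 2: 1}, y=y))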
– Dheemanth Bhat