
I am trying to do a hyperparameter search using scikit-learn's GridSearchCV on XGBoost. During the grid search I'd like it to early stop, since it reduces search time drastically and (I expect it to) give better results on my prediction/regression task. I am using XGBoost via its Scikit-Learn API.

    model = xgb.XGBRegressor()
    GridSearchCV(model, paramGrid, verbose=verbose, fit_params={'early_stopping_rounds': 42}, cv=TimeSeriesSplit(n_splits=cv).get_n_splits([trainX, trainY]), n_jobs=n_jobs, iid=iid).fit(trainX, trainY)

I tried to pass the early stopping parameters using fit_params, but then it throws this error, which is basically due to the lack of a validation set, which is required for early stopping:

/opt/anaconda/anaconda3/lib/python3.5/site-packages/xgboost/callback.py in callback(env=XGBoostCallbackEnv(model=<xgboost.core.Booster o...teration=4000, rank=0, evaluation_result_list=[]))
    187         else:
    188             assert env.cvfolds is not None
    189 
    190     def callback(env):
    191         """internal function"""
--> 192         score = env.evaluation_result_list[-1][1]
        score = undefined
        env.evaluation_result_list = []
    193         if len(state) == 0:
    194             init(env)
    195         best_score = state['best_score']
    196         best_iteration = state['best_iteration']

How can I apply GridSearchCV to XGBoost while using early_stopping_rounds?

Note: the model works without grid search, and GridSearchCV works without fit_params={'early_stopping_rounds': 42}.

ayyayyekokojambo

3 Answers


When using early_stopping_rounds you also have to pass eval_metric and eval_set as input parameters to the fit method. Early stopping works by calculating the error on an evaluation set. The error has to improve at least once within every early_stopping_rounds rounds, otherwise the generation of additional trees is stopped early.

See the documentation of xgboost's fit method for details.

Here is a minimal, fully working example:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX 
testY = trainY

paramGrid = {"subsample" : [0.5, 0.8]}

fit_params={"early_stopping_rounds":42, 
            "eval_metric" : "mae", 
            "eval_set" : [[testX, testY]]}

model = xgb.XGBRegressor()
# note: fit_params in the GridSearchCV constructor only works in older
# scikit-learn versions; newer versions expect them in .fit() (see the next answer).
# Also pass the TimeSeriesSplit object itself to cv; .get_n_splits() would only
# return the integer 2, and the time-series splitting would be lost.
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          fit_params=fit_params,
                          cv=TimeSeriesSplit(n_splits=cv))
gridsearch.fit(trainX, trainY)
glao
  • Thanks for the reply, it works. But giving a pre-defined eval_set is against the nature of cross-validation, I guess. – ayyayyekokojambo Mar 31 '17 at 13:14
  • I guess what you mean is that in real applications you have to make sure eval_set and train set are not overlapping or are the same as here - should have added that. I used the train set just for the sake of simplicity. Early stopping based on the train data does not prevent overfitting. – glao Mar 31 '17 at 13:25
  • @glao: the eval set should be the hold-out set of the cross-validation process to make everything work as intended (a sketch of this idea follows these comments). – Michael M Nov 23 '17 at 08:58
  • Nowadays fit_params is not recommended because it is going to be deprecated. – lbcommer Dec 11 '17 at 16:57
  • Thanks @MichaelM, and how exactly can we do that? Any help? – Vasim Mar 22 '19 at 06:14
  • @MichaelM He is right. valid_set should be a hold-out set. – user3595632 Sep 02 '19 at 23:59
  • @ayyayyekokojambo I think you are right. If we perform CV, we do not need a hold-out validation set. I think CV is designed to improve on the traditional train-validation split method. – Travis Nov 24 '19 at 13:03
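
For reference, here is a minimal sketch of the idea in @MichaelM's comment, not taken from any of the answers: loop over the TimeSeriesSplit folds yourself and let each fold's hold-out split serve as the eval_set for early stopping. It assumes an xgboost version where fit() still accepts eval_set, eval_metric, and early_stopping_rounds (as in the example above); the toy arrays and the ParameterGrid loop are purely illustrative.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import ParameterGrid, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.arange(20, dtype=float)

results = []
for params in ParameterGrid({"subsample": [0.5, 0.8]}):
    fold_scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
        model = xgb.XGBRegressor(n_estimators=1000, **params)
        # the fold's hold-out split doubles as the early-stopping eval set
        model.fit(X[train_idx], y[train_idx],
                  eval_set=[(X[test_idx], y[test_idx])],
                  eval_metric="mae",
                  early_stopping_rounds=42,
                  verbose=False)
        fold_scores.append(model.best_score)
    results.append((params, np.mean(fold_scores)))

best_params, best_mae = min(results, key=lambda r: r[1])

Note that the same hold-out split then both stops and scores each candidate, which is a compromise compared to reserving a separate validation set.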

An update to @glao's answer and a response to @Vasim's comment/question, as of sklearn 0.21.3 (note that fit_params has been moved out of the instantiation of GridSearchCV and into the fit() method; also, the import specifically pulls in the sklearn wrapper module from xgboost):

import xgboost.sklearn as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

cv = 2

trainX= [[1], [2], [3], [4], [5]]
trainY = [1, 2, 3, 4, 5]

# these are the evaluation sets
testX = trainX 
testY = trainY

paramGrid = {"subsample" : [0.5, 0.8]}

fit_params={"early_stopping_rounds":42, 
            "eval_metric" : "mae", 
            "eval_set" : [[testX, testY]]}

model = xgb.XGBRegressor()

# as in the answer above, pass the TimeSeriesSplit object itself to cv;
# .get_n_splits() would only return the integer 2
gridsearch = GridSearchCV(model, paramGrid, verbose=1,
                          cv=TimeSeriesSplit(n_splits=cv))

gridsearch.fit(trainX, trainY, **fit_params)
emigre459
  • Hi - can this be done using StratifiedKFold as well? – Sandeep Aug 18 '19 at 15:25
  • @Sandeep: yup, that's actually the default if you choose to simply specify the `cv` parameter in `GridSearchCV` as an integer (indicating how many folds you want to use). I'm afraid I'm not too familiar with the `TimeSeriesSplit` method though, so if you want to use that you should check out the docs. – emigre459 Aug 20 '19 at 15:58
  • thanks for the reply, this solution was what i had been looking for. – ayyayyekokojambo Sep 12 '19 at 10:41
  • Good idea, just one question: will xgboost use a different validation set for each CV fold to check for early stopping? – romulomadu Aug 26 '20 at 00:11
  • Is it intended that the training and evaluation sets are the same? IE, you set `testX = trainX`. – Yike Lu Mar 17 '22 at 21:46
  • @YikeLu, I think I was just being lazy by not making a set of fake other arrays for the test data :) Sorry for the confusion. – emigre459 Mar 19 '22 at 19:27
  • @emigre459 no problem, it's more the docs and behavior that are confusing. I have just run with early_stopping_rounds using the xgb.cv method and it does NOT ask for an `eval_set` (I'm assuming it just uses the CV folds), and in fact does not require entry of `eval_metric` either, it just uses the objective by default (a sketch of xgb.cv follows these comments). (Edit/reposted to remove point about return value which I figured out on my own.) – Yike Lu Mar 23 '22 at 01:58
  • I don't think that this solution works as asked in the OP. It seems to use the same validation set for early stopping, not the CV fold. – Michael M Jul 19 '23 at 04:51
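
For completeness, here is a rough sketch of the xgb.cv route mentioned by @YikeLu above, not taken from the answer itself: the native interface builds a DMatrix, and xgb.cv aggregates the evaluation metric over the folds' hold-out splits and applies early stopping to that aggregate, so no explicit eval_set is passed. The parameter values below are illustrative only.

import numpy as np
import xgboost as xgb

trainX = np.array([[1], [2], [3], [4], [5]])
trainY = np.array([1, 2, 3, 4, 5])

dtrain = xgb.DMatrix(trainX, label=trainY)

cv_results = xgb.cv(
    params={"objective": "reg:squarederror",
            "eval_metric": "mae",
            "subsample": 0.8},
    dtrain=dtrain,
    num_boost_round=2000,
    nfold=2,
    early_stopping_rounds=42,
)

# with early stopping, the last row of the returned history corresponds
# to the best iteration
print(len(cv_results), cv_results["test-mae-mean"].iloc[-1])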

Here's a solution that works in a Pipeline with GridSearchCV. The challenge occurs when you have a pipeline that is required to pre-process your training data. For example, when X is a text document and you need TfidfVectorizer to vectorize it.

Override the XGBRegressor or XGBClassifier .fit() Function

  • This step uses train_test_split() to select the specified number of validation records from X for the eval_set and then passes the remaining records along to fit().
  • A new parameter eval_test_size is added to .fit() to control the number of validation records. (see the train_test_split test_size documentation)
  • **kwargs passes along any other parameters added by the user for the XGBRegressor.fit() function.
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):
    
    def fit(self, X, y, *, eval_test_size=None, **kwargs):
        
        if eval_test_size is not None:
        
            params = super(XGBRegressor, self).get_xgb_params()
            
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])
            
            eval_set = [(X_test, y_test)]
            
            # Could add (X_train, y_train) to eval_set 
            # to get .eval_results() for both train and test
            #eval_set = [(X_train, y_train),(X_test, y_test)] 
            
            kwargs['eval_set'] = eval_set
            
            # fit on the remaining records; without this reassignment,
            # X_train / y_train would be undefined when eval_test_size is None
            X, y = X_train, y_train
            
        return super(XGBRegressor_ES, self).fit(X, y, **kwargs)

Example Usage

Below is a multistep pipeline that includes multiple transformations to X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above as xgbr__eval_test_size=200. In this example:

  • X_train contains text documents passed to the pipeline.
  • XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train for the validation set and early stopping. (This could also be a percentage such as xgbr__eval_test_size=0.2)
  • The remaining records in X_train are passed along to XGBRegressor.fit() for the actual fit().
  • Early stopping may now occur after 75 rounds without improvement for each CV fold in a grid search.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
   
xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                     ('vt',VarianceThreshold()),
                     ('scaler', StandardScaler()),
                     ('Sp', SelectPercentile()),
                     ('xgbr',XGBRegressor_ES(n_estimators=2000,
                                             objective='reg:squarederror',
                                             eval_metric='mae',
                                             learning_rate=0.0001,
                                             random_state=7))    ])

X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values

Example Fitting the Pipeline:

%time xgbr_pipe.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)

Example Fitting GridSearchCV:

learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
grid_result = grid_search.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)
Jake Drew