
I'm using BayesSearchCV from scikit-optimize to optimise an XGBoost model to fit some data I have. While the model fits fine, I am puzzled by the scores provided in the diagnostic information and am unable to replicate them.

Here's an example script using the Boston house prices dataset to illustrate my point:

from sklearn.datasets import load_boston

import numpy as np
import pandas as pd

from xgboost.sklearn import XGBRegressor

from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import KFold, train_test_split 

boston = load_boston()

# Dataset info:
print(boston.keys())
print(boston.data.shape)
print(boston.feature_names)
print(boston.DESCR)

# Put data into dataframe and label column headers:

data = pd.DataFrame(boston.data)
data.columns = boston.feature_names

# Add target variable to dataframe

data['PRICE'] = boston.target

# Split into X and y

X, y = data.iloc[:, :-1], data.iloc[:, -1]

# Split into training and validation datasets 

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, shuffle = True) 

# For cross-validation, split training data into 5 folds

xgb_kfold = KFold(n_splits = 5, shuffle = True, random_state = 42)  # random_state only takes effect when shuffle=True

# Run fit

xgb_params = {'n_estimators': Integer(10, 3000, 'uniform'),
               'max_depth': Integer(2, 100, 'uniform'),
               'subsample': Real(0.25, 1.0, 'uniform'),
               'learning_rate': Real(0.0001, 0.5, 'uniform'),
               'gamma': Real(0.0001, 1.0, 'uniform'),
               'colsample_bytree': Real(0.0001, 1.0, 'uniform'),
               'colsample_bylevel': Real(0.0001, 1.0, 'uniform'),
               'colsample_bynode': Real(0.0001, 1.0, 'uniform'),
               'min_child_weight': Real(1, 6, 'uniform')}

xgb_fit_params = {'early_stopping_rounds': 15, 'eval_metric': 'mae', 'eval_set': [[X_val, y_val]]}

xgb_pipe = XGBRegressor(random_state = 42,  objective='reg:squarederror', n_jobs = 10)

xgb_cv = BayesSearchCV(xgb_pipe, xgb_params, cv = xgb_kfold, n_iter = 5, n_jobs = 1,
                       random_state = 42, verbose = 4, scoring = None,
                       fit_params = xgb_fit_params)

xgb_cv.fit(X_train, y_train)

After running this, xgb_cv.best_score_ is 0.816 and xgb_cv.best_index_ is 3. Looking at xgb_cv.cv_results_, I want to pull out the individual test-fold scores for that best candidate:

for i in range(5):
    print(xgb_cv.cv_results_['split%d_test_score' % i][xgb_cv.best_index_])

Which gives:

0.8023562337946979
0.8337404778903412
0.861370681263761
0.8749312273014963
0.7058815015739375

I'm not sure what metric is being calculated here, because scoring is set to None in my code. XGBoost's documentation isn't much help, but according to the docstring shown by xgb_cv.best_estimator_.score?, it's supposed to be the R² of the predicted values.
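As a sanity check on that reading, the score method can be compared against sklearn's r2_score on the validation set (this only confirms which metric is used, not the per-fold values):

from sklearn.metrics import r2_score

# If .score really is R^2, these two numbers should agree:
print(xgb_cv.best_estimator_.score(X_val, y_val))
print(r2_score(y_val, xgb_cv.best_estimator_.predict(X_val)))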

Anyway, I'm unable to reproduce the per-fold values above when I manually calculate the score for each fold of the data used in the fit:

# First, need to get the actual indices of the data from each fold:

kfold_indexes = {}

for kfold_cnt, (train_index, test_index) in enumerate(xgb_kfold.split(X_train)):
    kfold_indexes[kfold_cnt] = {'train': train_index, 'test': test_index}

# Next, calculate the score for each fold
for p in range(5):
    print(xgb_cv.best_estimator_.score(X_train.iloc[kfold_indexes[p]['test']],
                                       y_train.iloc[kfold_indexes[p]['test']]))

Which gives me the following:

0.9954929618573786
0.994844803666101
0.9963108152027245
0.9962274544089832
0.9931314653538819

How is BayesSearchCV calculating the scores for each fold, and why can't I replicate them using the score function? I would be most grateful for any assistance with this issue.

(Also, manually calculating the mean of these per-fold test scores gives 0.8156560..., while xgb_cv.best_score_ gives 0.8159277... I'm not sure why these two differ.)
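For reference, this is how I'm comparing those numbers (assuming mean_test_score in cv_results_ is the relevant entry to check against):

split_scores = [xgb_cv.cv_results_['split%d_test_score' % i][xgb_cv.best_index_]
                for i in range(5)]

print(np.mean(split_scores))                                      # my manual mean
print(xgb_cv.cv_results_['mean_test_score'][xgb_cv.best_index_])  # mean reported by the search
print(xgb_cv.best_score_)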

  • Your first set of scores are _test_ fold scores, while the second set are _train_ fold scores. – Ben Reiniger Mar 23 '21 at 18:06
  • @BenReiniger I try using the test indices, but I still don't get the right values: ```for p in range(5): print(xgb_cv.best_estimator_.score(X_train.iloc[kfold_indexes[p]['test']], y_train.iloc[kfold_indexes[p]['test']]))``` gives: 0.9954929618573786, 0.994844803666101, 0.9963108152027245, 0.9962274544089832, 0.9931314653538819 – Electronic Ant Mar 23 '21 at 19:07
  • @BenReiniger: I've changed the predicted scores following your advice. What else might be wrong with my approach? – Electronic Ant Mar 23 '21 at 19:10

1 Answer


best_estimator_ is the refitted estimator, fitted on the entire training set after choosing the hyperparameters; so scoring it on any portion of the training set will be optimistically biased. To reproduce cv_results_, you would need to refit estimators to each training fold and score the corresponding test fold.
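A rough sketch of that refit-per-fold check, reusing the xgb_kfold, xgb_fit_params, and best_estimator_ from the question (assuming the fit parameters passed to the search are also what you want for the per-fold refits):

from sklearn.base import clone

fold_scores = []
for train_index, test_index in xgb_kfold.split(X_train):
    # clone() copies the winning hyperparameters without any fitted state
    fold_model = clone(xgb_cv.best_estimator_)
    fold_model.fit(X_train.iloc[train_index], y_train.iloc[train_index], **xgb_fit_params)
    # Scoring the held-out fold is what populates split*_test_score in cv_results_
    fold_scores.append(fold_model.score(X_train.iloc[test_index],
                                        y_train.iloc[test_index]))

print(fold_scores)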


Beyond that, there does appear to be more randomness not covered by the XGBoost random_state. There is another parameter seed; setting that produces consistent results for me. (There are some older posts here (example) reporting similar issues even with seed set, but perhaps those have been resolved by newer versions of xgb.)
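For example, something like this (a sketch only; depending on your xgboost version, seed may just be a deprecated alias of random_state and produce a warning):

xgb_pipe = XGBRegressor(random_state = 42, seed = 42, objective='reg:squarederror', n_jobs = 10)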

Ben Reiniger
  • If this is the case, then should passing ```X_train``` to ```score``` give the same value as ```best_score_```? Because I tried doing that and the values are still markedly different. – Electronic Ant Mar 23 '21 at 20:59
  • No. `best_score_` is the largest `mean_test_score`, which is based on out-of-fold scores; whereas `xgb_cv.score` is shorthand for `xgb_cv.best_estimator_.score`, which is the scoring method of the retrained model, and so "cheating". – Ben Reiniger Mar 23 '21 at 21:02
  • Having refitted a new XGBoost model using the same parameters as the best_estimator_ I now get much closer values (Original CV scores: ```0.80235623, 0.83374048, 0.86137068, 0.87493123, 0.7058815``` My emulation attempt: ```0.81142551, 0.82771065, 0.87877982, 0.88612015, 0.70819454```). But they're still not exact. Why would there still be a difference? – Electronic Ant Mar 24 '21 at 15:14
  • Hm. You've set random states in all of xgboost, kfold, and bayessearch, so that should be fine. Same dataset, I assume. I would try with sklearn's GBM and random search, to isolate whether the effect is due to xgboost or skopt. – Ben Reiniger Mar 24 '21 at 15:43
  • I've tried replacing BayesSearchCV with sklearn's RandomizedSearchCV, and for 5 folds I get the same values. However, for 2 folds the score for the 2nd fold is markedly different. Would my problem therefore be a consequence of XGBoost instead of RandomizedSearchCV or BayesSearchCV? Why would supplying the exact same parameters and data cause two different scores? – Electronic Ant Mar 24 '21 at 20:25
  • I got around to testing myself, and it does seem like an xgboost issue. I've added to the answer. – Ben Reiniger Mar 24 '21 at 23:59