
I have a training set of 3500 observations x 70 features and a validation set of 600 observations x 70 features. The goal is to classify each observation correctly as either 0 or 1.

I use XGBoost and I aim for the highest possible precision at a classification threshold of 0.5.

I am conducting a grid search:

import numpy as np
import pandas as pd
import xgboost

# Import datasets from edge node
data_train = pd.read_csv('data.csv')
data_valid = pd.read_csv('data_valid.csv')
 
# Specify 'data_valid' as the validation set for the grid search below
from sklearn.model_selection import PredefinedSplit
X, y, train_valid_indices = train_valid_merge(data_train, data_valid)
train_valid_merge_indices = PredefinedSplit(test_fold=train_valid_indices)

# Define my own scoring function to see
# if it is called for both the training and the validation sets
from sklearn.metrics import make_scorer
custom_scorer = make_scorer(score_func=my_precision, greater_is_better=True, needs_proba=False)

# Instantiate xgboost
from xgboost.sklearn import XGBClassifier
classifier = XGBClassifier(random_state=0)

# Small parameter grid ONLY FOR A START
# I plan to use way bigger parameter grids
parameters = {'n_estimators': [150, 175, 200]}

# Execute grid search and retrieve the best classifier
from sklearn.model_selection import GridSearchCV
classifiers_grid = GridSearchCV(estimator=classifier, param_grid=parameters, scoring=custom_scorer,
                                cv=train_valid_merge_indices, refit=True, n_jobs=-1)
classifiers_grid.fit(X, y)
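
After the fit, I can read back the winning hyperparameters and their validation score from the fitted object (a minimal sketch; the attribute names are standard scikit-learn, the actual values of course depend on the data):

# Retrieve the selected hyperparameters, their score and the refitted model
print(classifiers_grid.best_params_)
print(classifiers_grid.best_score_)
best_model = classifiers_grid.best_estimator_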

............................................................................

train_valid_merge - Specify my own validation set:

I want to train every model on my training set (data_train) and do the hyperparameter tuning with a distinct/separate validation set of mine (data_valid). For this reason I define a function called train_valid_merge which concatenates my training set and my validation set so that they can be fed to GridSearchCV, and I also use PredefinedSplit to specify which part of this merged set is the training set and which is the validation set:

def train_valid_merge(data_train, data_valid):

    # Set test_fold values to -1 for training observations
    train_indices = [-1]*len(data_train)

    # Set test_fold values to 0 for validation observations
    valid_indices = [0]*len(data_valid)

    # Concatenate the indices for the training and validation sets
    train_valid_indices = train_indices + valid_indices

    # Concatenate data_train & data_valid
    import pandas as pd
    data = pd.concat([data_train, data_valid], axis=0, ignore_index=True)
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    return X, y, train_valid_indices
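
As a sanity check (a minimal sketch, assuming the label is the last column of both CSVs), the PredefinedSplit built from this test_fold list yields exactly one split, with the first 3500 rows as the training indices and the last 600 as the validation indices:

from sklearn.model_selection import PredefinedSplit

X, y, train_valid_indices = train_valid_merge(data_train, data_valid)
ps = PredefinedSplit(test_fold=train_valid_indices)
print(ps.get_n_splits())                 # 1
train_idx, valid_idx = next(iter(ps.split()))
print(len(train_idx), len(valid_idx))    # 3500 600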

............................................................................

custom_scorer - Specify my own scoring metric:

I define my own scoring function which simply returns the precision just to see if it is called for both the training and the validation sets:

def my_precision(y_true, y_predict):

    # Check length of 'y_true' to see if it is the training or the validation set
    print(len(y_true))

    # Calculate precision
    from sklearn.metrics import precision_score
    precision = precision_score(y_true, y_predict, average='binary')

    return precision
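
Note that the object returned by make_scorer is not called with (y_true, y_predict) directly: it is called as scorer(estimator, X, y), predicts internally and then forwards the result to my_precision. As a rough sketch (the slicing assumes the first 3500 rows of the merged set are the training rows), it can also be invoked by hand on a fitted model:

from xgboost.sklearn import XGBClassifier

fitted_clf = XGBClassifier(random_state=0).fit(X[:3500], y[:3500])
# Prints 600 from inside my_precision, then the precision itself
print(custom_scorer(fitted_clf, X[3500:], y[3500:]))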

............................................................................

When I run the whole thing (with parameters = {'n_estimators': [150, 175, 200]}), the print(len(y_true)) call in the my_precision function prints the following:

600
600
3500
600
3500
3500

which means that the scoring function is called for both the training and the validation set. But I have also tested that the scores from both the training and the validation set are used to determine the best model in the grid search (even though I have specified that only the validation-set results should be used).

For example, with our 3 parameter values ('n_estimators': [150, 175, 200]) it takes into account the scores on both the training and the validation set (2 sets), and hence it produces (3 parameters) x (2 sets) = 6 different grid results. It then picks the best hyperparameter set out of all these grid results, so it may end up picking one that comes from the training-set results, while I want it to take into account only the validation set (3 results).

However, if I add something like the following to the my_precision function to exclude the training set (by setting all of its precision values to 0):

# Remember that the training set has 3500 observations
# and the validation set 600 observations
if len(y_true) > 600:
    return 0

then (as far as I have tested it) I do get the right best model for my specifications, because the training-set precision results are all 0 and therefore can never be selected.
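
Put together, the guarded scorer looks roughly like this (a sketch of the workaround above, relying on the hard-coded validation-set size of 600):

from sklearn.metrics import precision_score

def my_precision(y_true, y_predict):
    # The training fold has 3500 observations, the validation fold 600,
    # so return 0 for the training fold and it can never win the ranking
    if len(y_true) > 600:
        return 0.0
    return precision_score(y_true, y_predict, average='binary')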

My questions are the following:

Why does the grid search take into account the scores on both the training and the validation set to pick out the best model, when I have specified with my train_valid_merge_indices that the best model should be selected only according to the validation set?

How can I make GridSearchCV take into account only the validation set and the models' scores on it when the models are selected and ranked?

  • First of all, I would kindly suggest you change the title - as is, it implies that this is the problem, where in fact this is your *requirement*... – desertnaut Oct 02 '18 at 11:01
  • 2) your `train_valid_split` function is again a misnomer - you are actually *merging* the sets, and certainly not splitting them. Not quite sure what exactly you are trying to accomplish here, or why you are mixing `GridSearchCV` (which is based on, well, *CV*) with the training/validation split approach, which in principle is a completely different one... – desertnaut Oct 02 '18 at 11:08
  • @desertnaut, 1) ok I modified it. I hope that this is better! 2) I simply merge the datasets to be able to feed them at the Grid Search and with the `PredefineSplit` I specify which is the training and which is the validation set at this merged set. This is the only way you can use your own validation set with `GridSearchCV`. – Outcast Oct 02 '18 at 11:11
  • @desertnaut, Is this clearer now? By the way, do you think that my requirement ("to pick out the best hyperparameters set only based on the validation set results") is absolutely unreasonable? – Outcast Oct 02 '18 at 11:21
  • Not only it is *not* unreasonable, it is exactly the norm, and what functions like `GridSearchCV` do by default! That's why I am still puzzled why you are intermixing things here (i.e. CV with training/validation split)... – desertnaut Oct 02 '18 at 11:23
  • Nobody (I mean, **nobody**) picks hyperparameters based on the performance on the *training* set... – desertnaut Oct 02 '18 at 11:24
  • @desertnaut, that's my point!..haha... The problem is that if you want to use your own distinct validation set with `GridSearchCV` then it seems that `GridSearchCV` takes also into account the training set results. That's the point of my post. – Outcast Oct 02 '18 at 11:27
  • @desertnaut I am probably not enough clear but I do not want to simply do the typical k-fold cross validation where I split my training set at k-folds, train my model at the k-1 folds, test it at the k-fold etc. I have one distinct training set and one distinct validation set. I want to train my model on the training set and find the best hyperparameters based on its performance on my distinct validation set. – Outcast Oct 02 '18 at 11:32
  • @desertnaut, But then as I write at my comment above: "The problem is that if you want to use your own distinct/separate validation set with GridSearchCV then it seems that GridSearchCV takes also into account the training set results". I hope that it's clearer now...haha...thank you in advance for your patience! – Outcast Oct 02 '18 at 11:42

2 Answers


I have one distinct training set and one distinct validation set. I want to train my model on the training set and find the best hyperparameters based on its performance on my distinct validation set.

Then you most certainly need neither PredefinedSplit nor GridSearchCV:

import pandas as pd
from xgboost.sklearn import XGBClassifier
from sklearn.metrics import precision_score

# Import datasets from edge node
data_train = pd.read_csv('data.csv')
data_valid = pd.read_csv('data_valid.csv')

# training data & labels:
X = data_train.iloc[:, :-1].values
y = data_train.iloc[:, -1].values   

# validation data & labels:
X_valid = data_valid.iloc[:, :-1].values
y_true = data_valid.iloc[:, -1].values 

n_estimators = [150, 175, 200]
perf = []

for k_estimators in n_estimators:
    clf = XGBClassifier(n_estimators=k_estimators, random_state=0)
    clf.fit(X, y)

    y_predict = clf.predict(X_valid)
    precision = precision_score(y_true, y_predict, average='binary')
    perf.append(precision)

and perf will contain the performance of your respective classifiers on your validation set...
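
Picking the winner is then straightforward, e.g. (assuming numpy is available as np):

import numpy as np

best_n_estimators = n_estimators[int(np.argmax(perf))]
print(best_n_estimators, max(perf))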

desertnaut
    Thank you for your answer(upvote). However, please keep in mind that this small parameters grid (`n_estimators = [150, 175, 200]`) is only for start so I will do a way more extensive grid search and in this sense I think that it is better to do it with `GridSearchCV`. And hence if I want to use `GridSearchCV` with my separate validation set then I will have to use probably `PredefinedSplit` (https://stackoverflow.com/questions/31948879/using-explict-predefined-validation-set-for-grid-search-with-sklearn). – Outcast Oct 02 '18 at 12:01
  • (Apologies for being repetitive) But then **the problem is that if you want to do Grid Search with `GridSearchCV` and use your own separate validation set (hence also use `PredefinedSplit`) then it seems that `GridSearchCV` takes also into account the training set results (unless I do this trick at the scoring function above and set all the training results to 0 by myself)** – Outcast Oct 02 '18 at 12:06
  • @PoeteMaudit as its name clearly implies, `GridSearchCV` is (at least in principle) **not** suited for predefined training/validation set approaches (which approaches are different & distinct from *CV*); even for more extensive grid searches, you can simply use nested `for` loops, possibly augmenting the `perf` list for bookkeeping... – desertnaut Oct 02 '18 at 12:11
  • Hm, ok even though using `for` loops is like writing a modified `GridSearchCV` from scratch. I think that the fastest thing to do is what I did with `PredefinedSplit` and creating your own scoring function simply to set all the training results to 0 by yourself (and leave the validation results untouched to be compared). – Outcast Oct 02 '18 at 12:20
  • Apologies for my insistence but I always want to know the opinion of senior data scientists about how some libraries or algorithms work etc. Surprisingly enough I may come up with some cases which even senior data scientists have not exactly encountered (separate validation set, `GridSearchCV`, `PredefinedSplit`, training & validation results). In any case, thank you very much for your time :) – Outcast Oct 02 '18 at 12:20
  • @PoeteMaudit we do face such cases every day! And we treat them like I have shown - in ways that are *simple* and give you *full & explicit control* over the procedure, with nothing hidden under any hood... ;) – desertnaut Oct 02 '18 at 12:24
  • @PoeteMaudit and this is arguably the reason why you cannot find anything "ready" in the available libraries for this case: because it is ridiculously simple to do it like that (something that is not true for, say, grid search + CV)... – desertnaut Oct 02 '18 at 12:29
  • Ok I see your point. But I do not consider it exactly so ridiculous - you will probably write the same number of lines as I did even above with my custom scoring function etc. Secondly, other packages like H2O, contrary to SkLearn, provide you directly the possibililty to use your own validation set(http://docs.h2o.ai/h2o/latest-stable/h2o-docs/grid-search.html, https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/tutorials/random%20hyperparmeter%20search%20and%20roadmap.md) – Outcast Oct 02 '18 at 12:47
  • @PoeteMaudit start building it, and you'll see that no. of lines is not an issue ;) ; as for packages like H2O, you are absolutely right, but this is scikit-learn... – desertnaut Oct 02 '18 at 12:53
  • @desertnaut, "cross-validation" does not have to mean k-fold, although that is certainly the most common strategy. `GridSearchCV` is happy to have a `PredefinedSplit` as its split generator, and so I would argue that it is "ready in the available libraries for this case". Note that `GridSearchCV` also takes care of parallelization, training/scoring errors, and so it's convenient even for the simplest of splitting procedures. – Ben Reiniger May 30 '21 at 17:51

which means that the scoring function is called for both the training and the validation set...

This is probably true.

...But I have also tested that the scores from both the training and the validation set are used to determine the best model in the grid search (even though I have specified that only the validation-set results should be used).

But this is probably not true.

There is a parameter return_train_score; when True, the search also scores the training data and returns those scores as part of the cv_results_ attribute. Prior to v0.21 the default of this parameter was True, and afterwards it is False. However, those scores are not used for determining the best hyperparameters, unless you have a custom scoring method that takes them into account. (If you think you have a counterexample, please provide the cv_results_ and best_params_.)
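
One way to check this on your own run (a rough sketch, reusing the fitted classifiers_grid from the question): the ranking that determines best_params_ comes only from mean_test_score, while the training scores, if requested, live in separate mean_train_score columns of cv_results_:

import pandas as pd

results = pd.DataFrame(classifiers_grid.cv_results_)
cols = ['params', 'mean_test_score', 'rank_test_score']
if 'mean_train_score' in results:   # present only when return_train_score=True
    cols.append('mean_train_score')
print(results[cols].sort_values('rank_test_score'))
print(classifiers_grid.best_params_)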

Why does the grid search take into account the scores on both the training and the validation set to pick out the best model, when I have specified with my train_valid_merge_indices that the best model should be selected only according to the validation set?

It is (probably) not, see above.

How can I make GridSearchCV take into account only the validation set and the models' scores on it when the models are selected and ranked?

It does this by default.

Ben Reiniger