
Currently I have the following code:

I start by splitting the dataset into training and test sets. I then run GridSearchCV to try to find the optimal parameters. After I have found the optimal parameters, I assess the classifier with those parameters via cross_val_score. Is this an acceptable way to go about this?
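A minimal sketch of that workflow (the dataset, LogisticRegression estimator, and param_grid below are illustrative placeholders, since the actual code is not shown):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# placeholder data and estimator standing in for the real code
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg = LogisticRegression(solver='liblinear')
param_grid = {'C': [0.01, 0.1, 1, 10]}

# step 1: search for the optimal parameters on the training set
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=3, n_jobs=-1)
grid_search.fit(X_train, y_train)

# step 2: assess a classifier built with those parameters via cross_val_score
best_clf = LogisticRegression(solver='liblinear', **grid_search.best_params_)
scores = cross_val_score(best_clf, X_test, y_test, cv=3)
print(grid_search.best_params_, scores.mean())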

WewLad

2 Answers


You can specify a scoring parameter inside the GridSearchCV object using make_scorer, like this:

from sklearn.metrics import precision_score, make_scorer
from sklearn.model_selection import GridSearchCV

prec_metric = make_scorer(precision_score)
grid_search = GridSearchCV(estimator=logreg, scoring=prec_metric, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=3)

Once you have fitted your data, you can use the cv_results_ attribute to access the scores like this:

results = grid_search.cv_results_

{
'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                         mask = [False False False False]...)
'param_gamma': masked_array(data = [-- -- 0.1 0.2],
                        mask = [ True  True False False]...),
'param_degree': masked_array(data = [2.0 3.0 -- --],
                         mask = [False False  True  True]...),
 'split0_test_score'  : [0.8, 0.7, 0.8, 0.9],
 'split1_test_score'  : [0.82, 0.5, 0.7, 0.78],
 'mean_test_score'    : [0.81, 0.60, 0.75, 0.82],
 'std_test_score'     : [0.02, 0.01, 0.03, 0.03],
 'rank_test_score'    : [2, 4, 3, 1],
 'split0_train_score' : [0.8, 0.9, 0.7],
 'split1_train_score' : [0.82, 0.5, 0.7],
 'mean_train_score'   : [0.81, 0.7, 0.7],
 'std_train_score'    : [0.03, 0.03, 0.04],
 'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
 'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
 'mean_score_time'    : [0.007, 0.06, 0.04, 0.04],
 'std_score_time'     : [0.001, 0.002, 0.003, 0.005],
 'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
 }
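For example, once the grid search has been fitted, a convenient way to inspect these columns (pandas is used here purely for readability) is:

import pandas as pd

# per-candidate cross-validation scores as a table
results_df = pd.DataFrame(grid_search.cv_results_)
print(results_df[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']])

# the best parameter combination and its mean cross-validated score
print(grid_search.best_params_)
print(grid_search.best_score_)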

You can also use multiple metrics for evaluation as mentioned in this example.

You can make your own custom metric or use one of the metrics specified here.
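A rough sketch of what multi-metric scoring can look like (the scorer names here are illustrative; when several scorers are passed, refit must name the one used to pick the best estimator):

from sklearn.metrics import precision_score, make_scorer
from sklearn.model_selection import GridSearchCV

scoring = {'prec': make_scorer(precision_score), 'rec': 'recall'}
grid_search = GridSearchCV(estimator=logreg, param_grid=param_grid,
                           scoring=scoring, refit='prec', cv=3, n_jobs=-1)
# cv_results_ then holds per-metric columns such as 'mean_test_prec'
# and 'mean_test_rec' instead of a single 'mean_test_score'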

Update: Based on this answer, you should then feed the classifier from grid_search to cross_val_score before fitting it on the whole data, to prevent any data leakage.
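For instance, a short sketch of that (clone returns an un-fitted copy that keeps the tuned parameters; X_test and y_test stand for whatever data you evaluate on):

from sklearn.base import clone
from sklearn.model_selection import cross_val_score

# un-fitted copy of the best estimator found by the grid search
best_clf = clone(grid_search.best_estimator_)
scores = cross_val_score(best_clf, X_test, y_test, cv=3)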

Gambit1614
  • So when I use GridSearchCV, do I want to pass the entire dataset and ignore train_test_split? Will mean_test_score then provide testing results for a fold that was used for testing? – WewLad Jun 26 '18 at 08:09
  • Yes, it will provide testing results for a fold that was used for testing, and yes, you pass the whole dataset, as cross validation is taken care of by GridSearchCV itself. – Gambit1614 Jun 26 '18 at 08:10
  • Sorry for the constant questions: are the reported scores the ones for the best parameters that the grid search has found? – WewLad Jun 26 '18 at 08:17
  • Yes, it is for the best parameters. – Gambit1614 Jun 26 '18 at 08:31
  • I'm looking at this post and it suggests splitting before I feed the data to GridSearchCV as data may leak from the training set and skew the results. https://stackoverflow.com/a/49165571/5171293. – WewLad Jun 26 '18 at 08:51
  • Yes, this is correct and the data may leak, but the post is suggesting to keep a part of the dataset completely separate for measuring the final score; that part should not be used for parameter tuning at all. – Gambit1614 Jun 26 '18 at 09:29
  • So would you suggest this is a good methodology: split dataset > use GridSearchCV for optimal parameters on the training set > fit logreg with the training set and the parameters > run cross_val_score on the test set? – WewLad Jun 26 '18 at 09:42
  • Yes, that should be the correct order for training/testing your classifier. – Gambit1614 Jun 26 '18 at 09:43

You actually don't need the cross_val_score step.

Check out the link below; I think it will sort things out for you:

http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html
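The gist of that example, as a rough sketch (not the exact code at the link; it reuses the grid_search object and train/test split names from the answer above): cross-validation happens inside GridSearchCV on the training split, and the refit best estimator is then scored once on the held-out test set.

from sklearn.metrics import classification_report

grid_search.fit(X_train, y_train)        # cross-validation runs inside GridSearchCV
y_pred = grid_search.predict(X_test)     # delegates to the refit best_estimator_
print(classification_report(y_test, y_pred))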

Amit_JCI
  • I'm a bit confused by what's going on in that example. What is 'mean_test_score' if it hasn't been fitted to the test set yet? – WewLad Jun 26 '18 at 07:48
  • It is not connected to the test set. The test set should not be fitted at any time (unless you want to overfit)! The test score is the mean score over the cross-validation folds (computed only on the training set). – Amit_JCI Jun 26 '18 at 07:59
  • If I were to use GridSearchCV on my full dataset would the results returned be overfitted results or the results of a fold reserved for testing? – WewLad Jun 26 '18 at 08:45
  • It is always recommended to have a separate hold-out set, but if your data is very small you can settle for the mean CV error. – Amit_JCI Jun 26 '18 at 14:34