33

There is a very helpful class, GridSearchCV, in scikit-learn for doing grid search with cross validation, but I don't want to do cross validation. I want to do grid search without cross validation and use the whole data set for training. To be more specific, I need to evaluate my model, built with RandomForestClassifier, by its "oob score" during the grid search. Is there an easy way to do this, or should I write a class myself?

The points are:

  • I'd like to do the grid search in an easy way.
  • I don't want to do cross validation.
  • I need to use the whole data set for training (I don't want to split it into train and test data).
  • I need to use the oob score for evaluation during the grid search (a rough sketch of what I mean follows this list).
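
Something like this, where the data and parameter values are just placeholders:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)   # toy data

# oob_score=True makes the fitted forest expose oob_score_,
# an accuracy estimate computed from the out-of-bag samples
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)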
ykensuke9

5 Answers

57

I would really advise against using OOB to evaluate a model, but it is useful to know how to run a grid search outside of GridSearchCV() (I frequently do this so I can save the CV predictions from the best grid for easy model stacking). I think the easiest way is to create your grid of parameters via ParameterGrid() and then loop over every set of params. For example, assuming you have a grid dict named "grid" and an RF model object named "rf", you can do something like this:

from sklearn.model_selection import ParameterGrid

best_score = -1.0   # oob_score_ is an accuracy, so any real result will beat this
best_grid = None

for g in ParameterGrid(grid):
    rf.set_params(**g)      # rf must have been created with oob_score=True
    rf.fit(X, y)
    # keep the parameter set with the best out-of-bag score
    if rf.oob_score_ > best_score:
        best_score = rf.oob_score_
        best_grid = g

print("OOB: %0.5f" % best_score)
print("Grid:", best_grid)
David
  • Thank you, @David! I'll use ParameterGrid. I wonder why I shouldn't use OOB for evaluation. If you don't mind taking the time, could you explain it or point me to a link about it? – ykensuke9 Jan 06 '16 at 03:56
  • OOB error is just more likely to lead to overfitting than using some form of holdout validation. – David Jan 12 '16 at 15:20
  • I got it. I don't know much about RandomForest and OOB, so I'll study more based on your opinion. My guess as to why OOB isn't good for evaluation is that the data used for evaluation is smaller than with cross validation, so the model specializes on that small amount of data and overfits. – ykensuke9 Jan 15 '16 at 02:23
  • David, do you have a citation for that claim? OOB error doesn't see the data it evaluates. – thc Mar 29 '17 at 21:45
  • David, given that oob_score_ is an accuracy, the original code was selecting the worst model; I think it should be 'if rf.oob_score_ > best_score:'. – cpeusteuche Jun 14 '17 at 09:41
  • @David, why do you think OOB error likely leads to overfitting? As far as I understand, it should be an unbiased estimate of the error rate according to https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm and other literature. – RNA Jul 24 '17 at 17:24
5

See this link: https://stackoverflow.com/a/44682305/2202107

That answer uses cv=[(slice(None), slice(None))], which is NOT recommended by sklearn's authors.
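
Roughly, that trick looks like the sketch below (my own hedged illustration, not the linked author's exact code; X, y and the grid values are placeholders). Note that because the train and test indices are both the full data set, the reported score is computed on the training data unless you also supply a custom scorer, for example one that returns the OOB score:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)   # placeholder data

grid = {'n_estimators': [100, 300], 'max_features': ['sqrt', 'log2']}

# a single "split" whose train and test parts are both the whole data set
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    grid,
    cv=[(slice(None), slice(None))],
)
search.fit(X, y)
print(search.best_params_, search.best_score_)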

Sida Zhou
  • This is such a great approach though, and it works! The link to the github issue where `sklearn` devs indicate that this is a bad practice is [here](https://github.com/scikit-learn/scikit-learn/issues/2048). – edesz Mar 14 '20 at 22:45
2

One method is to use ParameterGrid() to make an iterator over the parameter combinations you want and loop over it.

Another thing you could do is configure GridSearchCV itself to do what you want. I wouldn't really recommend this, because it's unnecessarily complicated.
What you would need to do is:

  • Use the cv argument from the docs and give it a generator which yields a tuple with all indices (so that train and test are the same).
  • Change the scoring argument to use the OOB score reported by the random forest (a rough sketch of both steps follows).
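
For illustration, here is a minimal sketch of those two steps under my own assumptions: X and y stand in for your data, the grid values are placeholders, and whole_data_split and oob_scorer are just one way to wire it up (the forest must be built with oob_score=True):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)   # placeholder data

def whole_data_split(X):
    # a single "split" whose train and test indices both cover the full data set
    idx = np.arange(len(X))
    yield idx, idx

def oob_scorer(estimator, X, y):
    # ignore the data passed in and report the fitted forest's out-of-bag score
    return estimator.oob_score_

rf = RandomForestClassifier(oob_score=True, random_state=0)
grid = {'n_estimators': [100, 300], 'max_depth': [None, 5]}

search = GridSearchCV(rf, grid, cv=whole_data_split(X), scoring=oob_scorer)
search.fit(X, y)
print(search.best_params_, search.best_score_)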
AbdealiLoKo
  • Thank you, AJK. As you say, the approach using GridSearchCV looks a little complicated and unnatural. I'll use ParameterGrid. – ykensuke9 Jan 06 '16 at 05:35
2

Although the question was solved years ago, I just found a more natural way if you insist on using GridSearchCV() instead of other means (ParameterGrid(), etc.):

  1. Create a sklearn.model_selection.PredefinedSplit(). It takes a parameter called test_fold, a list of the same size as your input data. In that list, set every sample that belongs only to the training set to -1 and the rest to 0.
  2. Create a GridSearchCV object with cv="the created PredefinedSplit object".

Then, GridSearchCV will generate only 1 train-validation split, which is defined in test_fold.
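
For illustration, a minimal sketch of this recipe under my own assumptions (X, y and the grid are placeholders; here the first 80% of the samples are marked -1, i.e. training only, and the rest are marked 0, i.e. the single validation fold; as the comments below point out, marking every sample -1 leaves no validation fold at all):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_classification(n_samples=500, random_state=0)   # placeholder data

# -1 = only ever used for training, 0 = member of the single validation fold
test_fold = np.full(len(X), -1)
test_fold[int(0.8 * len(X)):] = 0

ps = PredefinedSplit(test_fold=test_fold)
print(ps.get_n_splits())   # 1

grid = {'n_estimators': [100, 300]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=ps)
search.fit(X, y)
print(search.best_params_)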

Masanarok
  • When I tried `PredefinedSplit(test_fold=[-1]*len(X_train))`, I got the error `ValueError: No fits were performed. Was the CV iterator empty? Were there no candidates?`. Though I might be misunderstanding something about your approach, I had to use `test_fold=[0]*len(X_train)`. – edesz Mar 14 '20 at 22:53
  • @edesz I got the same error, did you find a solution? The docstring says that 0 is included in the test set. – Sjotroll Oct 18 '22 at 13:12
  • @Sjotroll My only approach was to use `[0]*len(X_train)`. This seemed to work for me, but I don't fully understand it. Unfortunately, I don't have a better explanation for the root cause of the error message. – edesz Oct 19 '22 at 17:02
0

A parallelized solution using ParameterGrid:

from sklearn.model_selection import ParameterGrid
from joblib import Parallel, delayed

# assumes estimator, X_train, y_train, X_val and y_val are already defined
param_grid = {'a': [1, 2], 'b': [True, False]}   # replace with your estimator's parameters
param_candidates = ParameterGrid(param_grid)
print(f'{len(param_candidates)} candidates')

def fit_model(params):
    # fit one parameter combination and score it on the hold-out validation set
    model = estimator.set_params(**params)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)   # or model.oob_score_ if fitted with oob_score=True
    return [params, score]

# evaluate all candidates in parallel and print the best-scoring parameter set
results = Parallel(n_jobs=-1, verbose=10)(delayed(fit_model)(params) for params in param_candidates)
print(max(results, key=lambda x: x[1]))
Nermin