Lately, I have been working on applying grid search cross-validation (sklearn's GridSearchCV) for hyper-parameter tuning in Keras with the TensorFlow backend. As soon as my model is tuned, I try to save the GridSearchCV object for later use, without success.

The hyper-parameter tuning is done as follows:

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from keras.callbacks import History
from keras.wrappers.scikit_learn import KerasRegressor

x_train, x_val, y_train, y_val = train_test_split(NN_input, NN_target, train_size = 0.85, random_state = 4)

history = History()
kfold = 10

regressor = KerasRegressor(build_fn = create_keras_model, epochs = 100, batch_size = 1000, verbose = 1)

neurons = np.arange(10, 101, 10)
hidden_layers = [1, 2]
optimizer = ['adam', 'sgd']
activation = ['relu']
dropout = [0.1]

parameters = dict(neurons = neurons,
                  hidden_layers = hidden_layers,
                  optimizer = optimizer,
                  activation = activation,
                  dropout = dropout)

gs = GridSearchCV(estimator = regressor,
                  param_grid = parameters,
                  scoring = 'neg_mean_squared_error',
                  n_jobs = 1,
                  cv = kfold,
                  verbose = 3,
                  return_train_score = True)

grid_result = gs.fit(NN_input,
                     NN_target,
                     callbacks = [history],
                     verbose = 1,
                     validation_data = (x_val, y_val))

Remark: the create_keras_model function initializes and compiles a Keras Sequential model.

After the cross-validation is performed, I try to save the grid-search object (gs) with the following code:

from sklearn.externals import joblib

joblib.dump(gs, 'GS_obj.pkl')

The error I am getting is the following:

TypeError: can't pickle _thread.RLock objects

Could you please let me know what might be the reason for this error?

Thank you!

P.S.: The joblib.dump method works well for saving GridSearchCV objects that are used to train MLPRegressors from sklearn.

E.Thrampoulidis

3 Answers


Use

import joblib

directly instead of

from sklearn.externals import joblib

Save objects or results with:

joblib.dump(gs, 'model_file_name.pkl')

and load your results using:

joblib.load("model_file_name.pkl")

Here is a simple working example:


import joblib

#save your model or results
joblib.dump(gs, 'model_file_name.pkl')

#load your model for further usage
joblib.load("model_file_name.pkl")

liedji

Try this:

from sklearn.externals import joblib
joblib.dump(gs.best_estimator_, 'filename.pkl')

If you want to dump your object into one file - use:

joblib.dump(gs.best_estimator_, 'filename.pkl', compress = 1)

Simple Example:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.externals import joblib

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
gs = GridSearchCV(svc, parameters)
gs.fit(iris.data, iris.target)

joblib.dump(gs.best_estimator_, 'filename.pkl')

#['filename.pkl']

EDIT 1:

you can also save the whole object:

joblib.dump(gs, 'gs_object.pkl')
seralouk
  • Thank you for your reply! What you proposed, if I am not mistaken, is the way to save only the model with the best tuned parameters (the best estimator). However, what I want to do is save all the information contained in the GridSearchCV object, meaning the performance information of all trained models. One way would be to save gs.cv_results_ and not the whole object, but I am just wondering why I am not allowed to save the whole object to a file. – E.Thrampoulidis Jul 20 '18 at 10:34
  • 1
    You can save the whole object using `joblib.dump(gs, 'gs_object.pkl')`. See my edited answer – seralouk Jul 20 '18 at 11:21
  • As stated in my question I have already tried this method to save the whole object and it does not work. I still haven't figured out why. – E.Thrampoulidis Aug 02 '18 at 15:41
  • 1
    @E.Thrampoulidis I am working on this myself. The problem is that GridSearchCV is intended to support parallelism through the n_jobs argument. As far as I know, there is no easy way to pickle an object supporting parallel calls (hence the error about pickling threads). Pickle is great for simple data structures such as a dictionary (cv_results), but it isn't a good choice for complex objects (such as the GridSearchCV class) which was never intended for serialization in the first place. – campellcl Aug 13 '19 at 03:31
  • `sklearn.externals.joblib` is deprecated as of scikit-learn 0.21 and will be removed in 0.23. Now, `joblib` needs to be installed as a separate package either through pip (`pip install joblib`) or [conda](https://anaconda.org/anaconda/joblib) (`conda install -c anaconda joblib`) – Arturo Moncada-Torres Apr 07 '20 at 22:44

Subclass the sklearn.model_selection._search.BaseSearchCV class. Override the fit(self, X, y=None, groups=None, **fit_params) method and modify its internal evaluate_candidates(candidate_params) function: instead of immediately returning the results dictionary from evaluate_candidates(candidate_params), perform your serialization there (or in the _run_search method, depending on your use case). With some additional modifications, this approach has the added benefit of letting you execute the grid search sequentially (see the comment in the source code in _search.py). Note that the results dictionary returned by evaluate_candidates(candidate_params) is the same as the cv_results_ dictionary. This approach worked for me, but I was also attempting to add save-and-restore functionality for interrupted grid-search executions.
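Whichever hook you serialize from, the key point is that the results dictionary is plain data (lists and arrays), so it pickles without any RLock trouble. A minimal stdlib sketch with a stand-in dictionary (the contents below are illustrative, not real grid-search output):

```python
import pickle

# Stand-in for gs.cv_results_: plain lists and dicts serialize cleanly,
# unlike fitted estimator objects that carry thread locks.
cv_results = {
    "params": [{"neurons": 10}, {"neurons": 20}],
    "mean_test_score": [-0.42, -0.37],
}

# Checkpoint the results to disk...
with open("cv_results.pkl", "wb") as f:
    pickle.dump(cv_results, f)

# ...and restore them later, e.g. after an interrupted run.
with open("cv_results.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored["mean_test_score"])
```

joblib.dump/joblib.load work just as well here and are usually preferred for dictionaries containing large NumPy arrays.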

campellcl
  • Hi Chris! Have you been able to save and resume an interrupted grid search? I'd like to do something similar with BayesSearchCV (from Scikit-Optimize library) which uses a similar interface to GridSearchCV. – SergeGardien Sep 18 '19 at 20:47
  • @SergeGardien Yes, but it is not a quick fix. You have to modify some methods in the core lib. Better off just maintaining your own cv_results dictionary and serializing and restoring from that. – campellcl Sep 19 '19 at 20:49
  • Understood, thanks. The problem is that BayesSearchCV is path dependent, differently from GridSearchCV, and I don't think that simply storing cv_results is enough to have all the information to resume the procedure. Anyway, I'll take a look if I can find some time, otherwise I'll try not do be in a situation where I need to resume the optimization procedure. – SergeGardien Sep 22 '19 at 14:25
  • @SergeGardien I would be happy to provide more details as soon as I get a chance. Good luck! – campellcl Sep 23 '19 at 15:39