
After identifying the best parameters using a pipeline and GridSearchCV, how do I pickle/joblib this process to re-use later? I see how to do this when it's a single classifier...

import joblib
joblib.dump(clf, 'filename.pkl') 

But how do I save this overall pipeline with the best parameters after performing and completing a gridsearch?

I tried:

  • joblib.dump(grid, 'output.pkl') - But that dumped every gridsearch attempt (many files)
  • joblib.dump(pipeline, 'output.pkl') - But I don't think that contains the best parameters

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# df is the DataFrame holding the keyword / ad-group training data
X_train = df['Keyword']
y_train = df['Ad Group']

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('sgd', SGDClassifier())
])

parameters = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'tfidf__max_df': [0.25, 0.5, 0.75, 1.0],
              'tfidf__max_features': [10, 50, 100, 250, 500, 1000, None],
              'tfidf__stop_words': ('english', None),
              'tfidf__smooth_idf': (True, False),
              'tfidf__norm': ('l1', 'l2', None),
              }
              
grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1)
grid.fit(X_train, y_train)

#These were the best combination of tuning parameters discovered
##best_params = {'tfidf__max_features': None, 'tfidf__use_idf': False,
##               'tfidf__smooth_idf': False, 'tfidf__ngram_range': (1, 2),
##               'tfidf__max_df': 1.0, 'tfidf__stop_words': 'english',
##               'tfidf__norm': 'l2'}
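
For reference, these values can be read directly off the fitted search object; a minimal sketch:

print(grid.best_params_)   # best parameter combination found by the search
print(grid.best_score_)    # mean cross-validated score of that combination
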
Jarad

2 Answers

import joblib
joblib.dump(grid.best_estimator_, 'filename.pkl')

If you want to dump your object into a single file, use:

joblib.dump(grid.best_estimator_, 'filename.pkl', compress=1)
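
To re-use the saved pipeline later, load it back and call it like any fitted estimator (a minimal sketch; 'filename.pkl' and the example keyword are placeholders):

import joblib

# Load the fitted pipeline (TfidfVectorizer + SGDClassifier with the best parameters)
model = joblib.load('filename.pkl')

# The loaded object is the refitted pipeline, so raw text can be passed straight in
predictions = model.predict(['some new keyword'])
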
Ibraim Ganiev
  • As a best practice, once the best model has been selected, one should retrain it on the entire dataset. In order to do so, should one retrain the same pipeline object on the entire dataset (thus applying the same data processing) and then deploy that very object? Or should one recreate a new model? – Odisseo Mar 30 '19 at 07:49
  • @Odisseo - My opinion is that you retrain a new model from scratch. You can still use a pipeline, but you swap the grid-search classifier for your final classifier (say, a random forest), add that classifier to the pipeline, retrain it on all the data, and save the end model. The end result is that your entire dataset was trained inside the full pipeline you want. This may lead to slightly different preprocessing, for instance, but it should be more robust. In practice, this means you call pipeline.fit() and save the pipeline. – brian_ds Oct 29 '19 at 17:24
  • @Odisseo I'm a little bit late but... GridSearchCV automatically retrains the model on the entire dataset, unless you explicitly ask it not to. So when you fit the GridSearchCV object, the model you use for predicting (in other words, the best_estimator_) has already been retrained on the whole dataset. – Federico Dorato Apr 29 '20 at 03:56
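
To make the refit behaviour described in the last comment explicit, a sketch re-using the pipeline, parameters, and training data from the question (refit=True is already the default):

grid = GridSearchCV(pipeline, parameters, cv=2, verbose=1, refit=True)
grid.fit(X_train, y_train)

# best_estimator_ is the pipeline refitted on all of X_train/y_train with the
# best parameter combination, so it can be dumped as-is
joblib.dump(grid.best_estimator_, 'filename.pkl')
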

I just want to point out that when it comes to the size on disk, saving the GridSearchCV or its best estimator doesn't differ much (for my personal project, it was 1865 KB vs 1801 KB) but compressing makes a world of difference. In other words, passing compress=True (or an integer between 1 and 9) is important.

In the following example, case1.pkl will have a much smaller size on disk than case2.pkl and case3.pkl, while case2.pkl and case3.pkl will have very similar sizes.

import joblib
joblib.dump(grid, 'case1.pkl', compress=True)     # <--- good

joblib.dump(grid, 'case2.pkl')
joblib.dump(grid.best_estimator_, 'case3.pkl')
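
If you want to verify the difference yourself, comparing the sizes on disk is straightforward (a small sketch using the file names above):

import os

for name in ['case1.pkl', 'case2.pkl', 'case3.pkl']:
    print(name, os.path.getsize(name), 'bytes')
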

If you want to use pickle instead of joblib, you can combine it with the built-in gzip to compress it:

import pickle
import gzip

with gzip.open('case4.pkl', 'wb') as f:
    pickle.dump(grid, f)
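
Loading it back mirrors the dump (a small sketch re-using the case4.pkl name):

import pickle
import gzip

with gzip.open('case4.pkl', 'rb') as f:
    grid = pickle.load(f)
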

On a side note, when you load the pickled model, make sure the installed joblib version is at least as recent as the one used to dump the model in the first place. Otherwise, a KeyError may be raised.
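
A quick way to check the installed version before loading (a sketch):

import joblib
print(joblib.__version__)
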

cottontail