Sklearn gridsearchCV object changed after pickle dump/load

Question

I have a GridSearchCV object that I created with:

grid_search = GridSearchCV(pred_home_pipeline, param_grid)

I would like to save the entire grid-search object so I can explore the model-tuning results later. I do not want to just save the best_estimator_. But after dumping and reloading, the reloaded and original grid_search objects are different in some way which I cannot track down.

import pickle

# save to disk
with open(filepath, 'wb') as handle:
    pickle.dump(grid_search, handle, protocol=pickle.HIGHEST_PROTOCOL)

# reload
with open(filepath, 'rb') as handle:
    grid_reloaded = pickle.load(handle)

# test object is unchanged after dump/reload
print(grid_search == grid_reloaded)

False

Weird. Looking at the outputs of print(grid_search) and print(grid_reloaded), they certainly look the same.

And they create the exact same set of 525 predicted values for data I held out entirely from the grid-search process:

grid_search_preds = grid_search.predict(X_test)
grid_reloaded_preds = grid_reloaded.predict(X_test)

(grid_search_preds == grid_reloaded_preds).all()

True

...Even though the best_estimator_ attributes are not technically the same:

grid_search.best_estimator_ == grid_reloaded.best_estimator_

False

...although the best_estimator_ attributes also certainly look the same when comparing print(grid_search.best_estimator_) and print(grid_reloaded.best_estimator_).

What's going on here? Is it safe to save the GridSearchCV object for inspection later?

I would guess that the grid search objects simply don't define a "functionally based" notion of equality. They're probably only considered equal if they are the exact same object. Try creating two identical GridSearch objects (by running the same creation code twice) and see if they're equal; my guess is they won't be. This may mean you can indeed use the pickled object as usual, but it just won't "look" equal to other equivalent ones (in terms of getting a true value from your `==` tests). — BrenBarn, Mar 23 '17 at 18:38

score 4 · Accepted Answer · edited May 23 '17 at 12:32


That's because the comparison is returning whether or not the objects are the same object.

To see why, follow the object hierarchy: you'll see that no class in it overrides __eq__ (or __cmp__).

Thus the "==" comparison falls back to an object identity (memory location) comparison, under which your reloaded instance and your current instance cannot be equal. It only checks whether they are the same object.

See more here.
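The fallback described above can be reproduced without sklearn. The following is a minimal sketch, assuming a hypothetical stand-in class named Estimator (not part of sklearn) that, like GridSearchCV, does not define __eq__. It shows that both a pickle round trip and a second identical construction produce objects that compare unequal with "==", even though their state is the same.

```python
import pickle


class Estimator:
    """Hypothetical stand-in for an object that does not define __eq__."""

    def __init__(self, alpha):
        self.alpha = alpha


original = Estimator(alpha=0.5)
reloaded = pickle.loads(pickle.dumps(original))  # in-memory dump/load round trip
twin = Estimator(alpha=0.5)                      # built separately by the same code

# Without __eq__, `==` is an identity check: only the very same object is equal.
print(original == original)  # True
print(original == reloaded)  # False
print(original == twin)      # False

# The state itself survived the round trip unchanged.
print(vars(original) == vars(reloaded))  # True
```

So a False from "==" here says nothing about whether the pickled object was altered; it only says the two names refer to different objects in memory.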

answered Mar 23 '17 at 19:58 by lollercoaster (edited by Community)

Thanks a lot. The blog post linked at the stack overflow answer you linked to here is really fascinating. – Max Power Mar 24 '17 at 06:39
@MaxPower please select the checkmark if this answered your question – lollercoaster Apr 05 '17 at 20:27

score 1 · Answer 2 · answered Mar 24 '17 at 07:01

Here's sklearn contributor GaelVaroquaux's answer from sklearn's GitHub on why no __eq__ method is implemented here, along with a way to test the equality of two sklearn objects:

No, I would rather not add an __eq__. These things are very difficult to get right, and one should not expect a library to implement __eq__ on complex objects.

One thing that you can do, is use joblib.hash to compute an MD5 hash of the object, and use this for comparison.
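As a rough sketch of that suggestion, the snippet below compares joblib.hash values instead of using "==". The Estimator class is a hypothetical stand-in for a fitted sklearn object, chosen only to keep the example small; with a real grid search you would pass grid_search and grid_reloaded to joblib.hash in the same way. It assumes joblib is installed (it ships as a dependency of sklearn).

```python
import pickle

import joblib


class Estimator:
    """Hypothetical stand-in for a fitted object that does not define __eq__."""

    def __init__(self, alpha, scores):
        self.alpha = alpha
        self.scores = scores


original = Estimator(alpha=0.5, scores=[0.81, 0.79, 0.84])
reloaded = pickle.loads(pickle.dumps(original))  # dump/load round trip
changed = Estimator(alpha=0.7, scores=[0.81, 0.79, 0.84])

# joblib.hash returns a hex digest (MD5 by default) computed from the object's content.
hash_original = joblib.hash(original)
hash_reloaded = joblib.hash(reloaded)
hash_changed = joblib.hash(changed)

# `==` on the objects is still an identity check.
print(original == reloaded)

# The digests should match when the state round-tripped unchanged,
# and differ when a parameter differs.
print(hash_original == hash_reloaded)
print(hash_original == hash_changed)
```

Matching digests are a reasonable sign that the reloaded object carries the same content as the original, which is the property the question actually cares about.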
