
I have a classifier object that is larger than 2 GiB and I want to pickle it, but I got this:

cPickle.dump(clf, fo, protocol = cPickle.HIGHEST_PROTOCOL)

OverflowError: cannot serialize a string larger than 2 GiB

I found this question describing the same problem, where it was suggested to either:

  1. use Python 3 with pickle protocol 4 - not acceptable, as I need to use Python 2
  2. use from pyocser import ocdumps, ocloads - not acceptable, as I can't use other (non-trivial) modules
  3. break the object into bytes and pickle each fragment

Is there a way to do so with my classifier? i.e. turn it into bytes, split, pickle, unpickle, concatenate the bytes, and use the classifier?


My code:

from sklearn.svm import SVC
import cPickle
import time

def train_clf(X, y, clf_name):
    start_time = time.time()
    # after many tests, this was found to be the best classifier
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    print 'fit done... {} seconds'.format(time.time() - start_time)
    with open(clf_name, "wb") as fo:
        cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
        # cPickle.HIGHEST_PROTOCOL == 2
        # the error occurs inside the dump method
    return time.time() - start_time

After this, I want to unpickle and use it:

with open(clf_name, 'rb') as fo:
    clf, load_time = cPickle.load(fo), time.time()

1 Answer

You can use `sklearn.externals.joblib`, which automatically splits the model into separate pickled numpy array files when the model is large:

from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl') 

Update: newer scikit-learn versions will show

DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.

So use this instead:

import joblib
joblib.dump(clf, 'filename.pkl') 

which can be unpickled later using:

clf = joblib.load('filename.pkl') 
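A minimal round-trip sketch (the classifier is replaced by a stand-in picklable object here, and the filename is illustrative; `compress` is the option that keeps everything in a single file instead of auxiliary .npy files):

```python
import joblib

# Stand-in for a large trained classifier; any picklable object works.
model = {'kernel': 'poly', 'C': 0.01, 'support': list(range(10))}

# compress=3 forces a single compressed output file rather than separate
# pickled numpy array files.
joblib.dump(model, 'model.pkl', compress=3)
restored = joblib.load('model.pkl')
```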
  • Or as simple as `import joblib` – Eli Korvigo Jan 03 '18 at 09:41
  • @eli korvigo Joblib (http://pythonhosted.org/joblib/) Is this joblib same as 'from sklearn.externals import joblib'? There might be some sklearn specific code modifications I am not sure. – Alok Nayak Jan 03 '18 at 09:47
  • BTW, I see that only one file was saved during this `joblib.dump(clf, 'filename.pkl')` - is it Ok? shouldn't each numpy array be saved alone? – CIsForCookies Jan 03 '18 at 09:52
  • 1
    @CIsForCookies it's okay, `joblib` saves multiple files if necessary (you can specify a compression level to always save a single file). – Eli Korvigo Jan 03 '18 at 10:20
  • 1
    @AlokNayak there is nothing special about serialisation, though they've implemented additional wrappers for other joblib features (most notably, parallelisation). – Eli Korvigo Jan 03 '18 at 10:24