
I have a classifier object that is larger than 2 GiB and I want to pickle it, but I got this:

cPickle.dump(clf, fo, protocol = cPickle.HIGHEST_PROTOCOL)

OverflowError: cannot serialize a string larger than 2 GiB

I found this question describing the same problem, where it was suggested to either:

  1. use Python 3 with pickle protocol 4 - not acceptable, as I need to use Python 2
  2. use from pyocser import ocdumps, ocloads - not acceptable, as I can't use other (non-trivial) modules
  3. break the object into bytes and pickle each fragment

Is there a way to do so with my classifier? i.e. turn it into bytes, split, pickle, unpickle, concatenate the bytes, and use the classifier?


My code:

from sklearn.svm import SVC
import cPickle
import time

def train_clf(X, y, clf_name):
    start_time = time.time()
    # after many tests, this was found to be the best classifier
    clf = SVC(C=0.01, kernel='poly')
    clf.fit(X, y)
    print 'fit done... {} seconds'.format(time.time() - start_time)
    with open(clf_name, "wb") as fo:
        cPickle.dump(clf, fo, protocol=cPickle.HIGHEST_PROTOCOL)
        # cPickle.HIGHEST_PROTOCOL == 2
        # the error occurs inside the dump method
    return time.time() - start_time

After this, I want to unpickle and use it:

with open(clf_name, 'rb') as fo:
    clf, load_time = cPickle.load(fo), time.time()

1 Answer

You can use `sklearn.externals.joblib`, which automatically splits the model into separate pickled numpy array files when the model is large:

from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl') 

Update: newer scikit-learn versions will show

DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.

So use this instead:

import joblib
joblib.dump(clf, 'filename.pkl') 

which can be unpickled later using:

clf = joblib.load('filename.pkl') 
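A minimal round-trip sketch (the classifier is replaced by a stand-in picklable object here, and the filename is illustrative; `compress` is the option that keeps everything in a single file instead of auxiliary .npy files):

```python
import joblib

# Stand-in for a large trained classifier; any picklable object works.
model = {'kernel': 'poly', 'C': 0.01, 'support': list(range(10))}

# compress=3 forces a single compressed output file rather than separate
# pickled numpy array files.
joblib.dump(model, 'model.pkl', compress=3)
restored = joblib.load('model.pkl')
```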
  • Or as simple as `import joblib` – Eli Korvigo Jan 03 '18 at 09:41
  • @eli korvigo Joblib (http://pythonhosted.org/joblib/) Is this joblib same as 'from sklearn.externals import joblib'? There might be some sklearn specific code modifications I am not sure. – Alok Nayak Jan 03 '18 at 09:47
  • BTW, I see that only one file was saved during this `joblib.dump(clf, 'filename.pkl')` - is it Ok? shouldn't each numpy array be saved alone? – CIsForCookies Jan 03 '18 at 09:52
  • 1
    @CIsForCookies it's okay, `joblib` saves multiple files if necessary (you can specify a compression level to always save a single file). – Eli Korvigo Jan 03 '18 at 10:20
  • 1
    @AlokNayak there is nothing special about serialisation, though they've implemented additional wrappers for other joblib features (most notably, parallelisation). – Eli Korvigo Jan 03 '18 at 10:24