What's the most efficient way to serialize a scikit-learn classifier?
I'm currently using Python's standard Pickle module to serialize a text classifier, but this results in a monstrously large pickle. The serialized object can be 100MB or more, which seems excessive and takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier is usually just a couple of MBs.
Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up and reduce the size of serialized scikit-learn classifiers?
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
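For reference, here is a minimal, self-contained sketch of how I fit and pickle the pipeline. The corpus, labels, and output filename below are placeholders standing in for my real data:

```python
import os
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

# Placeholder corpus; the real one is much larger, which is where
# the 100MB+ pickle comes from (character/word n-grams up to length 4
# produce a very large vocabulary).
docs = [
    "spam spam spam buy now",
    "meeting notes for tuesday",
    "limited offer click here",
    "project status update attached",
]
labels = [1, 0, 1, 0]

classifier.fit(docs, labels)

# Serialize the whole fitted pipeline with the standard pickle module.
with open("classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

print("pickle size (bytes):", os.path.getsize("classifier.pkl"))
```

On my real corpus this produces the 100MB+ file described above; on the toy data it is of course tiny, but the structure of what gets serialized is the same.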