What's the most efficient way to serialize a scikit-learn classifier?
I'm currently using Python's standard Pickle module to serialize a text classifier, but this results in a monstrously large pickle. The serialized object can be 100MB or more, which seems excessive and takes a while to generate and store. I've done similar work with Weka, and the equivalent serialized classifier is usually just a couple of MBs.
Is scikit-learn possibly caching the training data, or other extraneous info, in the pickle? If so, how can I speed up and reduce the size of serialized scikit-learn classifiers?
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])
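For reference, here is a minimal, self-contained sketch of how I fit and pickle the pipeline. The corpus, labels, and output filename below are placeholders standing in for my real data:

```python
import os
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 4))),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC())),
])

# Placeholder corpus; the real one is much larger, which is where
# the 100MB+ pickle comes from (character/word n-grams up to length 4
# produce a very large vocabulary).
docs = [
    "spam spam spam buy now",
    "meeting notes for tuesday",
    "limited offer click here",
    "project status update attached",
]
labels = [1, 0, 1, 0]

classifier.fit(docs, labels)

# Serialize the whole fitted pipeline with the standard pickle module.
with open("classifier.pkl", "wb") as f:
    pickle.dump(classifier, f)

print("pickle size (bytes):", os.path.getsize("classifier.pkl"))
```

On my real corpus this produces the 100MB+ file described above; on the toy data it is of course tiny, but the structure of what gets serialized is the same.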