22

I have a TfidfVectorizer that vectorizes a collection of articles, followed by feature selection.

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)

Now, I want to store these and use them in other programs, without re-running the TfidfVectorizer() and the feature selector on the training dataset. How do I do that? I know how to make a model persistent using joblib, but I wonder whether persisting these is the same as making a model persistent.

cel
user2161903

3 Answers

23

You can simply use the built-in pickle library:

import pickle
pickle.dump(vectorizer, open("vectorizer.pickle", "wb"))
pickle.dump(selector, open("selector.pickle", "wb"))

and load them back with:

vectorizer = pickle.load(open("vectorizer.pickle", "rb"))
selector = pickle.load(open("selector.pickle", "rb"))

Pickle will serialize the objects to disk and load them back into memory when you need them.

pickle lib docs
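For completeness, a minimal end-to-end sketch of this approach (the toy corpus, labels, filenames, and `k` value below are illustrative, not from the question):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative training data
corpus = ["the cat sat", "the dog ran", "cats and dogs"]
y_train = [0, 1, 1]

# Fit the vectorizer and the feature selector
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=2)
selector.fit(X_train, y_train)

# Persist both fitted objects
with open("vectorizer.pickle", "wb") as f:
    pickle.dump(vectorizer, f)
with open("selector.pickle", "wb") as f:
    pickle.dump(selector, f)

# Later, or in another program: reload and transform new text
with open("vectorizer.pickle", "rb") as f:
    vec = pickle.load(f)
with open("selector.pickle", "rb") as f:
    sel = pickle.load(f)

new_X = sel.transform(vec.transform(["a new cat document"]))
print(new_X.shape)  # (1, 2) -- one document, k=2 selected features
```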

Hadij
Marco Ferragina
  • So basically, you replaced joblib with pickle when comparing your solution with mine, right? – user2161903 Sep 29 '15 at 16:41
  • I have tried cPickle, I have tried joblib which uses pickle. For either approach, I get `pickle.PicklingError: Can't pickle : it's not found as __builtin__.instancemethod` How does that work? I am storing the TfIdfVectorizer object as well. – demongolem Mar 28 '18 at 14:49
  • @user2161903 he also fixed your typo in "vectroizer" ;) . – petemir Jun 24 '19 at 00:28
11

Here is my answer using joblib:

import joblib
joblib.dump(vectorizer, 'vectorizer.pkl')
joblib.dump(selector, 'selector.pkl')

Later, I can load them and be ready to go:

vectorizer = joblib.load('vectorizer.pkl')
selector = joblib.load('selector.pkl')

test = selector.transform(vectorizer.transform(['this is test']))
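A related convenience (not in the original answer): joblib can persist both objects in a single file by dumping them as a tuple. A sketch with illustrative stand-in objects in place of the fitted vectorizer and selector:

```python
import joblib

# Stand-ins for the fitted vectorizer and selector (illustrative only)
vectorizer = {"vocab": ["cat", "dog"]}
selector = {"k": 5000}

# Dump both objects as one tuple into a single file
joblib.dump((vectorizer, selector), "model.pkl")

# Reload and unpack them in one step
vectorizer, selector = joblib.load("model.pkl")
print(selector["k"])  # 5000
```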
Hadij
user2161903
8

"Making an object persistent" basically means dumping the binary representation of the object held in memory into a file on disk, so that later, in the same program or any other, the object can be reloaded from that file back into memory.

Either scikit-learn's bundled joblib or the stdlib pickle and cPickle will do the job. I tend to prefer cPickle because it is significantly faster. Using IPython's %timeit command:

>>> from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF
>>> t = TFIDF()
>>> t.fit_transform(['hello world', 'this is a test'])

# generic serializer - deserializer test
>>> def dump_load_test(tfidf, serializer):
...:    with open('vectorizer.bin', 'wb') as f:
...:        serializer.dump(tfidf, f)
...:    with open('vectorizer.bin', 'rb') as f:
...:        return serializer.load(f)

# joblib has a slightly different interface
>>> def joblib_test(tfidf):
...:    joblib.dump(tfidf, 'tfidf.bin')
...:    return joblib.load('tfidf.bin')

# Now, time it!
>>> %timeit joblib_test(t)
100 loops, best of 3: 3.09 ms per loop

>>> %timeit dump_load_test(t, pickle)
100 loops, best of 3: 2.16 ms per loop

>>> %timeit dump_load_test(t, cPickle)
1000 loops, best of 3: 879 µs per loop
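(Note for Python 3 readers: cPickle was folded into the standard pickle module as its C implementation, so plain pickle is already fast there; explicitly passing the highest protocol can make dumps smaller and faster still. A minimal sketch, with an illustrative payload standing in for a fitted model:)

```python
import pickle

# Illustrative payload standing in for a fitted model
data = {"weights": list(range(1000)), "name": "tfidf"}

# HIGHEST_PROTOCOL selects the most efficient format pickle supports
with open("obj.bin", "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
with open("obj.bin", "rb") as f:
    loaded = pickle.load(f)

print(loaded == data)  # True
```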

Now, if you want to store multiple objects in a single file, you can easily create a data structure to hold them, then dump the data structure itself. This works with a tuple, list, or dict. From the example in your question:

import cPickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# train
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
selector = SelectKBest(chi2, k=5000)
X_train_sel = selector.fit_transform(X_train, y_train)

# dump as a dict
data_struct = {'vectorizer': vectorizer, 'selector': selector}
# use the 'with' keyword to automatically close the file after the dump
with open('storage.bin', 'wb') as f:
    cPickle.dump(data_struct, f)

Later, or in another program, the following statements will bring the data structure back into your program's memory:

# reload
with open('storage.bin', 'rb') as f:
    data_struct = cPickle.load(f)
    vectorizer, selector = data_struct['vectorizer'], data_struct['selector']

# do stuff...
vectors = vectorizer.transform(...)
vec_sel = selector.transform(vectors)
Romain G
  • Seems like the speed of pickle increased a lot. I got `%timeit dump_load_test(t, pickle) 433 µs ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)`. Also, you need to set the file open mode to `'wb'` and `'rb'`. – Louis Yang Oct 10 '18 at 23:59