26

I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object to reuse it later. I tried to use pickle, but it gave the following error.

loc=open('vectorizer.obj','w')
pickle.dump(self.vectorizer,loc)
*** TypeError: can't pickle instancemethod objects

I tried using joblib in sklearn.externals, which again gave a similar error. Is there any way to save this object so that I can reuse it later?

Here is my full object:

class changeToMatrix(object):
    def __init__(self,ngram_range=(1,1),tokenizer=StemTokenizer()):
        from sklearn.feature_extraction.text import TfidfVectorizer
        self.vectorizer = TfidfVectorizer(ngram_range=ngram_range,analyzer='word',lowercase=True,
                                          token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',
                                          tokenizer=tokenizer)

    def load_ref_text(self,text_file):
        textfile = open(text_file,'r')
        lines = textfile.readlines()
        textfile.close()
        sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
        sentences = [item.strip().strip('.') for item in sent_tokenizer.tokenize(' '.join(lines).strip())]
        #vectorizer is transformed in this step
        chk2 = pd.DataFrame(self.vectorizer.fit_transform(sentences).toarray())
        return sentences, [chk2]

    def get_processed_data(self,data_loc):
        ref_sentences,ref_dataframes=self.load_ref_text(data_loc)
        loc = open("indexedData/vectorizer.obj","w")
        pickle.dump(self.vectorizer,loc) #getting error here
        loc.close()
        return ref_sentences, ref_dataframes
Joswin K J

2 Answers

17

Firstly, it's better to keep the import at the top of your code instead of inside your class:

from sklearn.feature_extraction.text import TfidfVectorizer
class changeToMatrix(object):
  def __init__(self,ngram_range=(1,1),tokenizer=StemTokenizer()):
    ...

Next, StemTokenizer doesn't seem to be a canonical class. Possibly you got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html, or maybe somewhere else, so we'll assume it returns a list of strings:

from nltk import word_tokenize
from nltk.corpus import wordnet as wn

class StemTokenizer(object):
    def __init__(self):
        self.ignore_set = {'footnote', 'nietzsche', 'plato', 'mr.'}

    def __call__(self, doc):
        words = []
        for word in word_tokenize(doc):
            word = word.lower()
            w = wn.morphy(word)
            if w and len(w) > 1 and w not in self.ignore_set:
                words.append(w)
        return words

Now, to answer your actual question: it's possible that you need to open the file in binary mode before dumping the pickle, i.e.:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from nltk import word_tokenize
>>> import cPickle as pickle
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=word_tokenize)
>>> vectorizer
TfidfVectorizer(analyzer='word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(0, 2), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents='unicode', sublinear_tf=False,
        token_pattern='[a-zA-Z0-9]+',
        tokenizer=<function word_tokenize at 0x7f5ea68e88c0>, use_idf=True,
        vocabulary=None)
>>> with open('vectorizer.pk', 'wb') as fin:
...     pickle.dump(vectorizer, fin)
... 
>>> exit()
alvas@ubi:~$ ls -lah vectorizer.pk 
-rw-rw-r-- 1 alvas alvas 763 Jun 15 14:18 vectorizer.pk

Note: using the with idiom for I/O file access automatically closes the file once you leave the with block.
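To reuse the saved vectorizer later, load it back from the file in binary read mode. A minimal sketch; it assumes, as in the question's load_ref_text, that the vectorizer was fitted before it was pickled, so transform can be called directly:

>>> import cPickle as pickle
>>> with open('vectorizer.pk', 'rb') as fin:
...     vectorizer = pickle.load(fin)
... 
>>> vectorizer.transform(['a new sentence to score']).toarray()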

Regarding the issue with SnowballStemmer(), note that SnowballStemmer('english') is an object while the stemming function is SnowballStemmer('english').stem.

IMPORTANT:

  • TfidfVectorizer's tokenizer parameter expects a callable that takes a string and returns a list of strings
  • but the Snowball stemmer's stem function takes a single word and returns a single stemmed word, not a list of strings.

So you will need to do this:

>>> from nltk.stem import SnowballStemmer
>>> from nltk import word_tokenize
>>> stemmer = SnowballStemmer('english').stem
>>> def stem_tokenize(text):
...     return [stemmer(i) for i in word_tokenize(text)]
... 
>>> vectorizer = TfidfVectorizer(ngram_range=(0,2),analyzer='word',lowercase=True, token_pattern='[a-zA-Z0-9]+',strip_accents='unicode',tokenizer=stem_tokenize)
>>> with open('vectorizer.pk', 'wb') as fin:
...     pickle.dump(vectorizer, fin)
...
>>> exit()
alvas@ubi:~$ ls -lah vectorizer.pk 
-rw-rw-r-- 1 alvas alvas 758 Jun 15 15:55 vectorizer.pk
alvas
  • Opening the file in byte mode did not work. But I figured out the issue: it was the StemTokenizer class causing it. While initializing that class, I had set `self.snowball_stemmer = SnowballStemmer('english')`. When I moved this into the `__call__` part, it worked. I am not sure why it worked though. – Joswin K J Jun 15 '15 at 13:09
  • You need to make sure that whatever the tokenizer function is, it returns a list of strings. – alvas Jun 15 '15 at 13:17
  • It returns a list of strings only. The error was removed when I changed `self.snowball_stemmer = SnowballStemmer('english')` to `snowball_stemmer = SnowballStemmer('english')`. Basically I removed this from the attributes of the class and the error was fixed. – Joswin K J Jun 15 '15 at 13:42
  • Ahhh, it's because `SnowballStemmer('english')` is an object; what you need is the callable `SnowballStemmer('english').stem` – alvas Jun 15 '15 at 13:56
  • Hi! I am trying to save a pickle for transforming text with TfidfVectorizer; it is 76 MB and I need to reduce it to 10 MB. Will the dtype= parameter help reduce the size? – anitasp Sep 24 '18 at 22:13
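For reference, a rough sketch of the change described in the comments above; the method bodies are assumptions, only the moved `SnowballStemmer('english')` line comes from the comments:

from nltk import word_tokenize
from nltk.stem import SnowballStemmer

# Before: the stemmer lives on the instance, and pickling the vectorizer
# (which holds this tokenizer) raised the TypeError from the question.
class StemTokenizer(object):
    def __init__(self):
        self.snowball_stemmer = SnowballStemmer('english')

    def __call__(self, doc):
        return [self.snowball_stemmer.stem(w) for w in word_tokenize(doc)]

# After: the stemmer is created locally inside __call__, so it is no longer
# part of the tokenizer's pickled state, and the dump succeeded.
class StemTokenizer(object):
    def __call__(self, doc):
        snowball_stemmer = SnowballStemmer('english')
        return [snowball_stemmer.stem(w) for w in word_tokenize(doc)]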
1

If you arrived at this Q/A looking to pickle a vectorizer to save space on disk, you can either use joblib, which comes with scikit-learn, with compress=True, or use the built-in gzip module along with pickle. A working example looks like the following; for my use cases it shrinks the file to half its size or less.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
import joblib
import pickle
import gzip

data = fetch_20newsgroups().data
tvec = TfidfVectorizer()
tvec.fit(data)

# option #1
joblib.dump(tvec, 'vectorizer.pkl', compress=True)

# option #2
with gzip.open('vectorizer.pkl', 'wb') as f:
    pickle.dump(tvec, f)
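
Loading the vectorizer back is symmetric (a short sketch, assuming the file names from above); joblib.load handles the compression transparently, while the gzip variant is read back with gzip.open:

# option #1
tvec = joblib.load('vectorizer.pkl')

# option #2
with gzip.open('vectorizer.pkl', 'rb') as f:
    tvec = pickle.load(f)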
cottontail