
I am looking for a way to load vectors I generated previously using scikit-learn's TfidfVectorizer. More generally, I want a better understanding of how the TfidfVectorizer's data is persisted.

For instance, what I did so far is:

vectorizer = TfidfVectorizer(stop_words=stop)
vect_train = vectorizer.fit_transform(corpus)

Then I wrote 2 functions in order to be able to save and load my vectorizer:

import joblib

def save_model(model, name):
    '''
    Function that enables us to save a trained model
    '''
    joblib.dump(model, '{}.pkl'.format(name))


def load_model(name):
    '''
    Function that enables us to load a saved model
    '''
    return joblib.load('{}.pkl'.format(name))
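As a quick sanity check, the two helpers above can be exercised end to end (they are repeated here so the snippet is self-contained; the file name 'tfidf' is arbitrary):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

def save_model(model, name):
    '''Save a trained model to <name>.pkl.'''
    joblib.dump(model, '{}.pkl'.format(name))

def load_model(name):
    '''Load a saved model from <name>.pkl.'''
    return joblib.load('{}.pkl'.format(name))

# Fit a small vectorizer, persist it, and restore it.
vectorizer = TfidfVectorizer()
vectorizer.fit(["the cat sat on the mat", "dogs chase cats"])
save_model(vectorizer, 'tfidf')

restored = load_model('tfidf')
# The restored vectorizer keeps the learned vocabulary and idf values.
assert restored.vocabulary_ == vectorizer.vocabulary_
```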

I checked posts like the one below, but I still couldn't make much sense of it.

How do I store a TfidfVectorizer for future use in scikit-learn?

What I ultimately wish is to be able to have a training session, then load the set of produced vectors, transform new text input based on those vectors, and compute cosine_similarity between the old vectors and the new ones generated from them.
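In other words, the target workflow looks roughly like this (a minimal sketch with a toy corpus; `cosine_similarity` from sklearn.metrics.pairwise does the scoring):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat", "dogs chase cats", "the dog barked"]

# Fit once on the training corpus -> m x n sparse tf-idf matrix.
vectorizer = TfidfVectorizer()
vect_train = vectorizer.fit_transform(corpus)

# Later: project an unseen query into the SAME feature space...
new_query = ["a cat chased a dog"]
vect_query = vectorizer.transform(new_query)    # 1 x n

# ...and score it against the stored training vectors.
scores = cosine_similarity(vect_query, vect_train)  # shape (1, m)
```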

One of the reasons I wish to do this is that vectorizing such a large dataset takes approximately 10 minutes, and I want to do it once rather than every time a new query comes in.

I guess what I should be saving is vect_train, right? But then what is the correct way to save it and later load it into a newly created instance of TfidfVectorizer?

The first time I tried to save vect_train with joblib, as the scikit-learn maintainers advise, I got 4 files: tfidf.pkl, tfidf.pkl_01.npy, tfidf.pkl_02.npy, tfidf.pkl_03.npy. It would be great to know what exactly those files are and how to load them into a new instance of

vectorizer = TfidfVectorizer(stop_words=stop)

created in a different script.

Thank you in advance.

Swan87
  • To clarify, you have a "training" set that you want to convert to a matrix of tf-idf vectors. Then you want to save that `m x n` matrix. Later, in a new session, you want to reload that matrix and use it to calculate cosine distance to a query? Just making sure I understand before writing an answer – Nick Becker Sep 21 '16 at 14:50
  • Yeah, that is what I wish to do! To elaborate a bit more, my ultimate goal is to have a time-consuming training session that returns the matrix you mentioned, and then, by loading that matrix, to transform an "unseen" query. I wasn't sure whether, by just saving the vectoriser, I could reload it and use it to initialise a new instance with the previously extracted vectors. – Swan87 Sep 22 '16 at 08:15
  • You should be all set with the already written answer. Save the vectorizer and also save `vect_train`. Reload vect_train so you have your tf-idf matrix. Then reload the vectorizer and use it to transform the new query into the same `m x n` space. From there, you can use any number of methods to calculate cosine distance/similarity. – Nick Becker Sep 22 '16 at 13:21

1 Answer


The result of your vect_train = vectorizer.fit_transform(corpus) is twofold: (i) the vectorizer fits your data, that is, it learns the corpus vocabulary and the idf of each term, and (ii) vect_train is populated with the vectors of your corpus.

The save_model and load_model functions you propose persist and load the vectorizer, that is, the internal parameters it has learned, such as the vocabulary and the idfs. Once the vectorizer is loaded, all you need to do to get vectors is transform a list of documents. It can be unseen data, or the raw data you used during fit_transform. Therefore, all you need is:

vectorizer = load_model(name)
vect_train = vectorizer.transform(corpus) # (1) or any unseen data

At this point, you have everything you had before saving, but the transformation call (1) will take some time depending on your corpus size. If you want to skip it, you also need to save the content of vect_train, as you correctly suspect in your question. This is a sparse matrix and can be saved/loaded using scipy; you can find information in this question, for example. Copying from that question, to save the csr matrices you also need:

import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])

Concluding, the above functions can be used to save/load your vect_train, whereas the ones you provided save/load the transformer in order to vectorize new data.
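Putting the two halves together, a minimal end-to-end sketch might look like this (file names are placeholders; in practice the two "sessions" would live in separate scripts):

```python
# Session 1: fit once, then persist both the fitted vectorizer
# and the tf-idf matrix it produced.
import joblib
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat", "dogs chase cats", "the dog barked"]
vectorizer = TfidfVectorizer()
vect_train = vectorizer.fit_transform(corpus)

joblib.dump(vectorizer, 'tfidf.pkl')
np.savez('vect_train.npz', data=vect_train.data, indices=vect_train.indices,
         indptr=vect_train.indptr, shape=vect_train.shape)

# Session 2 (e.g. a different script): reload both and score a new query.
vectorizer = joblib.load('tfidf.pkl')
loader = np.load('vect_train.npz')
vect_train = csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                        shape=tuple(loader['shape']))

query_vec = vectorizer.transform(["a cat and a dog"])
scores = cosine_similarity(query_vec, vect_train)  # shape (1, len(corpus))
```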

geompalik
  • I will try this and hope for it to work! Dumb follow-up question: as I mentioned above, when saving `vect_train` I get 4 files, i.e. `tfidf.pkl, tfidf.pkl_01.npy, tfidf.pkl_02.npy, tfidf.pkl_03.npy`. When loading it back, should I just use `tfidf.pkl` and every related file will be loaded in as well, or should I find a way to load them all into the new vectoriser? Cheers! – Swan87 Sep 22 '16 at 08:18
  • just tfidf.pkl. It is split into more than one file when the volume of data to be serialized is big. – geompalik Sep 22 '16 at 08:40
  • If you are still struggling, the next time you save `vect_train`, the tf-idf matrix, use `save_sparse_csr` to save it as a .npz file. Then use `load_sparse_csr` to bring it back in. You can still use your save/load for the vectorizer. – Nick Becker Sep 22 '16 at 13:23