I am looking for a way to load vectors I generated previously using scikit-learn's TfidfVectorizer. In general what I wish is to get a better understanding of the TfidfVectorizer's data persistence.
For instance, what I did so far is:
vectorizer = TfidfVectorizer(stop_words=stop)
vect_train = vectorizer.fit_transform(corpus)
Then I wrote 2 functions in order to be able to save and load my vectorizer:
def save_model(model,name):
'''
Function that enables us to save a trained model
'''
joblib.dump(model, '{}.pkl'.format(name))
def load_model(name):
'''
Function that enables us to load a saved model
'''
return joblib.load('{}.pkl'.format(name))
I checked posts like the one below but i still didn't manage to make much sense.
How do I store a TfidfVectorizer for future use in scikit-learn?
What I ultimately wish is to be able to have a training session and then load this set of produced vectors, transform some newly text input based on those vectors and perform cosine_similarity using old vectors and new ones generated based on them.
One of the reasons that I wish to do this is because the vectorization in such a large dataset takes approximately 10 minutes and I wish to do this once and not every time a new query comes in.
I guess what I should be saving is vect_train right? But then which is the correct way to firstly save it and then load it to a newly created instance of TfidfVectorizer?
First time I tried to save vect_train with joblib as the kind people in scikit-learn advise to do I got 4 files: tfidf.pkl, tfidf.pkl_01.npy, tfidf.pkl_02.npy, tfidf.pkl_03.npy. It would be great if I knew what exactly are those and how I could load them to a new instance of
vectorizer = TfidfVectorizer(stop_words=stop)
created in a different script.
Thank you in advance.