
I want to save a text classifier. I am using a TfidfVectorizer in my code like this:

vectorizer = TfidfVectorizer(analyzer='word', tokenizer=tokenize, lowercase=True, stop_words='english',
                             max_features=1100)
corpus_data_features = vectorizer.fit_transform(train_data_df.Text.tolist() + test_data_df.Text.tolist())

# Converting the document-term matrix to a NumPy ndarray
corpus_data_features_nd = (corpus_data_features.toarray())
calibrated_svc.fit(X=corpus_data_features_nd[0:len(train_data_df)], y=train_data_df.Domain)
test_pred=calibrated_svc.predict(corpus_data_features_nd[len(train_data_df):])

After training, I can save the model and reuse it. But when I want to reuse the model, I must create corpus_data_features again:

 corpus_data_features = vectorizer.fit_transform(train_data_df.Text.tolist() + test_data_df.Text.tolist())

Saving the classifier this way therefore does not speed up classification. How can I split corpus_data_features into two parts, so that I can use a saved matrix for train_data_df and then add test_data_df when I load my model?

mrmrn
  • Use joblib or pickle to save the TfidfVectorizer. – Vivek Kumar Apr 07 '18 at 07:33
  • After saving, it is useless, because I must compute a new tf-idf from the new words plus the training data. I think I must split 'corpus_data_features = vectorizer.fit_transform(train_data_df.Text.tolist() + test_data_df.Text.tolist())', but how? – mrmrn Apr 07 '18 at 08:04
  • No. Only call fit() or fit_transform() on the training data. To convert the testing data, only call transform(). So it's not useless. Please check the usage. When you call fit_transform() on the testing data, you leak knowledge about unseen data to the model, and it will give unrealistic results. – Vivek Kumar Apr 07 '18 at 10:40
  • corpus_train_data_features = vectorizer.fit_transform(train_data_df.Text.tolist()); corpus_test_data_features = vectorizer.transform(test_data_df.Text.tolist()). I used this and the results are surprisingly better than before! Thank you @VivekKumar. I think I can now save the TfidfVectorizer by dumping its vocabulary property to a file with pickle. – mrmrn Apr 07 '18 at 20:41
  • You can store the whole TfidfVectorizer object. – Vivek Kumar Apr 09 '18 at 05:27
  • Please refer to this Stack Overflow answer: https://stackoverflow.com/a/52337642/5117127 – Iyyappan Amirthalingam Oct 08 '19 at 14:33
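
The advice in the comments can be sketched roughly as follows. This is a minimal example under stated assumptions, not a definitive implementation: the toy texts and labels, the file name tfidf_svc.pkl, and the use of CalibratedClassifierCV around LinearSVC for calibrated_svc are all assumptions, since the question does not show them. The custom tokenize function is left out so that the snippet runs on its own.

```python
import os
import pickle
import tempfile

from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical stand-ins for train_data_df.Text and train_data_df.Domain.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the game",
    "the new laptop has a fast processor",
    "the software update fixed a security bug",
    "developers released a new programming library",
]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]

# Hypothetical stand-in for test_data_df.Text (texts that arrive later).
new_texts = ["the goalkeeper saved the match", "a bug in the library was fixed"]

# Fit the vectorizer on the training texts ONLY.
vectorizer = TfidfVectorizer(analyzer='word', lowercase=True,
                             stop_words='english', max_features=1100)
train_features = vectorizer.fit_transform(train_texts)

# Assumed classifier; it accepts the sparse matrix, so toarray() is not needed here.
calibrated_svc = CalibratedClassifierCV(LinearSVC(), cv=2)
calibrated_svc.fit(train_features, train_labels)
original_pred = calibrated_svc.predict(vectorizer.transform(new_texts))

# Save the fitted vectorizer and the fitted classifier together.
model_path = os.path.join(tempfile.gettempdir(), "tfidf_svc.pkl")
with open(model_path, "wb") as f:
    pickle.dump((vectorizer, calibrated_svc), f)

# Later (for example in another process): load both and classify new texts.
with open(model_path, "rb") as f:
    loaded_vectorizer, loaded_svc = pickle.load(f)

# transform() reuses the stored vocabulary and idf weights; nothing is refitted.
new_features = loaded_vectorizer.transform(new_texts)
loaded_pred = loaded_svc.predict(new_features)
print(loaded_pred)
```

If a custom tokenizer such as tokenize is passed to the vectorizer, pickle stores the function by reference, so it must be importable under the same module and name when the file is loaded.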
