How to save a classifier in sklearn with CountVectorizer() and TfidfTransformer()

Question

Below is some code for a classifier. I used pickle to save and load the classifier as instructed in this page. However, when I load it, I cannot use CountVectorizer() and TfidfTransformer() to convert raw text into vectors that the classifier can use.

The only way I was able to get it to work is to analyze the text immediately after training the classifier, as seen below.

import os
import sklearn
from sklearn.datasets import load_files

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

from sklearn.feature_extraction.text import CountVectorizer
import nltk

import pandas
import pickle

class Classifier:

    def __init__(self):

        self.moviedir = os.getcwd() + '/txt_sentoken'

    def Training(self):

        # loading all files. 
        self.movie = load_files(self.moviedir, shuffle=True)


        # Split data into training and test sets
        docs_train, docs_test, y_train, y_test = train_test_split(self.movie.data, self.movie.target, 
                                                                  test_size = 0.20, random_state = 12)

        # initialize CountVectorizer
        self.movieVzer = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize, max_features=5000)

        # fit and transform using training text
        docs_train_counts = self.movieVzer.fit_transform(docs_train)


        # Convert raw frequency counts into TF-IDF values
        self.movieTfmer = TfidfTransformer()
        docs_train_tfidf = self.movieTfmer.fit_transform(docs_train_counts)

        # Using the fitted vectorizer and transformer, transform the test data
        docs_test_counts = self.movieVzer.transform(docs_test)
        docs_test_tfidf = self.movieTfmer.transform(docs_test_counts)

        # Now ready to build a classifier.
        # We will use Multinomial Naive Bayes as our model


        # Train a Multinomial Naive Bayes classifier. Again, we call it "fitting"
        self.clf = MultinomialNB()
        self.clf.fit(docs_train_tfidf, y_train)


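        # save the model (only the classifier is pickled here,
        # not the fitted vectorizer or transformer)
        filename = 'finalized_model.pkl'
        with open(filename, 'wb') as fout:
            pickle.dump(self.clf, fout)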

        # Predict the Test set results, find accuracy
        y_pred = self.clf.predict(docs_test_tfidf)

        # Accuracy
        print(sklearn.metrics.accuracy_score(y_test, y_pred))

        self.Categorize()

    def Categorize(self):
        # very short and fake movie reviews
        reviews_new = ['This movie was excellent', 'Absolute joy ride', 'It is pretty good', 
                      'This was certainly a movie', 'I fell asleep halfway through', 
                      "We can't wait for the sequel!!", 'I cannot recommend this highly enough', 'What the hell is this shit?']

        reviews_new_counts = self.movieVzer.transform(reviews_new)         # turn text into count vector
        reviews_new_tfidf = self.movieTfmer.transform(reviews_new_counts)  # turn into tfidf vector


        # have classifier make a prediction
        pred = self.clf.predict(reviews_new_tfidf)

        # print out results
        for review, category in zip(reviews_new, pred):
            print('%r => %s' % (review, self.movie.target_names[category]))

jun · Accepted Answer · 2023-05-02T14:44:35.377

13

Following MaximeKan's suggestion, I found a way to save all three objects: the fitted CountVectorizer, the fitted TfidfTransformer, and the classifier.

Saving the model and the vectorizers:

import pickle

with open('finalized_model.pkl', 'wb') as fout:
    pickle.dump((movieVzer, movieTfmer, clf), fout)

Loading the model and the vectorizers for use:

import pickle

with open('finalized_model.pkl', 'rb') as f:
    movieVzer, movieTfmer, clf = pickle.load(f)
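For completeness, here is a minimal end-to-end sketch of the round trip above. The toy training documents, the temporary file location, and the loaded_* variable names are assumptions made for illustration; the original corpus and the nltk tokenizer are left out so the snippet runs on its own. The key point is that the loaded vectorizer and transformer are only asked to transform() new text, never to fit again.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for the movie review corpus (hypothetical data).
train_docs = ["great fun movie", "excellent joy ride",
              "boring dull plot", "awful waste of time"]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Fit all three objects on the training text.
movieVzer = CountVectorizer()
movieTfmer = TfidfTransformer()
clf = MultinomialNB()
X_train = movieTfmer.fit_transform(movieVzer.fit_transform(train_docs))
clf.fit(X_train, train_labels)

# Save the three fitted objects together as one tuple.
path = os.path.join(tempfile.mkdtemp(), "finalized_model.pkl")
with open(path, "wb") as fout:
    pickle.dump((movieVzer, movieTfmer, clf), fout)

# Later (for example in another process): load and reuse them.
with open(path, "rb") as f:
    loaded_vzer, loaded_tfmer, loaded_clf = pickle.load(f)

reviews_new = ["excellent movie", "dull and boring"]
counts = loaded_vzer.transform(reviews_new)   # transform only, no fitting
tfidf = loaded_tfmer.transform(counts)
pred = loaded_clf.predict(tfidf)
print(list(pred))
```

If the vectorizer was built with a custom tokenizer such as nltk.word_tokenize, as in the question, the module providing that function presumably needs to be importable wherever the pickle is loaded, since pickle stores a reference to the function rather than its code.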

edited May 02 '23 at 14:44

answered Sep 21 '19 at 01:06


score 2 · Answer 2 · answered Sep 20 '19 at 00:40

This is happening because you should save not only the classifier but also the vectorizers. Otherwise, you are refitting the vectorizers on unseen data, which will not contain exactly the same words as the training data, so the vocabulary and therefore the number of feature columns will change. This is an issue because your classifier expects its input to have the same columns it was trained on.
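The mismatch described above can be reproduced with a small sketch. The toy documents and variable names are assumptions for illustration only; TF-IDF is skipped to keep the example short, since the column count is already determined by the CountVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical).
train_docs = ["great fun movie", "excellent joy ride",
              "boring dull plot", "awful waste of time"]
train_labels = [1, 1, 0, 0]

vzer = CountVectorizer()
X_train = vzer.fit_transform(train_docs)
clf = MultinomialNB().fit(X_train, train_labels)

new_docs = ["excellent movie", "dull and boring"]

# Wrong: a fresh vectorizer learns a new vocabulary from the new text,
# so its output has a different number of columns.
refit_counts = CountVectorizer().fit_transform(new_docs)

# Right: the original fitted vectorizer maps new text onto the
# training vocabulary, keeping the column count unchanged.
reused_counts = vzer.transform(new_docs)

print(X_train.shape[1], refit_counts.shape[1], reused_counts.shape[1])

# The classifier rejects input whose width differs from the training data.
try:
    clf.predict(refit_counts)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__)
```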

Thus, the solution for your problem is quite simple: you should also save your vectorizers as pickle files and load them along with your classifier before using them.

Note: to avoid having several objects to save and load, you could consider putting them together in a Pipeline, which behaves equivalently and can be pickled as a single object.
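A minimal sketch of the pipeline approach, assuming toy training data and illustrative step names ("counts", "tfidf", "clf") that are not from the question. The pipeline is serialized to bytes here with pickle.dumps for brevity; pickle.dump to a file should work the same way.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical).
train_docs = ["great fun movie", "excellent joy ride",
              "boring dull plot", "awful waste of time"]
train_labels = [1, 1, 0, 0]

# Chain the vectorizer, the TF-IDF transformer, and the classifier.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
pipeline.fit(train_docs, train_labels)  # fits every step on raw text

# One object to save and one object to load.
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

# The restored pipeline accepts raw text directly.
reviews_new = ["excellent movie", "dull and boring"]
pred = restored.predict(reviews_new)
print(list(pred))
```

With this layout, prediction code never calls transform() by hand, which removes the chance of accidentally refitting a vectorizer on new text.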
