I would like to know how to save a OneVsRestClassifier model for later prediction.

I am having trouble saving it, because the vectorizer has to be saved as well, as I learnt in this post.

Here's the model I have created:

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')
# Fit on the training text only: a second fit() call would discard the vocabulary learnt here
vectorizer.fit(train_text)

x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['id','comment_text'], axis=1)

x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['id','comment_text'], axis=1)


from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

%%time

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
            ])

for category in categories:
    printmd('**Processing {} comments...**'.format(category))

    # Training logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[category])

    # calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print("\n") 

Any help will be very much appreciated.

Sincerely,

  • @YS-L I'll be grateful if you can help. I have read your [post](https://stackoverflow.com/questions/34069582/how-to-use-save-model-for-prediction-in-python), and I think I have a similar issue, but I can't work it out. Thanks. –  Jan 30 '19 at 19:16

1 Answer

With joblib you can save an entire scikit-learn Pipeline together with all of its steps, including the fitted TfidfVectorizer.

Here I have rewritten your example using the first 200 examples of the 20 Newsgroups dataset:

from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups()

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,3), norm='l2')

x_train = data.data[:100]
y_train = data.target[:100]

x_test =  data.data[100:200]
y_test = data.target[100:200]

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag', 
                                                   class_weight='balanced'), 
                                n_jobs=-1))
                           ])

# Training logistic regression model on train data
LogReg_pipeline.fit(x_train, y_train)

In the code above you first define your train and test data and instantiate your TfidfVectorizer. You then define a pipeline that contains both the vectorizer and the OvR classifier, and you fit it to the training data. It will learn to predict all the classes at once.

Now you simply save the entire fitted pipeline as if it were a single predictor, using joblib:

from joblib import dump, load
dump(LogReg_pipeline, 'LogReg_pipeline.joblib') 

Your entire model is now saved to disk under the name 'LogReg_pipeline.joblib'. You can load it again and use it directly on raw data with this code snippet:

clf = load('LogReg_pipeline.joblib') 
clf.predict(x_test)

You will get the predictions on the raw text because the pipeline will vectorize it automatically.
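To check that the round trip really preserves the vectorizer along with the classifier, here is a minimal, self-contained sketch. The tiny corpus, the labels and the temporary file name are made up for illustration, and the default LogisticRegression solver is used instead of 'sag' only to keep the toy example quick to fit.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Tiny made-up corpus with three classes (illustrative data only).
texts = [
    "the cat sat on the mat", "dogs and cats are pets",
    "stocks fell on the market", "the bank raised interest rates",
    "the team won the match", "the striker scored a goal",
]
labels = np.array([0, 0, 1, 1, 2, 2])

# Vectorizer and classifier live in one pipeline, so both are saved together.
pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
pipeline.fit(texts, labels)

# Save to a temporary file and load it back into a new object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")
    dump(pipeline, path)
    restored = load(path)

# Both pipelines take raw strings; no separate transform() call is needed.
new_texts = ["the cat chased the dog", "interest rates and the market"]
original_pred = pipeline.predict(new_texts)
restored_pred = restored.predict(new_texts)
same_predictions = np.array_equal(original_pred, restored_pred)
print(same_predictions)
```

The restored object is used exactly like the original one, which is why nothing else needs to be stored next to the .joblib file.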

Luca Massaron
  • Thank you very much for your help. I get stuck when trying to run your model with my own data. I get an error telling me `np.nan is an invalid document, expected byte or unicode string` –  Jan 31 '19 at 09:17
  • Maybe this answer could help you: https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document – Luca Massaron Jan 31 '19 at 09:26
  • I appreciate it, but the problem remains. In fact, the linked answer suggests doing the vectorization outside the model, as I did in my original model. As per your earlier suggestion, there is no need for a separate vectorization step, since it is now encapsulated in the pipeline. –  Jan 31 '19 at 09:40
  • All you need to do, based on https://stackoverflow.com/questions/39303912/tfidfvectorizer-in-scikit-learn-valueerror-np-nan-is-an-invalid-document, is to pass x_train.astype('U') to the pipeline when training and x_test.astype('U') when predicting (this converts your text data into Unicode) – Luca Massaron Jan 31 '19 at 09:45
  • Thanks for your help. I entered `LogReg_pipeline.fit(x_train.astype('U'), y_train.astype('U'))`, but then I got an error saying `ValueError: Multioutput target data is not supported with label binarization` –  Jan 31 '19 at 10:01
  • It seems you are using binary targets, but OneVsRestClassifier supports multiclass predictions, that is targets of the kind [0,3,4,1,2,2,...] not [0,1,0,0,1,0,...] (see documentation: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) – Luca Massaron Jan 31 '19 at 11:14
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/187651/discussion-between-josepmaria-and-luca-massaron). –  Jan 31 '19 at 11:21
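
For readers who hit the same two errors from the comments, here is a hedged sketch of one way to handle them. It assumes the data looks like the question's DataFrame: a `comment_text` column that may contain NaN, plus several 0/1 label columns. The column names `toxic` and `insult` and all the rows are invented for illustration. Only the text is converted to strings, and the labels are kept as a numeric 0/1 matrix, which OneVsRestClassifier accepts as a multilabel indicator target. Casting the labels to strings as well is a plausible cause of the "Multioutput target data" error, but that is an assumption and not something confirmed in the thread.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Made-up stand-in for the question's training DataFrame.
train = pd.DataFrame({
    "comment_text": ["you are awful", "have a nice day", np.nan,
                     "awful and rude reply", "nice work, thanks", "rude comment"],
    "toxic":  [1, 0, 0, 1, 0, 1],
    "insult": [1, 0, 0, 0, 0, 1],
})

# Replace missing comments with empty strings so the vectorizer only sees str.
text = train["comment_text"].fillna("").astype(str)

# Keep the labels numeric: a 2D array of 0/1 values, one column per category.
y = train[["toxic", "insult"]].to_numpy()

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
pipeline.fit(text, y)

# One row per input text, one 0/1 column per category.
pred = pipeline.predict(["awful rude person", "nice day"])
print(pred.shape)
```

With this layout a single fit covers every category, so the per-category loop from the question is not needed, and the fitted pipeline can be saved with joblib in the same way as in the answer.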