How to load unlabelled data for sentiment classification after training SVM model?

Question

I am trying to do sentiment classification with an sklearn SVM model. I trained the model on labeled data and got 89% accuracy. Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? And after classifying the unlabeled data, how can I see whether each item was classified as positive or negative?

I am using Python 3.7. Below is the code.

import random
import pandas as pd
data = pd.read_csv("label data for testing .csv", header=0)
sentiment_data = list(zip(data['Articles'], data['Sentiment']))
random.shuffle(sentiment_data)

train_x, train_y = zip(*sentiment_data[:350])
test_x, test_y = zip(*sentiment_data[350:])

from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn import metrics


clf = Pipeline([
    ('vectorizer', CountVectorizer(analyzer="word",
                                   tokenizer=word_tokenize,
                                   preprocessor=lambda text: text.replace("<br />", " "),
                                   max_features=None)),
    ('classifier', LinearSVC())
])

clf.fit(train_x, train_y)
pred_y = clf.predict(test_x)
print("Accuracy : ", metrics.accuracy_score(test_y, pred_y))
print("Precision : ", metrics.precision_score(test_y, pred_y))
print("Recall : ", metrics.recall_score(test_y, pred_y))

When I run this code, I get the output:

ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
Accuracy :  0.8977272727272727
Precision :  0.8604651162790697
Recall :  0.925

What is the meaning of ConvergenceWarning?

Thanks in Advance!

score 1 · Answer 1 · answered Nov 18 '19 at 10:23

Check out this site about model persistence. Then you just load the saved model and call its predict method, which returns the predicted label. If you used any encoder (LabelEncoder, OneHotEncoder), you need to dump and load it separately.
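A minimal sketch of that save-and-load step, assuming joblib as the serializer and a toy dataset invented for illustration. The file name sentiment_clf.joblib is hypothetical, and the pipeline here uses a plain CountVectorizer rather than the one from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy labeled data, invented for illustration (1 = positive, 0 = negative).
train_x = ["great product, loved it", "terrible and boring",
           "really good service", "awful, bad experience"]
train_y = [1, 0, 1, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC()),
])
clf.fit(train_x, train_y)

# Save the whole fitted pipeline (vectorizer + classifier) in one file.
model_path = os.path.join(tempfile.mkdtemp(), "sentiment_clf.joblib")
joblib.dump(clf, model_path)

# Later, possibly in another script: load it and predict on raw text.
loaded_clf = joblib.load(model_path)
new_texts = ["good and great", "boring and awful"]
predictions = loaded_clf.predict(new_texts)
print(predictions)
```

One caveat for the pipeline in the question: a lambda passed as preprocessor cannot be pickled, so it would have to be replaced by a named function defined in an importable module before dumping.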

If I were you, I'd rather take a fully data-driven approach and use a pretrained embedder. It will also work for dozens of languages out of the box, which is quite neat.

There's LASER from Facebook. There's also a PyPI package for it, though unofficial, and it works just fine. Nowadays there are a lot of pretrained models, so it shouldn't be that hard to reach near state-of-the-art scores.

PV8 · Accepted Answer · 2019-11-19T08:53:49.267

What is the meaning of ConvergenceWarning?

As Pavel already mentioned, ConvergenceWarning means that max_iter was hit. You can suppress the warning as described here: How to disable ConvergenceWarning using sklearn?
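If the goal is only to silence the message, a sketch like the following scopes the suppression to a single fit call, using Python's warnings module and sklearn.exceptions.ConvergenceWarning. The toy arrays and max_iter=1 (chosen only so that the warning would fire) are assumptions for illustration.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Toy numeric data, invented for illustration.
X = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y = [0, 1, 0, 1]

# Ignore ConvergenceWarning only inside this block; other warnings
# and other code are unaffected.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    svc = LinearSVC(max_iter=1).fit(X, y)

predicted = svc.predict(X)
print(predicted)
```

Suppressing the warning hides the message but does not make the solver converge, so it is worth checking the metrics before relying on this.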

Now I want to use the model to predict the sentiment of unlabeled data. How can I do that?

You do it with the same command: pred_y = clf.predict(test_x). The only things you adjust are pred_y (the name is your free choice) and test_x, which should be your new, unseen data. It has to have the same number of features as test_x and train_x; with this pipeline that simply means passing the raw article text, because the vectorizer inside the pipeline builds the features.

In your case as you are doing:

sentiment_data = list(zip(data['Articles'], data['Sentiment']))

You are forming a list of tuples (check this out), then you shuffle it and unzip the first 350 rows:

train_x, train_y = zip(*sentiment_data[:350])
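The zip, shuffle, and unzip steps can be seen on a toy list. The sample strings and labels below are invented stand-ins for data['Articles'] and data['Sentiment'].

```python
import random

# Toy stand-ins for data['Articles'] and data['Sentiment'].
articles = ["text a", "text b", "text c", "text d"]
sentiments = [1, 0, 1, 0]

# Pair each article with its label: [("text a", 1), ("text b", 0), ...]
pairs = list(zip(articles, sentiments))

# Shuffle the pairs, so each text stays attached to its own label.
random.shuffle(pairs)

# zip(*...) turns a list of pairs back into two parallel tuples.
train_x, train_y = zip(*pairs[:3])
test_x, test_y = zip(*pairs[3:])

print(train_x, train_y)
print(test_x, test_y)
```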

Here your train_x is the column data['Articles'], so all you have to do if you have new data is:

# The new file only needs the 'Articles' column; no labels are required.
new_data = pd.read_csv("new_data.csv", header=0)
new_y = clf.predict(new_data['Articles'])

how to see whether it is classified as positive or negative?

You can then inspect new_y (or whatever you named the predictions): each entry will be either a 1 or a 0. Normally 0 should be negative, but it depends on how your dataset is set up.
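A self-contained sketch of the whole flow, using toy DataFrames in place of the CSV files. The column names Predicted and Label, the sample texts, and the mapping 0 = negative, 1 = positive are assumptions for illustration; the real mapping depends on how the labeled data was encoded.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy labeled data with the same two columns as in the question.
labeled = pd.DataFrame({
    "Articles": ["great and good news", "bad and awful news",
                 "good results, great year", "awful loss, bad year"],
    "Sentiment": [1, 0, 1, 0],
})

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC()),
])
clf.fit(labeled["Articles"], labeled["Sentiment"])

# Unlabeled data: only the text column, no Sentiment column.
new_data = pd.DataFrame({
    "Articles": ["great year", "awful news", "good and great results"],
})

# One prediction per row, stored next to the text it belongs to.
new_data["Predicted"] = clf.predict(new_data["Articles"])

# Assumed mapping; check it against your own labeled data.
new_data["Label"] = new_data["Predicted"].map({0: "negative", 1: "positive"})
print(new_data)
```

Metrics such as accuracy_score need the true labels, so they cannot be computed for unlabeled predictions. Passing test_y together with predictions for the new data raises a ValueError about inconsistent numbers of samples, because the two sequences have different lengths.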

edited Nov 19 '19 at 08:53

answered Nov 18 '19 at 11:32

PV8


I don't understand what you mean by the same number of features. Does that mean the number of the column? As my labeled data has two columns - one for articles, and the other one for the label (sentiment). If I use unlabeled data, then it will only have one column for articles. Can you please write it in code what you mentioned in your second argument? – Piyush Ghasiya Nov 19 '19 at 02:23
Thank you for helping. I did what you mentioned but I am running into the error: ValueError: Found input variables with inconsistent numbers of samples: [88, 179]. My labeled data have 438 rows and 2 columns. And new_data (unlabeled data) has 179 rows and 1 column. – Piyush Ghasiya Nov 20 '19 at 02:10
The shapes must match, otherwise it will not work; you have to adjust your data... – PV8 Nov 20 '19 at 06:55
I got the 1, but how do I get a polarity for each piece of text in the unlabelled data? – Brndn Jan 27 '20 at 16:10

score -1 · Answer 3 · answered Nov 18 '19 at 09:11

Now I want to use the model to predict the sentiment of unlabeled data. How can I do that? and after classification of unlabeled data, how to see whether it is classified as positive or negative?

Basically, you prepare the unlabeled data in the same way that train_x or test_x is generated. Here that is probably a sequence of n_samples raw texts, which you then pass to clf.predict to obtain predictions. clf.predict outputs the predicted class for each sample. In your case 0 is probably negative and 1 positive, but it's hard to tell without the dataset.
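For a per-text score, not only a class, LinearSVC also provides decision_function, which a Pipeline exposes as well. The sketch below assumes a toy dataset invented for illustration; the score is a signed distance from the separating hyperplane, not a probability.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy labeled data, invented for illustration (1 = positive, 0 = negative).
train_x = ["great product, loved it", "terrible and boring",
           "really good service", "awful, bad experience"]
train_y = [1, 0, 1, 0]

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LinearSVC()),
])
clf.fit(train_x, train_y)

new_texts = ["good and great", "boring and awful"]

# Predicted class for each text.
labels = clf.predict(new_texts)

# Signed score for each text: above 0 means clf.classes_[1],
# otherwise clf.classes_[0]. Larger magnitude means further from the boundary.
scores = clf.decision_function(new_texts)

for text, label, score in zip(new_texts, labels, scores):
    print(f"{text!r}: class={label}, score={score:.3f}")
```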

What is the meaning of ConvergenceWarning?

The LinearSVC model is optimized using an iterative algorithm. The max_iter argument (1000 by default) controls the maximum number of iterations. If the stopping criterion isn't met within that limit, you get a ConvergenceWarning. It shouldn't bother you much, as long as you have acceptable performance in terms of accuracy or other metrics.
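A small sketch for checking whether a given max_iter is enough, assuming toy numeric data invented for illustration. The helper hits_iteration_cap is hypothetical, not part of scikit-learn; it records warnings during fit and reports whether a ConvergenceWarning was among them.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.svm import LinearSVC

# Toy numeric data, invented for illustration.
X = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y = [0, 1, 0, 1]


def hits_iteration_cap(max_iter):
    """Return True if fitting with this max_iter raises ConvergenceWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning instead of printing it.
        warnings.simplefilter("always")
        LinearSVC(max_iter=max_iter).fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)


# A cap of one iteration is reached immediately, so the warning fires.
print(hits_iteration_cap(1))

# A much larger cap gives the solver room to meet its stopping criterion.
print(hits_iteration_cap(10000))
```

In the question's pipeline the same idea would be LinearSVC(max_iter=...) with a value above the default of 1000.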
