Text classifier always predicts the largest class

Question

I'm trying to build a prediction model from text reviews: I want to predict how many stars (1, 2, 3, 4, 5) a product will get, based on the text of previous reviews.

I followed the scikit-learn tutorial on text data, but my model always predicts 5-star ratings, giving a 66% success rate.

How can I make sure my model doesn't simply predict the largest class every time?

Here's the data (700MB): Movies and TV 5-core (1,697,533 reviews)

Here's my subset of the data (1MB): Movies and TV first 1000 rows

I'm using the first 1000 rows for testing. When I add more rows, the prediction simply gets worse: for 10000 rows, the score is 0.6.

Distribution of ratings in the first 1000 rows:

Here's my code:

import pandas as pd
import numpy as np

# Select columns (`data` is the reviews DataFrame, loaded elsewhere)
df = data[['reviewText','overall']]

# Make a smaller set while creating model

df_small = df.head(1000)

# Train test split

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_small[['reviewText']], df_small[['overall']], 
    test_size=0.1, random_state=42)

X_train = X_train.values.ravel() # https://stackoverflow.com/a/26367429
X_test = X_test.values.ravel()
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()

# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train) 

from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

# Fit

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# Test

docs_new = X_test
X_new_counts = vectorizer.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

np.mean(predicted == y_test)

Output: 0.66

`df_small = df.head(1000)` What's the distribution of scores in the first 1000 lines of data? — Blorgbeard, Apr 02 '19 at 21:20
Also, it's great that you uploaded the data for us to test with, but maybe you could cut it down to the subset that you actually use? — Blorgbeard, Apr 02 '19 at 21:23
@Blorgbeard added the count of each rating to the post, thanks. It's super biased towards 5 and even 4 star ratings. Will try to upload the subset. — ViktorMS, Apr 02 '19 at 21:34
I could be very wrong, but this sounds like a class imbalance problem... — m13op22, Apr 02 '19 at 21:44
@HS-nebula I wondered, however, now when I try a subset of balanced classes, 500 of each, my accuracy decreases to 0.3 — ViktorMS, Apr 02 '19 at 22:52

score 0 · Answer 1 · answered Apr 04 '19 at 21:11

Have you tried stratified sampling, which splits your classes proportionally between the training and test sets?

Also, try looking at the F1 score and the ROC AUC score instead of plain accuracy.

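from sklearn.model_selection import StratifiedShuffleSplit

# Plain arrays, so the split indices can be used for positional indexing
X = df_small['reviewText'].values
y = df_small['overall'].values

# One stratified split that keeps the rating proportions in both sets
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)

for train_index, test_index in splitter.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]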
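
To illustrate the F1 suggestion above, here is a minimal sketch of why accuracy alone can look fine while the model is useless. The labels are made up for illustration (they are not taken from the review data) and `y_pred` mimics a classifier that always answers 5. ROC AUC is left out because it needs class probabilities (for example from `clf.predict_proba`) rather than hard predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical true ratings: half of them are 5 stars (illustrative only)
y_true = [5, 5, 5, 5, 4, 4, 3, 2, 1, 5]
# A degenerate classifier that always predicts the largest class
y_pred = [5] * len(y_true)

labels = [1, 2, 3, 4, 5]

# Accuracy only reflects the share of the majority class
acc = accuracy_score(y_true, y_pred)

# Macro F1 averages the per-class F1 scores, so ignored classes pull it down;
# zero_division=0 silences warnings for classes that are never predicted
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro", zero_division=0)

# Rows are true ratings, columns are predicted ratings
cm = confusion_matrix(y_true, y_pred, labels=labels)

print("accuracy:", acc)
print("macro F1:", macro_f1)
print(cm)
```

With your own model you would pass `y_test` and `predicted` instead of the toy lists. If every count in the confusion matrix sits in the last column, the model is still just predicting 5 stars, whatever the accuracy says.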
