I'm trying to build a prediction model from text reviews: I want to predict how many stars (1, 2, 3, 4 or 5) a product will get, based on the text of previous reviews.
I followed the scikit-learn tutorial on working with text data, but my model always predicts a 5-star rating, which gives 66% accuracy.
How can I make sure my model doesn't simply predict the largest class every time?
Here's the data (700MB): Movies and TV 5-core (1,697,533 reviews)
Here's my subset of the data (1MB): Movies and TV first 1000 rows
I'm using the first 1000 rows for testing. When I add more rows the predictions get worse: with 10,000 rows the score is 0.6.
Distribution of ratings in the first 1000 rows (rating, count):
5 678
4 133
1 70
3 69
2 50
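For reference, the counts above already show what a model scores if it always picks the most common rating. This is a minimal sketch that uses only the numbers listed here; the variable names are my own and it is not part of the pipeline below.

```python
# Rating counts copied from the distribution above (first 1000 rows).
rating_counts = {5: 678, 4: 133, 1: 70, 3: 69, 2: 50}

total = sum(rating_counts.values())

# The rating that occurs most often, i.e. the "largest class".
majority_rating = max(rating_counts, key=rating_counts.get)

# Accuracy obtained by always predicting that rating on these 1000 rows.
baseline_accuracy = rating_counts[majority_rating] / total

print(f"majority class: {majority_rating}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

So an accuracy of about 0.66 on the held-out 10% is roughly what always answering 5 would give, which is why I suspect the model has learned nothing beyond the class frequencies.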
Here's my code:
import pandas as pd
import numpy as np
# Select columns (data is the DataFrame holding the reviews, loaded beforehand)
df = data[['reviewText', 'overall']]
# Make a smaller set while creating model
df_small = df.head(1000)
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df_small[['reviewText']], df_small[['overall']],
    test_size=0.1, random_state=42)
X_train = X_train.values.ravel() # https://stackoverflow.com/a/26367429
X_test = X_test.values.ravel()
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()
# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
from sklearn.feature_extraction.text import TfidfTransformer
# Convert the raw word counts to TF-IDF weights
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# Fit
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, y_train)
# Test
docs_new = X_test
X_new_counts = vectorizer.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
np.mean(predicted == y_test)
Output: 0.66
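To check whether the classifier really outputs only one class, the predictions can be counted and the accuracy broken down per rating. The following is a rough sketch using only the standard library. The helper `per_class_recall` and the toy lists are made up for illustration and are not results from the real data; with the code above, `y_test` and `predicted` would take their place.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: fraction of true examples of that label predicted correctly}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Made-up labels for illustration only (hypothetical, not from the dataset).
toy_true = [5, 5, 5, 4, 3, 1]
toy_pred = [5, 5, 5, 5, 5, 5]

# How often each rating was predicted; a single key means one class only.
predicted_counts = Counter(toy_pred)

# Per-rating recall exposes classes that are never predicted correctly.
recall = per_class_recall(toy_true, toy_pred)

# Overall accuracy, the same quantity as np.mean(predicted == y_test).
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

print(predicted_counts)
print(recall)
print(accuracy)
```

In the toy case the overall accuracy looks acceptable while every rating other than 5 has a recall of 0, which is the pattern I expect to see in my own predictions.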