
I am currently doing an assignment for my data analysis course at uni. I managed to do the first two parts without many problems (EDA and text processing). I now need to do this:

Build a regression model that will predict the rating score of each product based on attributes which correspond to some very common words used in the reviews (selection of how many words is left to you as a decision). So, for each product you will have a long(ish) vector of attributes based on how many times each word appears in reviews of this product. Your target variable is the rating.

I find myself a bit lost on how to tackle this problem. Here is a link to the dataset I am using. Review2 is the lemmatized version of Review.

Any insight on how to solve this would be greatly appreciated!

P.S.: I'm not posting here to get a full solution... just a push in the right direction


EDIT:

This is the code I wrote for my ordinal regression (would it be possible to have some feedback?):

import pandas as pd
import matplotlib.pyplot as plt
import mord as m
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score

# Create word matrix and attach the ratings
bow = df.Review2.str.split().apply(pd.Series.value_counts)
bow = bow.join(df['Rating'])

# Remove rows without a rating and columns whose total count is below 80
bow = bow.loc[bow['Rating'].notna(), ~(bow.sum(0) < 80)]

# Divide into train - validation - test
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', axis=1)
x_train, x_test, y_train, y_test = train_test_split(bow, rating, test_size=0.4, random_state=0)

# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_test)
# OrdinalRidge predicts continuous values, so 'accuracy' is not a valid
# scoring metric here; use a regression metric instead
scores = cross_val_score(regr, bow, rating, cv=5, scoring='neg_mean_absolute_error')

# Plot: project the test set onto its first principal component
pca = PCA(n_components=1)
x_test_pca = pca.fit_transform(x_test)

plt.scatter(x_test_pca, y_test, color='black')
plt.plot(x_test_pca, y_pred, color='blue', linewidth=1)
plt.show()

This is what the plot looks like (I took the plotting approach from here):

[scatter plot of ratings against the first principal component, with the predictions overlaid]

Would it be possible to have some feedback on the code and, possibly, a better, more informative way to plot the results? (I don't really understand if the regression is performing well or not.)

Stefano Pozzi
  • Hi, you are asking SO to write code for you. While many users will be happy to help, you should show what you have done to address the problem - i.e. show us some code of how you think it could be done and make a [minimal, complete and verifiable example (MCVE)](https://stackoverflow.com/help/mcve). You have tagged this post pandas, so here is my favorite answer explaining how to make an [MCVE in pandas](https://stackoverflow.com/a/32536193/7480990). – tobsecret Jul 10 '18 at 18:03
  • @tobsecret I would provide some code if I had any. My issue is literally on how to start to tackle this problem, I don't really see how I can make a relation between `words` and `ratings`. Should I use the top 50 words that occur throughout the reviews and make a correlation to their respective average rating? I just don't really see how to start; thus me not being able to provide any code. I'm just looking for "logical help" and not so much "practical help" – Stefano Pozzi Jul 10 '18 at 18:09
  • @tobsecret, because the OP's request for "just a push in the right direction" is very different from "you are asking SO to write code for you", I do not understand your comment. would you please clarify that part of your comment? – James Phillips Jul 10 '18 at 19:13
  • The push in the right direction was edited in after I commented, thanks for the clarification. – tobsecret Jul 10 '18 at 20:29
  • The relation between `words` and `ratings` is exactly what is covered by your model. Since you are dealing with an ordinal variable (meaning that a rating of 5 is more similar to a rating of 4 than it is to a rating of 3), the modelling problem is that of ordinal regression. You've tagged the question with the `scikit-learn` tag, but it's not clear from your question if you have to somehow use the framework. In any case, it looks like https://pythonhosted.org/mord/ has an API design inspired by scikit-learn and so could be useful. Otherwise implementing it by hand is a good exercise. – fuglede Jul 10 '18 at 21:31

1 Answer


Build a regression model that will predict the rating score of each product based on attributes which correspond to some very common words used in the reviews (selection of how many words is left to you as a decision). So, for each product you will have a long(ish) vector of attributes based on how many times each word appears in reviews of this product. Your target variable is the rating.

Let's pull this apart into several pieces!

So, for each product you will have a long(ish) vector of attributes based on how many times each word appears in reviews of this product.

This is a bag-of-words model, meaning you will have to create a matrix representation (still held in a pd.DataFrame) of your Review2 column. There is a question asking how to do that here:

How to create a bag of words from a pandas dataframe

Below is a minimal example of how you can create that matrix with your Review2 column:

In [12]: import pandas as pd
In [13]: df = pd.DataFrame({"Review2":['banana apple mango', 'apple apple strawberry']})
In [14]: df
Out[14]: 
                  Review2
0      banana apple mango
1  apple apple strawberry

In [15]: df.Review2.str.split()
Out[15]: 
0        [banana, apple, mango]
1    [apple, apple, strawberry]
Name: Review2, dtype: object
In [16]: df = df.Review2.str.split().apply(pd.Series.value_counts) # this will produce the word count matrix
In [17]: df 
Out[17]: 
   apple  banana  mango  strawberry
0    1.0     1.0    1.0         NaN
1    2.0     NaN    NaN         1.0

The bag-of-words model just counts how often each word occurs in a text of interest, with no regard for position. It represents a set of texts as a matrix in which each text is a row and each column holds the counts for one word.
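In practice, you can also let scikit-learn build this matrix for you. Here is a minimal sketch using CountVectorizer on the same toy data (note: on older scikit-learn versions, get_feature_names_out is called get_feature_names):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"Review2": ['banana apple mango', 'apple apple strawberry']})

# CountVectorizer builds the document-term count matrix in one step
vec = CountVectorizer()
X = vec.fit_transform(df.Review2)  # sparse matrix, one row per review

bow = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(bow)  # same counts as above, with 0 instead of NaN for absent words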

[...] based on attributes which correspond to some very common words used in the reviews (selection of how many words is left to you as a decision).

Now that you have your matrix representation (rows are the products, columns are the counts for each unique word), you can filter the matrix down to the most common words. I would encourage you to take a look at the distribution of total word counts. We will use seaborn for that and import it like so:

import seaborn as sns

Given that your pd.DataFrame holding the word-count matrix is called df, sns.distplot(df.sum()) should do the trick. Choose some cutoff that seems to preserve a good chunk of the counts without including many words with low counts. It can be arbitrary and it doesn't really matter for now. Your word-count matrix is your input data, also called the predictor variable. In machine learning this is often called the input matrix or vector X.
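Once you have picked a cutoff from the plot, filtering down to the common words could look like this (a minimal sketch; df is the word-count matrix from above, and the cutoff of 80 is an arbitrary placeholder):

total_counts = df.sum()  # total occurrences of each word across all reviews
common_words = total_counts[total_counts >= 80].index
X = df[common_words].fillna(0)  # keep common words; NaN just means "word absent"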

Your target variable is the rating.

The output variable or target variable is the rating column. In machine learning this is often called the output vector y (note that this can sometimes also be an output matrix but most commonly one outputs a vector).

This means our model tries to adjust its parameters to map the word-count data from each row to the corresponding rating value.

Scikit-learn offers a lot of machine learning models such as logistic regression which take an X and y for training and have a very unified interface. Jake Vanderplas's Python Data Science Handbook explains the Scikit-learn interface nicely and shows you how to use it for a regression problem.

EDIT: We are using logistic regression here, but as correctly pointed out by fuglede, plain logistic regression ignores that the ratings are ordered. For this you can use mord.OrdinalRidge, the API of which works very similarly to that of scikit-learn.
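A minimal sketch of what fitting such a model could look like (assuming x_train, y_train and x_test come from the split described below; mord's estimators follow the scikit-learn fit/predict interface):

import mord

# OrdinalRidge treats the ratings as ordered rather than as unrelated classes
model = mord.OrdinalRidge()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)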

Before you train your model, you should split your data set into a training, a test and a validation set - a good ratio is probably 60:20:20. This way you will be able to train your model on your training set and evaluate how well it is predicting your test data set, which helps you tune your model parameters. This way you will know when your model is overfitting to your training data and when it is genuinely producing a good general model for this task. The problem with this approach is that you can still overfit to your test set if you adjust model parameters often enough.

This is why we have a validation set - it makes sure we are not accidentally also overfitting our model's parameters to both our training and test set without knowing it. Typically we only test on the validation set once, so as to avoid overfitting to it as well - it is used only in the final model evaluation step.

Scikit-learn has a function for that, too: train_test_split

train_test_split, however, only makes one split, so you would first split your data set 60:40 and then split the remaining 40 half-and-half into test and validation sets:
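A minimal sketch of that double split (assuming X is the word-count matrix and y the ratings):

from sklearn.model_selection import train_test_split

# First split off 60% for training...
x_train, x_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
# ...then split the remaining 40% half-and-half into test and validation sets
x_test, x_val, y_test, y_val = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)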

You can now train different models on your training data and test them using your model's predict function on your test set. Once you think you have done a good job and your model is good enough, you test it on your validation set.
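For that final check, something like the following could work (a sketch, assuming the splits and model from above; mean absolute error is one reasonable metric for ratings):

from sklearn.metrics import mean_absolute_error

# You may evaluate on the test set repeatedly while tuning...
print("test MAE:", mean_absolute_error(y_test, model.predict(x_test)))
# ...but touch the validation set only once, for the final evaluation
print("validation MAE:", mean_absolute_error(y_val, model.predict(x_val)))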

tobsecret
  • Ordinary logistic regression (a la `sklearn.linear_model.LogisticRegression`) would entirely ignore the ordinal nature of the ratings. This can be captured by the proportional odds model (also known as ordered logistic regression) -- see https://en.wikipedia.org/wiki/Ordinal_regression; I don't believe scikit-learn offers this out of the box, but it's easy to find Python implementations, cf. e.g. https://stackoverflow.com/questions/28035216/ordered-logit-in-python – fuglede Jul 10 '18 at 21:16
  • Appreciate a lot your insight on the problem. Will get started on it tomorrow morning! Thx again! – Stefano Pozzi Jul 10 '18 at 21:19
  • @tobsecret Hi, when I try to do `sns.distplot(df['BOW'].sum())` where `BOW` is my column with the bag of words I get this error: `unsupported operand type(s) for /: 'Counter' and 'int'` would you have any insight? – Stefano Pozzi Jul 12 '18 at 07:57
  • As I wrote in my answer, in df the columns are the words and the rows are the individual reviews and each cell holds the counts for that word in that review. When you call `df.sum()` it will give you a `pd.Series` with summed counts of each word. The answer I linked to for how to create the bag of words model is the problem - you probably used the option 1 from that answer 1:1 where they use Counter. That answer was just to give you a general idea for how to transform a bunch of words into word counts. Let me add a minimal example to my answer that makes it a bit more clear what I mean. – tobsecret Jul 12 '18 at 14:30
  • Added edits, please take a look - hope it is more clear now. – tobsecret Jul 12 '18 at 14:41
  • @tobsecret Hi, I think I finished my regression model. Would it be possible to have some feedback on it? (added code in the edit) – Stefano Pozzi Jul 14 '18 at 11:50