Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating.
Let's pull this apart into several pieces!
So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product.
This is a bag-of-words model, meaning you will have to create a matrix representation (still held in a pd.DataFrame) of your words column or your Review2 column; there is a question asking how to do that here:
How to create a bag of words from a pandas dataframe
Below is a minimal example of how you can create that matrix with your Review2 column:
In [12]: import pandas as pd
In [13]: df = pd.DataFrame({"Review2":['banana apple mango', 'apple apple strawberry']})
In [14]: df
Out[14]:
                   Review2
0      banana apple mango
1  apple apple strawberry
In [15]: df.Review2.str.split()
Out[15]:
0        [banana, apple, mango]
1    [apple, apple, strawberry]
Name: Review2, dtype: object
In [16]: df = df.Review2.str.split().apply(pd.Series.value_counts) # this will produce the word count matrix
In [17]: df
Out[17]:
   apple  banana  mango  strawberry
0    1.0     1.0    1.0         NaN
1    2.0     NaN    NaN         1.0
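One detail worth adding to the snippet above: pd.Series.value_counts leaves NaN wherever a word does not occur in a review, and most models expect plain numbers, so you would normally replace those with zeros before going any further:
In [18]: df = df.fillna(0)  # a word that never appears in a review simply gets count 0
In [19]: df
Out[19]:
   apple  banana  mango  strawberry
0    1.0     1.0    1.0         0.0
1    2.0     0.0    0.0         1.0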
The bag-of-words model simply counts how often each word occurs in a text of interest, with no regard for position, and represents a set of texts as a matrix in which every text gets a row and every column holds the counts of one word.
[...] based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision).
Now that you have your matrix representation (rows are the products, columns are the counts for each unique word), you can filter the matrix down to the most common words. I would encourage you to take a look at how the distribution of word counts looks. We will use seaborn for that and import it like so:
import seaborn as sns
Given that your pd.DataFrame holding the word-count matrix is called df, sns.distplot(df.sum())
should do the trick (df.sum() adds up the counts per column, i.e. per word). Choose some cutoff that preserves a good chunk of the counts but doesn't include many words with low counts. It can be arbitrary and it doesn't really matter much for now.
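As a minimal sketch of that filtering step (the cutoff value and the variable names below are only placeholders; pick whatever fits your data):
import seaborn as sns

word_counts = df.sum()      # total count of each word across all products
sns.distplot(word_counts)   # newer seaborn versions would use sns.histplot instead

cutoff = 50                 # arbitrary threshold, chosen after looking at the plot
common_words = word_counts[word_counts >= cutoff].index
X = df[common_words]        # keep only the columns of the common words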
Your word count matrix is your input data, also called the predictor variables. In machine learning this is often called the input matrix or vector X.
Your target variable is the rating.
The output variable or target variable is the rating column. In machine learning this is often called the output vector y (note that this can sometimes also be an output matrix, but most commonly one outputs a vector).
This means our model tries to adjust its parameters to map the word count data from each row to the corresponding rating value.
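In code this boils down to two assignments; assume here that your product-level DataFrame is called products and that its rating column is called Rating (both names are assumptions about your data):
X = df[common_words]      # the (filtered) word-count matrix from above, one row per product
y = products['Rating']    # one rating per product, in the same row order as X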
Scikit-learn offers a lot of machine learning models, such as logistic regression, which take an X and y for training and have a very unified interface. Jake VanderPlas's Python Data Science Handbook explains the Scikit-learn interface nicely and shows you how to use it for a regression problem.
EDIT: We are using logistic regression here, but as correctly pointed out by fuglede, logistic regression ignores that ratings lie on an ordered scale. For this you can use mord.OrdinalRidge, the API of which works very similarly to that of scikit-learn.
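To illustrate that unified interface, here is a rough sketch (mord is a separate package you would install with pip install mord; hyperparameters are left at their defaults):
from sklearn.linear_model import LogisticRegression
import mord

model = LogisticRegression()   # treats each rating as an unordered class
# model = mord.OrdinalRidge()  # drop-in alternative that respects the rating order

model.fit(X, y)                        # in practice, fit on the training split described below
predicted_ratings = model.predict(X)   # predicted rating for every row of X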
Before you train your model, you should split your data set into a training, a test and a validation set - a good ratio is probably 60:20:20.
This way you will be able to train your model on your training set and evaluate how well it predicts your test set, which helps you tune your model parameters. It also tells you when your model is overfitting to your training data and when it is genuinely producing a good general model for this task. The problem with this approach is that you can still end up overfitting to your test data if you adjust your model parameters against it often enough.
This is why we have a validation set: it makes sure we are not accidentally overfitting our model's parameters to both our training and test sets without knowing it. We typically test on the validation set only once, so as to avoid overfitting to it as well; it is only used in the final model evaluation step.
Scikit-learn has a function for that, too: train_test_split
train_test_split, however, only makes one split, so you would first split your data set 60:40 and then split the remaining 40 into a test and a validation set at 50:50.
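A sketch of that two-step split (the proportions follow the 60:20:20 ratio suggested above; random_state just makes the split reproducible):
from sklearn.model_selection import train_test_split

# first split off the 60% training portion
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)

# then split the remaining 40% evenly into test and validation sets
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)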
You can now train different models on your training data and evaluate them on your test set using your model's predict function. Once you think you have done a good job and your model is good enough, you then test it on your validation set.
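Putting those last steps together as a rough sketch, with mean squared error as one possible way to compare predicted and true ratings (any regression metric you prefer would do just as well):
import mord
from sklearn.metrics import mean_squared_error

model = mord.OrdinalRidge()
model.fit(X_train, y_train)   # train on the training set only

# evaluate and tune against the test set
test_error = mean_squared_error(y_test, model.predict(X_test))
print("test MSE:", test_error)

# once you are happy with the model, evaluate it a single time on the validation set
val_error = mean_squared_error(y_val, model.predict(X_val))
print("validation MSE:", val_error)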