I was given this tip in this SO question I asked:
Now that you have your matrix representation (rows are the products, columns are the counts for each unique word), you can filter the matrix down to the most common words. I would encourage you to take a look at how the distribution of word counts looks. We will use seaborn for that and import it like so:
import seaborn as sns
Given that your pd.DataFrame holding the word-count matrix is called df,
sns.distplot(df.sum())
should do the trick. Choose some cutoff that seems to preserve a good chunk of the counts but doesn't include many words with low counts. It can be arbitrary and it doesn't really matter for now. Your word-count matrix is your input data, also called the predictor variable. In machine learning this is often called the input matrix or vector X.
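For context, this is how I understand the suggested workflow. It is only a sketch of my interpretation: in the tip, df is the word-count matrix itself, and the use of sklearn's CountVectorizer plus the variable names below are my own assumptions, not part of the original advice.

import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer

# Word-count matrix: one row per review, one column per unique word (my assumption)
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(df['Review2'])
word_counts = pd.DataFrame(counts.toarray(), columns=vectorizer.get_feature_names_out(), index=df.index)

# Distribution of total counts per word, as suggested in the tip
sns.distplot(word_counts.sum())

(In newer seaborn versions distplot is deprecated, so sns.histplot(word_counts.sum()) should be the equivalent call.)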
I managed to build the bag of words (BOW) for every review in the Review2 column. The code is as follows:
from collections import Counter

df['BOW'] = df.Review2.str.split().apply(Counter)  # one Counter of word frequencies per review
But when I try to visualize it as suggested, with sns.distplot(df['BOW'].sum()), I get the following error:
unsupported operand type(s) for /: 'Counter' and 'int'
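For completeness, here is a minimal example that reproduces the error for me; the toy Review2 values are made up:

import pandas as pd
import seaborn as sns
from collections import Counter

df = pd.DataFrame({'Review2': ['good phone good battery', 'bad screen']})
df['BOW'] = df.Review2.str.split().apply(Counter)

# The BOW column holds Counter objects, so .sum() merges them into one combined Counter instead of returning numbers
print(df['BOW'].sum())

sns.distplot(df['BOW'].sum())  # fails with: unsupported operand type(s) for /: 'Counter' and 'int'

My guess is that the problem is that df['BOW'] holds Counter objects rather than a numeric matrix, but I'm not sure how to get from here to the word-count matrix the tip describes.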
Thx for reading the post and have a good day :)