My data is stored in a pandas dataframe; please see the reproducible example below. The real dataframe will have more than 10k rows and many more words / phrases per row.
I'd like to count the number of times each two-word phrase appears in column ReviewContent. If this were a text file and not the column of a dataframe, I would use NLTK's Collocations module (something along the lines of answers here or here). My question is: how can I transform column ReviewContent into a single corpus text?
import numpy as np
import pandas as pd
data = {'ReviewContent' : ['Great food',
'Low prices but above average food',
'Staff was the worst',
'Great location and great food',
'Really low prices',
'The daily menu is usually great',
'I waited a long time to be served, but it was worth it. Great food']}
df = pd.DataFrame(data)
Expected output:
[(('great', 'food'), 3), (('low', 'prices'), 2), ...]
or
[('great food', 3), ('low prices', 2), ...]
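To make the expected output concrete, here is a minimal sketch of one way the counts could be produced, using only the standard library rather than NLTK. The helper name count_bigrams and the tokenization rule (lowercase, letters and apostrophes only) are my own assumptions, not part of any library. It counts pairs within each review separately, so a phrase never spans two reviews, although it does span sentence boundaries inside a single review.

```python
import re
from collections import Counter


def count_bigrams(texts):
    """Count adjacent word pairs over an iterable of strings (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: lowercase, keep runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        # Pair each word with the next one; pairs never cross into another review.
        counts.update(zip(words, words[1:]))
    return counts


# Same reviews as in the example dataframe above.
reviews = [
    'Great food',
    'Low prices but above average food',
    'Staff was the worst',
    'Great location and great food',
    'Really low prices',
    'The daily menu is usually great',
    'I waited a long time to be served, but it was worth it. Great food',
]

# A pandas Series is iterable, so count_bigrams(df['ReviewContent']) should work too.
bigram_counts = count_bigrams(reviews)
top_two = bigram_counts.most_common(2)
print(top_two)
```

Joining the column into one string, for example with ' '.join(df['ReviewContent']), would give a single corpus text, but then the last word of one review and the first word of the next would also be counted as a phrase.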