I have two dataframes of customer review data.
My first dataframe, 'df', contains thousands of raw customer reviews, the processed/cleaned review text, and sentiment scores:
reviewBody                     reviewClean            sentimentScore
'I like these goggles'         'like goggles'          1
'I don't like these goggles'   'don't like goggles'   -1
'My strap broke'               'strap broke'          -1
...                            ...                    ...
My second dataframe, 'bigrams', contains the most frequent bigrams in the 'reviewClean' field of my first dataframe:
topBigrams       frequency
'like goggles'   150
'strap broke'    100
...              ...
My goal is to take each of my topBigrams, e.g. 'like goggles' or 'strap broke', look up every 'reviewClean' that contains that bigram along with the sentimentScore for that entire review, and calculate an average sentiment score for each topBigram.
My end result would look something like this (numbers for pure illustration):
topBigrams       frequency   avgSentiment
'like goggles'   150          .98
'strap broke'    100         -.90
...              ...          ...
From this data, I would then look for trends on each bigram to determine the drivers of positive or negative sentiment in a more succinct way.
I am not even sure where to begin. Many thanks for any insight into a potential approach here.
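For reference, here is a minimal sketch of the lookup I have in mind, built only from the example rows above. It assumes a plain literal substring match is an acceptable definition of "contains", and the helper name avg_sentiment is hypothetical; it is meant as a possible starting point, not a settled approach.

```python
import pandas as pd

# Toy versions of the two dataframes, using only the example rows above.
df = pd.DataFrame({
    "reviewBody": ["I like these goggles", "I don't like these goggles", "My strap broke"],
    "reviewClean": ["like goggles", "don't like goggles", "strap broke"],
    "sentimentScore": [1, -1, -1],
})
bigrams = pd.DataFrame({
    "topBigrams": ["like goggles", "strap broke"],
    "frequency": [150, 100],
})

def avg_sentiment(bigram: str) -> float:
    """Mean sentimentScore over reviews whose reviewClean contains the bigram.

    Hypothetical helper. Assumes a literal substring match (regex=False);
    missing reviewClean values are treated as non-matches (na=False).
    """
    mask = df["reviewClean"].str.contains(bigram, regex=False, na=False)
    return df.loc[mask, "sentimentScore"].mean()

# One pass over reviewClean per bigram; adds the new column in place.
bigrams["avgSentiment"] = bigrams["topBigrams"].apply(avg_sentiment)
print(bigrams)
```

Note that with a substring match, 'like goggles' also matches 'don't like goggles', so negated reviews pull that bigram's average toward zero. Matching on the tokenized bigrams of each review instead would probably behave differently.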