
I have two dataframes of customer reviews data.

My first dataframe, 'df', contains thousands of raw customer reviews, the processed/cleaned review text, and sentiment scores:

reviewBody                   reviewClean           sentimentScore
'I like these goggles'       'like goggles'        1
'I don't like these goggles' 'don't like goggles'  -1
'My strap broke'             'strap broke'         -1
 ...                         ...                  ...

My second dataframe, 'bigrams', contains the most frequent bigrams from the 'reviewClean' field of my first dataframe:

topBigrams                 frequency
'like goggles'               150 
'strap broke'                100
  ...                        ...          

My goal is to take each of my topBigrams, e.g. 'like goggles' or 'strap broke', look up every 'reviewClean' that contains that bigram along with the sentiment score of the entire review, and calculate an average sentiment score for each topBigram.

My end result would look something like this (numbers for pure illustration):

topBigrams                 frequency   avgSentiment
'like goggles'             150         .98
'strap broke'              100         -.90
 ...                        ...         ...

From this data, I would then look for trends on each bigram to determine the drivers of positive or negative sentiment in a more succinct way.

I am not even sure where to begin. Many thanks for any insight into a potential approach here.

  • Did you derive the `bigrams` dataframe from `df`? If so, you can just do `df.groupby('reviewClean').mean()`. – najeem Dec 24 '18 at 02:32
  • And if you want both the frequency and mean sentiment, you can just do `df.groupby('reviewClean').agg(['mean', 'count']).sort_values(('sentimentScore', 'count'), ascending=False)` – najeem Dec 24 '18 at 02:50

1 Answer


You are going to have to do a cross join (see this post) in order to check whether every review contains every bigram. There is no getting around using apply, since you need to do a row-wise string comparison.

import pandas as pd

df = pd.DataFrame([['I like these goggles', 'like goggles', 1],
                   ["I don't like these goggles", "don't like goggles", -1],
                   ['My strap broke', 'strap broke', -1]],
                  columns=['reviewBody', 'reviewClean', 'sentimentScore'])

bigrams = pd.DataFrame([['like goggles', 150],
                        ['strap broke', 100]],
                       columns=['topBigrams', 'frequency'])

# cross join via a dummy key: pairs every bigram with every review
dfx = bigrams.assign(key=1).merge(df.assign(key=1), on='key').drop(columns='key')
# row-wise substring check: does this review contain this bigram?
dfx['has_bigram'] = dfx.apply(lambda x: x['topBigrams'] in x['reviewClean'], axis=1)
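As an aside, on pandas 1.2+ the dummy-key trick can be replaced with an explicit cross join via `merge(how='cross')`. A minimal sketch, assuming the same `df` and `bigrams` frames as above:

```python
import pandas as pd

df = pd.DataFrame([['I like these goggles', 'like goggles', 1],
                   ["I don't like these goggles", "don't like goggles", -1],
                   ['My strap broke', 'strap broke', -1]],
                  columns=['reviewBody', 'reviewClean', 'sentimentScore'])

bigrams = pd.DataFrame([['like goggles', 150],
                        ['strap broke', 100]],
                       columns=['topBigrams', 'frequency'])

# explicit cross join: 2 bigrams x 3 reviews = 6 rows, no dummy key needed
dfx = bigrams.merge(df, how='cross')
dfx['has_bigram'] = dfx.apply(lambda x: x['topBigrams'] in x['reviewClean'], axis=1)
```

The result is identical to the key-based merge; `how='cross'` just states the intent directly.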

After checking for the bigram in each cleaned review, you can use a groupby to calculate the mean sentiment for each bigram, only where the bigram exists. Then merge it back to the bigrams data frame.

bigrams.merge(dfx.groupby(['topBigrams', 'has_bigram'])['sentimentScore']
                 .mean()
                 .reset_index()
                 .query('has_bigram')
                 .rename(columns={'sentimentScore': 'avgSentiment'})
                 .get(['topBigrams', 'avgSentiment']),
              on='topBigrams')

# returns:
     topBigrams  frequency  avgSentiment
0  like goggles        150           0.0
1   strap broke        100          -1.0
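If speed becomes an issue on thousands of reviews, the row-wise `apply` can be swapped for a plain list comprehension over `zip` (the same substring test, typically faster). A sketch of the full pipeline under that substitution, using the same sample frames:

```python
import pandas as pd

df = pd.DataFrame([['I like these goggles', 'like goggles', 1],
                   ["I don't like these goggles", "don't like goggles", -1],
                   ['My strap broke', 'strap broke', -1]],
                  columns=['reviewBody', 'reviewClean', 'sentimentScore'])

bigrams = pd.DataFrame([['like goggles', 150],
                        ['strap broke', 100]],
                       columns=['topBigrams', 'frequency'])

# cross join via a dummy key, as in the answer above
dfx = bigrams.assign(key=1).merge(df.assign(key=1), on='key').drop(columns='key')
# list comprehension instead of apply for the row-wise substring check
dfx['has_bigram'] = [b in r for b, r in zip(dfx['topBigrams'], dfx['reviewClean'])]

# keep only matching rows, average sentiment per bigram, merge back
result = bigrams.merge(
    dfx[dfx['has_bigram']]
       .groupby('topBigrams', as_index=False)['sentimentScore'].mean()
       .rename(columns={'sentimentScore': 'avgSentiment'}),
    on='topBigrams')
# avgSentiment: 0.0 for 'like goggles', -1.0 for 'strap broke'
```

Note the 0.0 for 'like goggles' comes from the substring match also catching "don't like goggles", so the two sentiment scores cancel out.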
James
  • Amazing James! Thank you so much for your insights here. After deconstructing the logic, I learned so much about how to navigate the problem and can apply to future, related problems. – Sandy McAllister Dec 24 '18 at 13:43