How to calculate most frequently occurring words in pandas dataframe column by year?

Question

I have a pandas dataframe that contains a column 'reviews' and a column 'year'. I would like to view the top 100 most frequently occurring words in the reviews column, but filtered by year. So, I want to know the top 100 from 2002, 2003, 2004, and so on, all the way up to 2017.

import pandas as pd
from nltk.corpus import stopwords

df=pd.read_csv('./reviews.csv')

stop = stopwords.words('english')

commonwords = pd.Series(' '.join(df['reviews']).lower().split()).value_counts()[:100]

print(commonwords)

df.to_csv('commonwords.csv', index=False)

The above code works but it only gives me the top 100 most frequently occurring words across all years.

score 2 · Accepted Answer · answered Jul 19 '19 at 11:51

You can use:

df = pd.DataFrame({'reviews':['He writer in me great great me',
                        'great ambience the coffee was great',
                        'great coffee'],
                   'year':[2002,2004,2004]})
print (df)

                               reviews  year
0       He writer in me great great me  2002
1  great ambience the coffee was great  2004
2                         great coffee  2004

#change for 100 for top100 in real data
N = 3
df1 =  (df.set_index('year')['reviews']
          .str.lower()
          .str.split(expand=True)
          .stack()
          .groupby(level=0)
          .value_counts()
          .groupby(level=0)
          .head(N)
          .rename_axis(('year','words'))
          .reset_index(name='count'))

print (df1)
   year     words  count
0  2002     great      2
1  2002        me      2
2  2002        he      1
3  2004     great      3
4  2004    coffee      2
5  2004  ambience      1

Explanation:

Convert values to lowercase by Series.str.lower with Series.str.split for DataFrame
Reshape by DataFrame.stack for MultiIndex Series
count values per groups with SeriesGroupBy.value_counts, values are sorted
get top N values by GroupBy.head
Data cleaning - DataFrame.rename_axis with DataFrame.reset_index

score 1 · Answer 2 · answered Jul 19 '19 at 06:15

Before creating the commonwords data-frame, you can create another dataframe using groupby operation like so df.groupby(['year', 'reviews']). Then use the reset_index operation this so that you can use it for filtering the top 100.

Other than resetting index, you can also refer to the answers in this question for further idea's.

How to calculate most frequently occurring words in pandas dataframe column by year?

2 Answers2