Pandas GroupBy Two Text Columns And Return The Max Rows Based On Counts

Question

I'm trying to figure out the max (First_Word, Group) pairs

import pandas as pd

df = pd.DataFrame({'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
           'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
           'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
                'apple fell out of the tree', 'partrige in a pear tree']},
          columns=['First_Word', 'Group', 'Text'])

  First_Word         Group                        Text
0      apple    apple bins     where to buy apple bins
1      apple   apple trees         i see an apple tree
2     orange  orange juice         i like orange juice
3      apple   apple trees  apple fell out of the tree
4       pear     pear tree     partrige in a pear tree

Then I do a groupby:

grouped = df.groupby(['First_Word', 'Group']).count()
                         Text
First_Word Group             
apple      apple bins       1
           apple trees      2
orange     orange juice     1
pear       pear tree        1

And I now want to filter it down to only unique index rows that have the max Text counts. Below you'll notice apple bins was removed because apple trees has the max value.

                         Text
First_Word Group             
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1

This max value of group question is similar but when I try something like this:

df.groupby(["First_Word", "Group"]).count().apply(lambda t: t[t['Text']==t['Text'].max()])

I get an error: KeyError: ('Text', 'occurred at index Text'). If I add axis=1 to the apply I get IndexError: ('index out of bounds', 'occurred at index (apple, apple bins)')

unutbu · Accepted Answer · 2016-06-09T23:18:25.220

Given grouped, you now want to group by the First Word index level, and find the index labels of the maximum row for each group (using idxmax):

In [39]: grouped.groupby(level='First_Word')['Text'].idxmax()
Out[39]: 
First_Word
apple       (apple, apple trees)
orange    (orange, orange juice)
pear           (pear, pear tree)
Name: Text, dtype: object

You can then use grouped.loc to select rows from grouped by index label:

import pandas as pd
df = pd.DataFrame(
    {'First_Word': ['apple', 'apple', 'orange', 'apple', 'pear'],
     'Group': ['apple bins', 'apple trees', 'orange juice', 'apple trees', 'pear tree'],
     'Text': ['where to buy apple bins', 'i see an apple tree', 'i like orange juice',
              'apple fell out of the tree', 'partrige in a pear tree']},
    columns=['First_Word', 'Group', 'Text'])

grouped = df.groupby(['First_Word', 'Group']).count()
result = grouped.loc[grouped.groupby(level='First_Word')['Text'].idxmax()]
print(result)

yields

                         Text
First_Word Group             
apple      apple trees      2
orange     orange juice     1
pear       pear tree        1

Pandas GroupBy Two Text Columns And Return The Max Rows Based On Counts

1 Answers1