Based on your sample data, you can try:
(df.groupby(['A', 'B'], as_index=False)['C'].sum()
.groupby('A')['C'].nlargest(2)
.droplevel(1)
)
Data Input:
A B C
0 Alabama a 100
1 Alabama b 50
2 Alabama c 40
3 Alabama d 5
4 Alabama e 1
5 Wyoming a.51 180
6 Wyoming b.51 150
7 Wyoming c.51 56
8 Wyoming d.51 5
Output:
A
Alabama 100
Alabama 50
Wyoming 180
Wyoming 150
Name: C, dtype: int64
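For reference, here is a minimal, self-contained sketch that reproduces the result above; the DataFrame construction is just the sample data from the table re-typed, so adjust it to your real data:

import pandas as pd

# Sample data re-typed from the table above
df = pd.DataFrame({
    'A': ['Alabama'] * 5 + ['Wyoming'] * 4,
    'B': ['a', 'b', 'c', 'd', 'e', 'a.51', 'b.51', 'c.51', 'd.51'],
    'C': [100, 50, 40, 5, 1, 180, 150, 56, 5],
})

# Sum C per (A, B), then keep the 2 largest sums within each A
result = (df.groupby(['A', 'B'], as_index=False)['C'].sum()
            .groupby('A')['C'].nlargest(2)
            .droplevel(1))
print(result)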
Extended Test Cases
Let's try with more data to show that the first groupby() sums the duplicate (A, B) rows correctly, and that taking the two largest sums per A still works afterwards (see the step-by-step sketch after the output below):
Data Input:
A B C
0 Alabama a 100
1 Alabama b 50
2 Alabama b 250
3 Alabama c 40
4 Alabama d 5
5 Alabama d 355
6 Alabama e 1
7 Wyoming a.51 180
8 Wyoming b.51 150
9 Wyoming c.51 56
10 Wyoming c.51 556
11 Wyoming d.51 5
12 Wyoming d.51 820
Output:
A
Alabama 360
Alabama 300
Wyoming 825
Wyoming 612
Name: C, dtype: int64
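To make the two steps explicit, here is a small sketch using the extended data above (again re-typed from the table) that prints the intermediate per-(A, B) sums before the top-2 selection:

import pandas as pd

# Extended sample data re-typed from the table above
df = pd.DataFrame({
    'A': ['Alabama'] * 7 + ['Wyoming'] * 6,
    'B': ['a', 'b', 'b', 'c', 'd', 'd', 'e',
          'a.51', 'b.51', 'c.51', 'c.51', 'd.51', 'd.51'],
    'C': [100, 50, 250, 40, 5, 355, 1, 180, 150, 56, 556, 5, 820],
})

# Step 1: collapse duplicate (A, B) rows into sums,
# e.g. Alabama/b -> 50 + 250 = 300, Wyoming/d.51 -> 5 + 820 = 825
sums = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print(sums)

# Step 2: keep the 2 largest sums within each A
print(sums.groupby('A')['C'].nlargest(2).droplevel(1))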
Edit
If you want to keep all columns in the output, you can use:
(df.groupby(['A','B'], as_index=False)['C'].sum()
.groupby(['A']).apply(lambda x: x.nlargest(2,'C'))
.reset_index(drop=True)
)
Data Input:
A B C
0 Alabama a 100
1 Alabama b 50
2 Alabama b 250
3 Alabama c 40
4 Alabama d 5
5 Alabama d 355
6 Alabama e 1
7 Wyoming a.51 180
8 Wyoming b.51 150
9 Wyoming c.51 56
10 Wyoming c.51 556
11 Wyoming d.51 5
12 Wyoming d.51 820
Output:
A B C
0 Alabama d 360
1 Alabama b 300
2 Wyoming d.51 825
3 Wyoming c.51 612
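If you would rather avoid the per-group lambda, a common alternative (just a sketch, not part of the original answer) is to sort by C within each A and take the first two rows of each group with head(2); on the extended data it produces the same four rows as the apply version:

import pandas as pd

# Extended sample data re-typed from the table above
df = pd.DataFrame({
    'A': ['Alabama'] * 7 + ['Wyoming'] * 6,
    'B': ['a', 'b', 'b', 'c', 'd', 'd', 'e',
          'a.51', 'b.51', 'c.51', 'c.51', 'd.51', 'd.51'],
    'C': [100, 50, 250, 40, 5, 355, 1, 180, 150, 56, 556, 5, 820],
})

# Sum per (A, B), sort so the largest C comes first within each A,
# then keep the first 2 rows of every A group
out = (df.groupby(['A', 'B'], as_index=False)['C'].sum()
         .sort_values(['A', 'C'], ascending=[True, False])
         .groupby('A').head(2)
         .reset_index(drop=True))
print(out)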