
I want to group my dataframe by two columns and then sort the aggregated results within those groups.

In [167]: df

Out[167]:
   count     job source
0      2   sales      A
1      4   sales      B
2      6   sales      C
3      3   sales      D
4      7   sales      E
5      5  market      A
6      3  market      B
7      2  market      C
8      4  market      D
9      1  market      E


In [168]: df.groupby(['job','source']).agg({'count':sum})

Out[168]:
               count
job    source       
market A           5
       B           3
       C           2
       D           4
       E           1
sales  A           2
       B           4
       C           6
       D           3
       E           7

I would now like to sort the 'count' column in descending order within each of the groups, and then take only the top three rows. To get something like:

                count
job     source
market  A           5
        D           4
        B           3
sales   E           7
        C           6
        B           4
  • The reason this is tricky in pandas is when you `groupby` more than one group, the intermediate (grouper) object gets a multiindex containing those groups, and the original index is dropped. Unless you override the default `groupby(... as_index=True)` – smci Jun 16 '22 at 00:39
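A quick illustration of that comment, assuming the question's frame: with `as_index=False`, the group keys stay as ordinary columns instead of becoming a MultiIndex, so the aggregated result can be sorted like any flat frame.

```python
import pandas as pd

# Rebuild the question's frame (column order here follows insertion order).
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# With as_index=False the group keys remain columns, not index levels.
flat = df.groupby(["job", "source"], as_index=False)["count"].sum()
print(flat)
```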

9 Answers


You could also just do it in one go, by doing the sort first and using head to take the first 3 of each group.

In[34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)

Out[34]: 
   count     job source
4      7   sales      E
2      6   sales      C
1      4   sales      B
5      5  market      A
8      4  market      D
6      3  market      B
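A self-contained check of this one-liner (the frame construction is assumed to match the question; modern pandas shows columns in insertion order):

```python
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Sort first, then take the first 3 rows of each group.
# groupby preserves row order within each group, so the sort survives.
top3 = df.sort_values(["job", "count"], ascending=False).groupby("job").head(3)
print(top3)
```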
tvashtar
  • Does `groupby` guarantee that the order is preserved? – toto_tico May 10 '17 at 08:17
  • It seems it does; from [the documentation of groupby](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html): __groupby preserves the order of rows within each group__ – toto_tico May 10 '17 at 08:22
  • toto_tico, that is correct, but care is needed in interpreting that statement. The order of rows *within a single group* is preserved; however, `groupby` has `sort=True` by default, which means the groups themselves may be sorted on the key. In other words, if my dataframe has keys (on input) 3 2 2 1, the groupby object will show the 3 groups in the order 1 2 3 (sorted). Use `sort=False` to make sure both group order and row order are preserved. – brian_ds Oct 09 '18 at 16:19
  • Does `head(3)` give more than 3 results? – Nabin Aug 03 '19 at 11:49
  • @Nabin Applied on a `groupby`, it seems to provide the first 3 rows for each group. – Skippy le Grand Gourou Jan 28 '21 at 15:46
  • I don't understand why this got the most votes, since it does not take care of the sum() of 'count'. If one adds an extra row with the values ('sales', 'A', 6), one can see that this solution does not add the 2 + 6 of ('sales', 'A'), which is 8 and should be the first line of the result. – Zvi May 05 '21 at 14:54
  • Would this work if I just sorted by `"count"`? Especially if I didn't care about the order of `"job"` in the output? – sachinruk Dec 23 '22 at 04:52

What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.

Starting from the result of the first groupby:

In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})

We group by the first level of the index:

In [63]: g = df_agg['count'].groupby('job', group_keys=False)

Then we want to sort each group and take the first three elements:

In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))

There is, however, a shortcut function for this: `nlargest`:

In [65]: g.nlargest(3)
Out[65]:
job     source
market  A         5
        D         4
        B         3
sales   E         7
        C         6
        B         4
dtype: int64

So in one go, this looks like:

df_agg['count'].groupby('job', group_keys=False).nlargest(3)
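For reference, a self-contained sketch of the whole chain, with the frame construction assumed to match the question's data:

```python
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Aggregate per (job, source), then take the 3 largest sums per job in one chain.
df_agg = df.groupby(["job", "source"]).agg({"count": "sum"})
res = df_agg["count"].groupby("job", group_keys=False).nlargest(3)
print(res)
```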
joris
  • Would there be a way to sum up everything that isn't contained in the top three results per group and add them to a source group called "other" for each job? – JoeDanger Jan 11 '15 at 20:15
  • Thanks for the great answer. For a further step, would there be a way to assign the sorting order based on values in the groupby column? For instance, sort ascending if the value is 'Buy' and sort descending if the value is 'Sell'. – Bowen Liu Jul 07 '20 at 20:05
  • It might be easier to just [use](https://stackoverflow.com/questions/10373660/converting-a-pandas-groupby-output-from-series-to-dataframe#comment114163633_10374456) `as_index=False` to create a normal data frame and then sort as normal. – young_souvlaki Oct 28 '20 at 00:33
  • @young_souvlaki you still need a groupby operation to take only the first 3 per group, that's not possible with a normal sort – joris Oct 28 '20 at 07:08
  • @joris `as_index` is a `groupby` parameter. Are we on the same page? – young_souvlaki Oct 30 '20 at 22:56
  • For sure, but using `as_index` or not doesn't change that you need a second groupby, or doesn't change that using `nlargest` (without manually sorting) is IMO the nicest solution – joris Oct 31 '20 at 08:48
  • Okay, thanks for clarifying I see your point! Would `nlargest()` work on a normal data frame returned with use of `as_index=False`? – young_souvlaki Nov 02 '20 at 01:44

Here's another example of taking the top 3 in sorted order, and of sorting within the groups:

In [43]: import pandas as pd                                                                                                                                                       

In [44]:  df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})

In [45]: df                                                                                                                                                                        
Out[45]: 
   count_1  count_2  name
0        5      100   Foo
1       10      150   Foo
2       12      100  Baar
3       15       25   Foo
4       20      250  Baar
5       25      300   Foo
6       30      400  Baar
7       35      500  Baar


### Top 3 on sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)                                                                                                                               
Out[46]: 
name   
Baar  7    35
      6    30
      4    20
Foo   5    25
      3    15
      1    10
dtype: int64


### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]: 
   count_1  count_2  name
0       35      500  Baar
1       30      400  Baar
2       20      250  Baar
3       12      100  Baar
4       25      300   Foo
5       15       25   Foo
6       10      150   Foo
7        5      100   Foo
Surya

Try this instead; it is a simple way to do a groupby and sort in descending order:

df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)
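`companyName` and `overallRating` come from the answerer's own data. Applied to the question's frame, the same pattern looks like the sketch below; note that it sorts the aggregated sums globally, not within each job group.

```python
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Sum per (job, source), then one global sort of all rows by the summed count.
out = df.groupby(["job", "source"])["count"].sum().sort_values(ascending=False).head(3)
print(out)
```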
sscswapnil

If you don't need to sum a column, then use @tvashtar's answer. If you do need to sum, then you can use @joris' answer or this one which is very similar to it.

df.groupby(['job']).apply(lambda x: (x.groupby('source')
                                      .sum()
                                      .sort_values('count', ascending=False))
                                     .head(3))
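Because this variant aggregates before sorting, duplicate (job, source) pairs are summed correctly. A minimal runnable sketch, with an extra ('sales', 'A', 6) row added here for illustration and the `count` column selected explicitly to keep the sum numeric:

```python
import pandas as pd

# Question's frame plus one duplicate ('sales', 'A') row with count 6.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1, 6],
    "job": ["sales"] * 5 + ["market"] * 5 + ["sales"],
    "source": list("ABCDE") * 2 + ["A"],
})

# Sum within each (job, source) first, then sort and truncate per job.
top = df.groupby("job").apply(
    lambda x: x.groupby("source")[["count"]]
               .sum()
               .sort_values("count", ascending=False)
               .head(3)
)
print(top)
```

The ('sales', 'A') rows are combined into 8 before the per-group sort, so that pair now leads the sales group.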
Ted Petrou

When the grouped dataframe contains more than one grouping column (a "multi-index"), the other methods erase the remaining columns:

import pandas as pd

edf = pd.DataFrame({"job": ["sales", "sales", "sales", "sales", "sales",
                            "market", "market", "market", "market", "market"],
                    "source": ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
                    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
                    "other_col": [1, 2, 3, 4, 56, 6, 3, 4, 6, 11]})

gdf = edf.groupby(["job", "source"]).agg({"count": "sum", "other_col": "mean"})
gdf.groupby(level=0, group_keys=False).apply(lambda g: g.sort_values("count", ascending=False))

This keeps `other_col` and orders by the `count` column within each group.

haneulkim

I was getting this error without using "by":

TypeError: sort_values() missing 1 required positional argument: 'by'

So, I changed it to this and now it's working:

df.groupby(['job','source']).agg({'count':sum}).sort_values(by='count',ascending=False).head(20)

parvaneh shayegh

You can do it in one line -

df.groupby('job').apply(lambda x: x.sort_values('count', ascending=False)
                                   .head(3)
                                   .drop('job', axis=1))

What `apply()` does is take each group produced by the `groupby` and pass it as `x` to the lambda function.
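A minimal sketch of that mechanism, assuming the question's frame: each `x` below is the sub-frame for one job.

```python
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# apply() calls the lambda once per group, binding each group's sub-frame to x.
sizes = df.groupby("job").apply(lambda x: len(x))
print(sizes)
```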


@joris's answer helped a lot. This is what worked for me:

df.groupby(['job'])['count'].nlargest(3)
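Worth noting: this runs `nlargest` over the raw rows rather than over summed counts, so the result is indexed by (job, original row label); with the question's data (one row per (job, source) pair) the values coincide with the aggregated answers. A self-contained sketch:

```python
import pandas as pd

# Rebuild the question's frame.
df = pd.DataFrame({
    "count": [2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
    "job": ["sales"] * 5 + ["market"] * 5,
    "source": list("ABCDE") * 2,
})

# Top 3 raw 'count' values per job, keyed by (job, original row label).
res = df.groupby("job")["count"].nlargest(3)
print(res)
```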