Pandas groupby sort within groups retaining multiple aggregates

Question

I want to apply a sort and a limit within the groups returned by groupby, as in this question. However, I have multiple aggregates and I want all of the aggregates to be retained in the result.

Here is a simple example:

products = ["A", "B", "C", "D"]
stores = ["foo", "bar", "baz"]
n = 30

product_list = [products[i] for i in np.random.randint(0, len(products), n)]
store_list = [stores[i] for i in np.random.randint(0, len(stores), n)]
rating_list = np.random.random(n) * 5
sales_list = np.random.random(n) * 10000

df = pd.DataFrame(
    {'store': store_list, 
     'product': product_list, 
     'sales': sales_list, 
     'rating': rating_list})

df = df[['store', 'product', 'sales', 'rating']]

df[:5]

I want to group by store and product, and both sum and count the sales, while taking the mean of the rating.

This is straightforward:

dfg = df.groupby(['store', 'product']).agg({'sales': ['sum', 'count'], 
                                            'rating': 'mean'})

Now, I want to only keep the top two rated rows in each group. I can get this as follows (using a multi-level extension of the somewhat unintuitive [to me] incantations from 1):

g = dfg[('rating', 'mean')].groupby(
      level=0, group_keys=False).apply(
        lambda x: x.sort_values(ascending=False).head(2))
g

This returns the following Series:

store  product
bar    B          3.601135
       A          1.867449
baz    B          2.984196
       D          2.780500
foo    B          3.767912
       D          3.129346
Name: (rating, mean), dtype: float64

But I've lost the ('sales', 'sum') and ('sales', 'count') columns.

I suspect I need to extract g.index and use it in some way but have not been able to get this sorted (pun intended).

EDIT: The answers below manage to give the groups that I'm looking for, but what I'd really like is a stable sort where not only do I get the top N mean ratings within each group, but the groups themselves are sorted so that the first group has the highest rating, etc. To some extent this is just icing on the cake as I have the values I want now and would just like the report to be prettier.

score 5 · Accepted Answer · answered Jan 23 '18 at 02:41

5

I've sorted it. Instead of indexing the grouped table and doing the subsequent groupby and sort_values as above, I needed to apply the sort_values to the un-indexed DataFrame, specifying the column to sort on explicitly:

g = dfg.groupby(level=0, group_keys=False).apply(
      lambda x: x.sort_values(('rating', 'mean'), ascending=False).head(2))

Giving me the desired result:

answered Jan 23 '18 at 02:41

sfjac

7,119
5
45
69

Ah, sorry, didn't see you post your answer when I did. Shall I remove? – cs95 Jan 23 '18 at 02:43

cs95 · Answer 2 · 2018-01-23T03:15:29.103

3

This is how you'd do it, using sort_values + groupby + head -

dfg.sort_values(('rating', 'mean'), ascending=False)\
   .groupby(level=0)\
   .head(2)\
   .sort_index()

                 rating         sales      
                   mean           sum count
store product                              
bar   B        4.388521    636.813757     1
      C        3.931341   1843.772878     1
baz   A        2.501077  23110.162196     4
      B        3.339784  10610.257660     2
foo   B        2.785306  10315.968161     2
      D        2.160556  31084.181719     5

The final sort_index call sorts out the index-out-of-order problem (pun intended).

edited Jan 23 '18 at 03:15

answered Jan 23 '18 at 02:43

cs95

379,657
97
704
746

Right - I had managed that but the groups were not maintained. Should have mentioned it. – sfjac Jan 23 '18 at 02:43
1

@sfjac Sorry, forgot to add the important bit in ;) – cs95 Jan 23 '18 at 02:44
1

@sfjac Or `.sort_index(ascending=[True, False])` if you want finer control over how each level is sorted individually. – cs95 Jan 23 '18 at 02:52
@sfjac Hmm, `sort_index` sorts the index, but does not care about the order of the ratings. In the end, a groupby + apply + sort_values might be more appropriate... :( – cs95 Jan 23 '18 at 03:03
@sfjac Edit, never mind, just noticed it's exactly like yours. Sorry, I'm still not sure what your expected output is and how your current output is incorrect... can you help me understand better please? – cs95 Jan 23 '18 at 03:18
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/163683/discussion-between-sfjac-and-cs). – sfjac Jan 23 '18 at 06:03

Pandas groupby sort within groups retaining multiple aggregates

2 Answers2