I want to apply a sort and a limit within the groups returned by groupby
, as in this question. However, I have multiple aggregates and I want all of the aggregates to be retained in the result.
Here is a simple example:
products = ["A", "B", "C", "D"]
stores = ["foo", "bar", "baz"]
n = 30
product_list = [products[i] for i in np.random.randint(0, len(products), n)]
store_list = [stores[i] for i in np.random.randint(0, len(stores), n)]
rating_list = np.random.random(n) * 5
sales_list = np.random.random(n) * 10000
df = pd.DataFrame(
{'store': store_list,
'product': product_list,
'sales': sales_list,
'rating': rating_list})
df = df[['store', 'product', 'sales', 'rating']]
df[:5]
I want to group by store
and product
, and both sum
and count
the sales
, while taking the mean
of the rating
.
This is straightforward:
dfg = df.groupby(['store', 'product']).agg({'sales': ['sum', 'count'],
'rating': 'mean'})
Now, I want to only keep the top two rated rows in each group. I can get this as follows (using a multi-level extension of the somewhat unintuitive [to me] incantations from 1):
g = dfg[('rating', 'mean')].groupby(
level=0, group_keys=False).apply(
lambda x: x.sort_values(ascending=False).head(2))
g
This returns the following Series
:
store product
bar B 3.601135
A 1.867449
baz B 2.984196
D 2.780500
foo B 3.767912
D 3.129346
Name: (rating, mean), dtype: float64
But I've lost the ('sales', 'sum')
and ('sales', 'count')
columns.
I suspect I need to extract g.index
and use it in some way but have not been able to get this sorted (pun intended).
EDIT: The answers below manage to give the groups that I'm looking for, but what I'd really like is a stable sort where not only do I get the top N mean ratings within each group, but the groups themselves are sorted so that the first group has the highest rating, etc. To some extent this is just icing on the cake as I have the values I want now and would just like the report to be prettier.