
I've been trying to find out the top-3 highest frequency restaurant names under each type of restaurant

[screenshot of the input dataframe]

The columns are:

rest_type - Column for the type of restaurant

name - Column for the name of the restaurant

url - Column used for counting occurrences

This was the code that ended up working for me after some searching:

df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
             .apply(lambda x: x.sort_values(by='url', ascending=False).head(3))
             ['url'].reset_index().rename(columns={'url': 'count'}))

The final output was as follows:

[screenshot of the final output]

I had a few questions pertaining to the above code:

How are we able to group by `rest_type` again for the `datas` variable after already grouping on it earlier? Should it not raise a missing-column error? The second groupby operation is a bit confusing to me.

What does the first generated column, `level_0`, signify? When I tried the code with `as_index=True`, it created both an index and a column for `rest_type`, so I couldn't reset the index. Output below:

[screenshot of the output with as_index=True]

Thank you

Naman Sood
  • Please share a sample of your original `df` for a [MRE](https://stackoverflow.com/help/minimal-reproducible-example) – Corralien Jun 30 '21 at 08:00
  • From MRE, *"DO NOT use images of code. Copy the actual text from your code editor, paste it into the question, then format it as code. This helps others more easily read and test your code."* and read https://stackoverflow.com/q/20109391/15239951 – Corralien Jun 30 '21 at 08:19

2 Answers


You can use groupby a second time because `rest_type` is now part of the index, and groupby accepts index level names as grouping keys.
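A minimal sketch (with made-up data, not yours) showing that after the first groupby/agg the keys live in the index, where a second groupby can still find them:

```python
import pandas as pd

# Tiny stand-in for the restaurant data (values are assumptions).
df = pd.DataFrame({'rest_type': ['A', 'A', 'B'],
                   'name': ['x', 'y', 'x'],
                   'url': ['u1', 'u2', 'u3']})

# After groupby(...).agg, the grouping keys become index levels, not columns.
df_1 = df.groupby(['rest_type', 'name']).agg('count')
print(df_1.index.names)  # both keys are index levels now

# This works because 'rest_type' is resolved as an index level name.
print(df_1.groupby('rest_type').size())
```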

`level_0` comes from the `reset_index` call: because that index level is unnamed, `reset_index` assigns it the default name `level_0`.
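A tiny illustration, unrelated to your data: resetting an unnamed index falls back to the default names `level_0`, `level_1`, and so on:

```python
import pandas as pd

# A two-level index with no names: reset_index has nothing to reuse,
# so it generates 'level_0' and 'level_1' as column names.
s = pd.Series([10, 20], index=pd.MultiIndex.from_tuples([(0, 'A'), (1, 'B')]))
out = s.reset_index()
print(out.columns.tolist())  # ['level_0', 'level_1', 0]
```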

That said, and provided I understand your dataset correctly, I think you could achieve your goal more easily:

import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                  })

def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)
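For example, on a small fixed frame (made-up values), this returns the per-group top counts as a Series indexed by (rest_type, name):

```python
import pandas as pd

# Made-up sample data, just to show the shape of the result.
df = pd.DataFrame({'rest_type': ['A', 'A', 'A', 'B'],
                   'name':      ['x', 'x', 'y', 'z']})

def tops(s, n=3):
    # Count occurrences within the group and keep the n most frequent.
    return s.value_counts().sort_values(ascending=False).head(n)

res = df.groupby('rest_type')['name'].apply(tops, n=3)
print(res)  # (A, x) -> 2, (A, y) -> 1, (B, z) -> 1
```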

Edit: here is an alternative that formats the result as a dataframe with informative column names:

(df.groupby('rest_type')
   .apply(lambda x: x['name'].value_counts().nlargest(3))
   .reset_index().rename(columns={'name': 'counts', 'level_1': 'name'})
)
mozway

I have a similar case where the above query only partially works: in my case, the cooccurrence value always comes out as 1. Here is my input data frame:

[screenshot of the input dataframe]

And my query is below

top_five_family_cooccurence_df = (
    common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'})
)

I am getting this result:

[screenshot of the output dataframe]

The cooccurrence value is always 1.