How can I obtain the top n groups in pandas?

Question

I have a pandas dataframe. The final column in the dataframe is the max value of the RelAb column for each unique group (in this case, a species assignment) in the dataframe as obtained by:

df_melted['Max'] = df_melted.groupby('Species')['RelAb'].transform('max')

As you can see, the max value is repeated in every row of its group. Each group contains a large number of rows (about 100), and I have the df sorted by the max values. My goal is to obtain the top 20 groups based on the max value (i.e. a df with 100 x 20 = 2000 rows). I do not want to drop individual rows from groups in the dataframe, but rather entire groups.
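Since the screenshot is not reproduced here, the following is a minimal sketch of what the `transform('max')` line produces. The data, the `Sample` column, and the species names are made up for illustration; only `Species`, `RelAb`, and `Max` come from the question.

```python
import pandas as pd

# Hypothetical toy data; species names and values are invented for illustration.
df_melted = pd.DataFrame({
    "Sample": ["S1", "S2", "S1", "S2", "S1", "S2"],
    "Species": ["A", "A", "B", "B", "C", "C"],
    "RelAb": [0.10, 0.40, 0.30, 0.20, 0.05, 0.01],
})

# transform('max') returns one value per original row, so every row
# of a species carries that species' maximum RelAb.
df_melted["Max"] = df_melted.groupby("Species")["RelAb"].transform("max")

# Aggregating with .max() instead collapses to one value per species.
per_species_max = df_melted.groupby("Species")["RelAb"].max()

print(df_melted)
print(per_species_max)
```

The difference between the two results (one value per row versus one value per group) is what the rest of the thread relies on.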

I am pasting a subset of the dataframe where the max for a group changes from one "Max" value to the next:

[Image: original df]

My feeling is that I need to convert the max so that the one value represents the entire group and then sort based on that column, perhaps as such?

[Image: possible `df` to address inquiry]

For context, the reason I am doing this is because I am planning to make a stacked barchart with the most abundant species in the table for each sample. Right now, there are just way too many species, so it makes the stacked bar chart uninformative.

[Stack Overflow Discourages Screenshots](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). It is likely the question will be downvoted. You are discouraging assistance because no one wants to retype your data or code, and screenshots are often illegible. — Trenton McKinney, Nov 12 '19 at 20:28
Please [provide a reproducible copy of the DataFrame with `to_clipboard`](https://stackoverflow.com/questions/52413246/provide-a-reproducible-copy-of-the-dataframe-with-to-clipboard/52413247#52413247) — Trenton McKinney, Nov 12 '19 at 20:34
Hi Trenton, Sounds good. I tried doing the clipboard thing but was having issues because stack said I didn't have enough points accrued. — Protaeus, Nov 13 '19 at 15:19
Actually, after reading the link you sent @Trenton, I realized that was not at all the approach I was taking to upload the dataframe. I will follow those steps next time. — Protaeus, Nov 13 '19 at 21:26

score 1 · Accepted Answer · answered Nov 12 '19 at 20:46


One way to do it:


aux = (df_melted.groupby('Species')['RelAb']
           .max()
           .nlargest(20, keep='all')
           .to_list())

top20 = df_melted.loc[df_melted['Max'].isin(aux), :].copy()
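To try the snippet above without the original data, here is a small self-contained sketch of the same steps on a toy frame. The data and the `Sample` column are assumptions made up for illustration, and `n = 2` stands in for the 20 used above.

```python
import pandas as pd

# Hypothetical toy data (invented): 4 species, 2 samples each.
df_melted = pd.DataFrame({
    "Sample": ["S1", "S2"] * 4,
    "Species": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "RelAb": [0.10, 0.40, 0.30, 0.20, 0.05, 0.01, 0.15, 0.25],
})
df_melted["Max"] = df_melted.groupby("Species")["RelAb"].transform("max")

n = 2  # stands in for the 20 used in the answer

# Step 1: one maximum per species (a Series indexed by species name).
group_max = df_melted.groupby("Species")["RelAb"].max()

# Step 2: keep the n largest maxima; keep='all' retains ties at the cutoff.
top_values = group_max.nlargest(n, keep="all").to_list()

# Step 3: keep every row whose 'Max' is one of those values,
# so whole groups are kept or dropped together.
top_n = df_melted.loc[df_melted["Max"].isin(top_values), :].copy()

print(top_n)
```

With this toy data, species A and B have the two largest maxima, so all of their rows are kept and all rows of C and D are dropped.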

answered Nov 12 '19 at 20:46

usenk


Hello @usenk, This solution worked well. If you have an extra moment or two, could you explain what is happening during the steps? Your solution implemented the code I used to generate a `Max`, but I am curious if you could either circumvent adding that column altogether or incorporate that step into your line to make it more succinct? I'm fairly new to python/pandas, so I'm only asking to understand things a bit better. – Protaeus Nov 13 '19 at 20:46
Sure @Protaeus! Here's what's happening in my code: I calculate the maximums by group again, but now using the `apply` method, which collapses the dataframe so that groups do not repeat (i.e., in your case that means one row for B.dorei, one for Prevotella, etc.). This gives us a Series, to which I then apply the `nlargest` method, which selects, in this case, the 20 largest elements. I then use the `isin` method to get a Boolean array indicating whether the value of `Max` is one of the 20 largest (`isin(iterable)` tests for elements being in the iterable). – usenk Nov 14 '19 at 15:14
@Protaeus, you could avoid creating `Max` by replacing `.to_list()` with `.index`, and then replacing `df_melted['Max']` with `df_melted['Species']`. The idea is to obtain the *names* of the species that are top-20 by maximum RelAb, and then select the observations with those names. Hope this helps. – usenk Nov 14 '19 at 15:21
Thanks, @usenk! I greatly appreciate the info. You mention using the `apply` method, but I don't see that explicitly called? Regardless, it works :-) What does `copy()` do? I read the documentation, but it wasn't super clear what it is doing in this context and how I might use it in the future. I've seen in answers that you've provided to others that you return to it frequently. – Protaeus Nov 14 '19 at 16:01
@Protaeus, sorry about that `apply` thing, I was thinking of something else :D In actuality, I call the `max` method. `copy()` makes it explicit that you are creating a new dataframe, and not just a selection from the old one. Then you can safely change anything in your new dataframe and it will not affect the original one. – usenk Nov 14 '19 at 16:26
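The two suggestions from the comments (selecting by species name via `.index` instead of the `Max` column, and using `copy()`) can be sketched together as follows. This is only an illustration on made-up data; the `Sample` column, the values, and `n = 2` are assumptions, not part of the original dataframe.

```python
import pandas as pd

# Hypothetical toy data (invented); note there is no 'Max' column here.
df_melted = pd.DataFrame({
    "Sample": ["S1", "S2"] * 4,
    "Species": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "RelAb": [0.10, 0.40, 0.30, 0.20, 0.05, 0.01, 0.15, 0.25],
})

n = 2  # stands in for the 20 used in the answer

# Names of the n species with the largest maximum RelAb.
top_species = (df_melted.groupby("Species")["RelAb"]
               .max()
               .nlargest(n, keep="all")
               .index)

# Select by species name; no helper 'Max' column is needed.
top_n = df_melted.loc[df_melted["Species"].isin(top_species), :].copy()

# Because of .copy(), changes to top_n do not touch df_melted.
top_n["RelAb"] = top_n["RelAb"] * 100

print(top_n)
print(df_melted)
```

After the last assignment, `top_n` holds the rescaled values while `df_melted` still holds the original ones, which is the behaviour described for `copy()` in the comment above.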
