Can columns be included in a groupby statement that aren't directly involved in the calculations?

Question

I'm working with the recent_grads practice data from fivethirtyeight, and I'm trying to see which major ('Major') has the largest number of students ('Total') in each major category ('Major_category').

Here's an example dataframe:

Major              Major Category    Total
Petroleum Eng      Engineering       1001
Nuclear Eng        Engineering       4350
Marketing          Business          10035
Accounting         Business          3051

I would like to have output like the following:

Major              Major Category    Total
Nuclear Eng        Engineering       4350
Marketing          Business          10035

...where only the Majors that have the largest Total in each Major Category are returned.

I've used a groupby statement that returns the largest number of students in each major category like so:

recent_grads.groupby('Major_category')['Total'].agg('max')

As expected, this returns the largest student count in each population. What I can't figure out is where to insert the 'Major' variable in the above code so my output not only tells me what the largest student count in each major category is, but what major it belongs to as well. My code throws an error no matter where I try putting 'Major,' but it feels like I'm missing something obvious.

You should include an example dataset, so we can see visually with you mean. Most cases 5-10 rows is enough. Find more information [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Erfan, Apr 16 '19 at 14:48
I think you want `df.sort_values('Total').drop_duplicates('Major_category', keep='last')`. If so perhaps consider marking as a duplicate of https://stackoverflow.com/questions/15705630/python-getting-the-row-which-has-the-max-value-in-groups-using-groupby — ALollz, Apr 16 '19 at 14:54
Thanks for the feedback. Sample data added. I appreciate your patience. — Andy Holstein, Apr 16 '19 at 15:08

BENY · Accepted Answer · 2019-04-16T15:01:44.090

2

Try sort_values + drop_duplicates

recent_grads.sort_values(['Total']).drop_duplicates('Major_category',keep='last')

Or transform

s=recent_grads.groupby(['Major_category'])['Total'].transform('max')
recent_grads[recent_grads.Total==s]

Or idxmax

s=recent_grads.groupby(['Major_category'])['Total'].idxmax()
recent_grads.iloc[s]

edited Apr 16 '19 at 15:01

answered Apr 16 '19 at 14:57

BENY

317,841
20
164
234

Yes, the first one gets me what I needed! Thanks! I'm still thinking there must be a way to do it in the groupby statement I'm missing. – Andy Holstein Apr 16 '19 at 15:16

score 0 · Answer 2 · answered Apr 16 '19 at 14:43

0

You can groupby multiple columns:

recent_grads.groupby(['Major', 'Major_category'])['Total'].agg('max')

answered Apr 16 '19 at 14:43

tvgriek

1,215
9
20

Thanks for the quick reply! Unfortunately, I tried that earlier and it returns totals for all majors in major category; I'm just looking for the largest total in each major category, along with the corresponding major. – Andy Holstein Apr 16 '19 at 14:50

pythonjokeun · Answer 3 · 2019-04-16T15:03:10.020

0

Does this accomplish what you want?

recent_grads.groupby(['Major_category'])[["Major", "Total"]].max()

edited Apr 16 '19 at 15:03

answered Apr 16 '19 at 14:45

pythonjokeun

431
2
8

Thanks for the quick reply! Unfortunately, I tried that earlier and it returns totals for all majors in major category; I'm just looking for the largest total in each major category, along with the corresponding major. – Andy Holstein Apr 16 '19 at 14:49
@AndyHolstein like this ? `recent_grads.groupby(["Major_category"])[["Major", "Total"]].max()` – pythonjokeun Apr 16 '19 at 14:54
@Erfan that's that's why i'm asking "Does this accomplish what you want?" :) – pythonjokeun Apr 16 '19 at 14:55

score 0 · Answer 4 · answered Apr 16 '19 at 15:43

I think the following is an "pandas-idiomatic" way of doing this --

In [11]: df = pd.DataFrame({"Major": ["Petroleum Eng", "Nuclear Eng", "Marketing", "Accounting"],  
    ...:                 "Major Category": ["Engineering", "Engineering", "Business", "Business"], 
    ...:                 "Total": [1001, 4350, 10035, 3051] })                                                                                                                           

In [12]: df.groupby("Major Category").apply(lambda x: x.sort_values("Total").iloc[0,:])                                                                                                  
Out[12]: 
                        Major Major Category  Total
Major Category                                     
Business           Accounting       Business   3051
Engineering     Petroleum Eng    Engineering   1001

In [13]: df.groupby("Major Category").apply(lambda x: x.sort_values("Total", ascending=False).iloc[0,:])                                                                                 
Out[13]: 
                      Major Major Category  Total
Major Category                                   
Business          Marketing       Business  10035
Engineering     Nuclear Eng    Engineering   4350

If you have a bigger dataset, try profiling how long it takes to run these recipes.

The only performance metric for pandas that makes sense to me is how a recipe or snippet scales with number of rows. %timeit for small dataset doesn't really prove effective IMO.

Thanks, this is good stuff. Gotta admit, trying to learn all of this is a bit overwhelming. I have a good handle on statistics, but I'm a pretty weak programmer and want to improve my capabilities in that area (trying to break into data science). — Andy Holstein, Apr 16 '19 at 15:53
You are right in that it is overwhelming, but I think you need a primer in SQL to really appreciate pandas. I am not a programmer as well, but understanding Python and pandas idioms properly (and profiling my snippets, when in doubt) helps me do most if what I need to do. — Debanjan Basu, Apr 16 '19 at 17:30
If you are not completely invested into pandas, just use R ... Especially tidyr. It is the kind of functional framework that helps you write performant code with minimal hassles and equanimity. — Debanjan Basu, Apr 16 '19 at 17:34

Can columns be included in a groupby statement that aren't directly involved in the calculations?

4 Answers4