1

I'm working with the recent_grads practice data from fivethirtyeight, and I'm trying to see which major ('Major') has the largest number of students ('Total') in each major category ('Major_category').

Here's an example dataframe:

Major              Major Category    Total
Petroleum Eng      Engineering       1001
Nuclear Eng        Engineering       4350
Marketing          Business          10035
Accounting         Business          3051

I would like to have output like the following:

Major              Major Category    Total
Nuclear Eng        Engineering       4350
Marketing          Business          10035

...where only the Majors that have the largest Total in each Major Category are returned.

I've used a groupby statement that returns the largest number of students in each major category like so:

recent_grads.groupby('Major_category')['Total'].agg('max')

As expected, this returns the largest student count in each population. What I can't figure out is where to insert the 'Major' variable in the above code so my output not only tells me what the largest student count in each major category is, but what major it belongs to as well. My code throws an error no matter where I try putting 'Major,' but it feels like I'm missing something obvious.

Andy Holstein
  • 33
  • 1
  • 4
  • 1
    We'll likely need a [mcve] – rafaelc Apr 16 '19 at 14:44
  • 1
    You should include an example dataset, so we can see visually with you mean. Most cases 5-10 rows is enough. Find more information [here](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Erfan Apr 16 '19 at 14:48
  • I think you want `df.sort_values('Total').drop_duplicates('Major_category', keep='last')`. If so perhaps consider marking as a duplicate of https://stackoverflow.com/questions/15705630/python-getting-the-row-which-has-the-max-value-in-groups-using-groupby – ALollz Apr 16 '19 at 14:54
  • Thanks for the feedback. Sample data added. I appreciate your patience. – Andy Holstein Apr 16 '19 at 15:08

4 Answers4

2

Try sort_values + drop_duplicates

recent_grads.sort_values(['Total']).drop_duplicates('Major_category',keep='last')

Or transform

s=recent_grads.groupby(['Major_category'])['Total'].transform('max')
recent_grads[recent_grads.Total==s]

Or idxmax

s=recent_grads.groupby(['Major_category'])['Total'].idxmax()
recent_grads.iloc[s]
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Yes, the first one gets me what I needed! Thanks! I'm still thinking there must be a way to do it in the groupby statement I'm missing. – Andy Holstein Apr 16 '19 at 15:16
0

You can groupby multiple columns:

recent_grads.groupby(['Major', 'Major_category'])['Total'].agg('max')
tvgriek
  • 1,215
  • 9
  • 20
  • Thanks for the quick reply! Unfortunately, I tried that earlier and it returns totals for all majors in major category; I'm just looking for the largest total in each major category, along with the corresponding major. – Andy Holstein Apr 16 '19 at 14:50
0

Does this accomplish what you want?

recent_grads.groupby(['Major_category'])[["Major", "Total"]].max()
pythonjokeun
  • 431
  • 2
  • 8
  • Thanks for the quick reply! Unfortunately, I tried that earlier and it returns totals for all majors in major category; I'm just looking for the largest total in each major category, along with the corresponding major. – Andy Holstein Apr 16 '19 at 14:49
  • @AndyHolstein like this ? `recent_grads.groupby(["Major_category"])[["Major", "Total"]].max()` – pythonjokeun Apr 16 '19 at 14:54
  • @Erfan that's that's why i'm asking "Does this accomplish what you want?" :) – pythonjokeun Apr 16 '19 at 14:55
0

I think the following is an "pandas-idiomatic" way of doing this --

In [11]: df = pd.DataFrame({"Major": ["Petroleum Eng", "Nuclear Eng", "Marketing", "Accounting"],  
    ...:                 "Major Category": ["Engineering", "Engineering", "Business", "Business"], 
    ...:                 "Total": [1001, 4350, 10035, 3051] })                                                                                                                           

In [12]: df.groupby("Major Category").apply(lambda x: x.sort_values("Total").iloc[0,:])                                                                                                  
Out[12]: 
                        Major Major Category  Total
Major Category                                     
Business           Accounting       Business   3051
Engineering     Petroleum Eng    Engineering   1001

In [13]: df.groupby("Major Category").apply(lambda x: x.sort_values("Total", ascending=False).iloc[0,:])                                                                                 
Out[13]: 
                      Major Major Category  Total
Major Category                                   
Business          Marketing       Business  10035
Engineering     Nuclear Eng    Engineering   4350

If you have a bigger dataset, try profiling how long it takes to run these recipes.

The only performance metric for pandas that makes sense to me is how a recipe or snippet scales with number of rows. %timeit for small dataset doesn't really prove effective IMO.

Debanjan Basu
  • 834
  • 1
  • 8
  • 29
  • Thanks, this is good stuff. Gotta admit, trying to learn all of this is a bit overwhelming. I have a good handle on statistics, but I'm a pretty weak programmer and want to improve my capabilities in that area (trying to break into data science). – Andy Holstein Apr 16 '19 at 15:53
  • You are right in that it is overwhelming, but I think you need a primer in SQL to really appreciate pandas. I am not a programmer as well, but understanding Python and pandas idioms properly (and profiling my snippets, when in doubt) helps me do most if what I need to do. – Debanjan Basu Apr 16 '19 at 17:30
  • If you are not completely invested into pandas, just use R ... Especially tidyr. It is the kind of functional framework that helps you write performant code with minimal hassles and equanimity. – Debanjan Basu Apr 16 '19 at 17:34