5

I've got a fun one! And I've tried to find a duplicate question but was unsuccessful...

My dataframe consists of all U.S. states and territories for the years 2013-2016, with several attributes.

>>> df.head(2)
     state  enrollees  utilizing  enrol_age65  util_age65  year
1  Alabama     637247     635431       473376      474334  2013
2   Alaska      30486      28514        21721       20457  2013

>>> df.tail(2)
                state  enrollees  utilizing  enrol_age65  util_age65  year
214       Puerto Rico     581861     579514       453181      450150  2016
215  U.S. Territories      24329      16979        22608       15921  2016

I want to groupby year and state, and show the top 3 states (by 'enrollees' or 'utilizing' - does not matter) for each year.

Desired Output:

                                       enrollees  utilizing
year state                                                 
2013 California                          3933310    3823455
     New York                            3133980    3002948
     Florida                             2984799    2847574
...
2016 California                          4516216    4365896
     Florida                             4186823    3984756
     New York                            4009829    3874682

So far I've tried the following:

df.groupby(['year','state'])['enrollees','utilizing'].sum().head(3)

Which yields just the first 3 rows of the aggregated DataFrame:

                 enrollees  utilizing
year state                           
2013 Alabama        637247     635431
     Alaska          30486      28514
     Arizona        707683     683273

I've also tried a lambda function:

df.groupby(['year','state'])['enrollees','utilizing']\
  .apply(lambda x: np.sum(x)).nlargest(3, 'enrollees')

Which yields the 3 largest rows across the entire aggregated DataFrame, ignoring the per-year grouping:

                 enrollees  utilizing
year state                           
2016 California    4516216    4365896
2015 California    4324304    4191704
2014 California    4133532    4011208

I think it may have to do with the indexing of the GroupBy object, but I am not sure... Any guidance would be appreciated!

atlas
  • add `.sort_values(by='enrollees', ascending=False)` ***before*** the `.head(3)` – mrbTT Feb 08 '19 at 16:20
  • This will not keep the GroupBy object sorted by year. It will return the top 3 states across **all** years (which happens to be California in 2016, 2015, and 2014 in that order). – atlas Feb 08 '19 at 16:31
  • Try using nlargest(3, ['year', 'enrollees'])? Also see https://stackoverflow.com/questions/47703606/pandas-groupby-top-3-values-for-each-group – Tiblit Feb 08 '19 at 16:37
  • @DouglasPearson Passing a list like ['year','enrollees'] did not work - I had tried that as well :( I think nlargest must be called on an item not being used to index the groupby. As for the link, that solution only works with a group that has at least 3 rows. My groups have only 1 row (one state for each year). It also gets wonky since I'm grouping by two columns. – atlas Feb 08 '19 at 16:43
  • Why does `df.groupby(['State','Year'])['enrollees'].nlargest(3)` not seem to work?? – DJK Feb 08 '19 at 20:06
  • @DJK If you read my comment above, I explain that it will only take the top three largest rows, not the top three per year (see the sketch right after these comments). Please also see https://stackoverflow.com/questions/27842613/pandas-groupby-sort-within-groups for another answer in addition to the marked answer. – atlas Feb 08 '19 at 20:10
  • @atlas works for me when using this approach... – DJK Feb 08 '19 at 22:00
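
A minimal sketch of the year-only nlargest idea from the comments above, built only from the rows shown in the question (a year returns fewer than 3 rows when the sample has fewer states for it):

import pandas as pd

# stand-in for the question's dataframe, using the rows shown above
df = pd.DataFrame({
    'state':     ['Alabama', 'Alaska', 'Arizona', 'Puerto Rico', 'U.S. Territories'],
    'enrollees': [637247, 30486, 707683, 581861, 24329],
    'utilizing': [635431, 28514, 683273, 579514, 16979],
    'year':      [2013, 2013, 2013, 2016, 2016],
})

# keep 'state' in the index so nlargest can report which states were kept,
# then take each year's 3 largest 'enrollees' values
top3_enrollees = (df.set_index('state')
                    .groupby('year')['enrollees']
                    .nlargest(3))
print(top3_enrollees)

This keeps only the 'enrollees' column; the sketches under the answers below show variants that keep both value columns.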

2 Answers

3

Well, you could do something not that pretty.

First, get a list of the unique years using set():

years_list = list(set(df.year))

Next, define a concat helper function I've made in the past, plus a dummy dataframe to collect the results:

def concatenate_loop_dfs(df_temp, df_full, axis=0):
    """
    Avoids retyping the same concat line for every df.
    Pass the temporary df created at each loop iteration and the full df that
    accumulates all values, which must be initialized (outside the loop) as
    df_full = pd.DataFrame().
    """
    if df_full.empty:
        df_full = df_temp
    else:
        df_full = pd.concat([df_full, df_temp], axis=axis)

    return df_full

Create the dummy final df:

df_final = pd.DataFrame()

Now loop over each year, concatenating into the new DF:

for year in years_list:
    # query() filters the rows for this year; @year refers to the loop variable
    df2 = df.query("year == @year")

    # aggregate, sort by enrollees and keep the top 3 states for this year
    df_temp = (df2.groupby(['year', 'state'])[['enrollees', 'utilizing']]
                  .sum()
                  .sort_values(by='enrollees', ascending=False)
                  .head(3))
    # finally, call our helper that keeps concatenating the temporary DFs
    df_final = concatenate_loop_dfs(df_temp, df_final)
and done.

print(df_final)
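
For reference, a loop-free sketch along the same lines, assuming df is the question's dataframe: aggregate once, then apply plain DataFrame.nlargest to each year group of the aggregated frame.

agg = df.groupby(['year', 'state'])[['enrollees', 'utilizing']].sum()

# group_keys=False keeps the existing (year, state) index instead of
# prepending the year key a second time
df_top3 = (agg.groupby(level='year', group_keys=False)
              .apply(lambda g: g.nlargest(3, 'enrollees')))
print(df_top3)

The rows come back grouped by year and, within each year, sorted by 'enrollees' descending.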
mrbTT
  • I have tried your solution, however "year" is not a column of the groupby object - it is an index. The `years_list = list(set(df2.year))` will not work. I'll try to work around it though. – atlas Feb 08 '19 at 16:52
  • But it is a column in the `original df`, isn't it? Just replace `df2.year` with `df.year`. Or even `df['year']`, or even `df2.index`. – mrbTT Feb 08 '19 at 16:54
  • Hmm btw since it's an index the loop won't quite work, i'll edit it to suit you, just a second. Done. – mrbTT Feb 08 '19 at 16:56
  • You're WRONG!! About it not working :) I followed your code and it provided me the correct output :) Thank you! Although I wish there were an inline command/function available in pandas to help with this... – atlas Feb 08 '19 at 17:03
  • Glad to help :) – mrbTT Feb 08 '19 at 17:05
1

You then need to sort your GroupBy object with `.sort_values('enrollees', ascending=False)`.
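
A minimal sketch of how that sort can be combined with a per-year head(3), assuming df is the question's dataframe:

agg = df.groupby(['year', 'state'])[['enrollees', 'utilizing']].sum()

# sort once by enrollees, then keep the first 3 rows of each year group
top3 = (agg.sort_values('enrollees', ascending=False)
           .groupby(level='year')
           .head(3))

# head() leaves the rows in the overall sorted order; they are still the
# per-year top 3, just not displayed grouped under each year
print(top3)

Sorting the result's index on the 'year' level afterwards regroups the rows under each year.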

mbass
  • I've tried that as well...However, it sorts without keeping the integrity of the year index. It will just list, in descending order, the states with the highest sums without ordering them by year. – atlas Feb 08 '19 at 16:26