I am trying to use the groupby, nlargest, and sum functions in Pandas together, but I'm having trouble making them work.

State    County    Population
Alabama  a         100
Alabama  b         50
Alabama  c         40
Alabama  d         5
Alabama  e         1
...
Wyoming  a.51      180
Wyoming  b.51      150
Wyoming  c.51      56
Wyoming  d.51      5

I want to use groupby to group by state and get the top 2 counties by population, then use only those top 2 county population numbers to get a sum for that state.

In the end, I'll have a list that will have each state and the population of its top 2 counties.

I can get the groupby and nlargest to work, but getting the sum of the nlargest(2) is a challenge.

The line I have right now is simply: df.groupby('State')['Population'].nlargest(2)
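
For reference, a small frame matching the sample data above could be built like this (only the two states shown; the real data has every state):

import pandas as pd

# Build the sample from the question: five Alabama counties and four Wyoming counties
df = pd.DataFrame({'State': ['Alabama'] * 5 + ['Wyoming'] * 4,
                   'County': list('abcde') + ['a.51', 'b.51', 'c.51', 'd.51'],
                   'Population': [100, 50, 40, 5, 1, 180, 150, 56, 5]})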


2 Answers


You can use apply after performing the groupby:

df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())

I think the issue you're having is that df.groupby('State')['Population'].nlargest(2) returns a Series with a MultiIndex (State plus the original row index), so you can no longer do group-level operations on it. In general, if you want to perform multiple operations within a group, you'll need to use apply/agg.
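
To see why, here is what that intermediate result looks like on the small sample frame sketched in the question; the State/row MultiIndex is still attached, so a plain .sum() on it would collapse everything into one number:

df.groupby('State')['Population'].nlargest(2)
# State
# Alabama  0    100
#          1     50
# Wyoming  5    180
#          6    150
# Name: Population, dtype: int64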

The resulting output:

State
Alabama    150
Wyoming    330
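
If you'd rather end up with the two-column "list" described in the question (each state next to the sum for its top 2 counties) instead of a State-indexed Series, a reset_index on that result gives it to you:

result = df.groupby('State')['Population'].apply(lambda grp: grp.nlargest(2).sum())
result.reset_index()  # DataFrame with two columns: State and Population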

EDIT

A slightly cleaner approach, as suggested by @cs95:

df.groupby('State')['Population'].nlargest(2).sum(level=0)

This is slightly slower than using apply on larger DataFrames though.

Using the following setup:

import numpy as np
import pandas as pd
from string import ascii_letters

n = 10**6
df = pd.DataFrame({'A': np.random.choice(list(ascii_letters), size=n),
                   'B': np.random.randint(10**7, size=n)})

I get the following timings:

In [3]: %timeit df.groupby('A')['B'].apply(lambda grp: grp.nlargest(2).sum())
103 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit df.groupby('A')['B'].nlargest(2).sum(level=0)
147 ms ± 3.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The slower performance is potentially caused by the level kwarg in sum performing a second groupby under the hood.
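
Note that on newer pandas versions the level argument to sum has been deprecated and later removed, so if that call fails for you, the same idea can be spelled with an explicit second groupby on the index level (which, as far as I can tell, is what level=0 was doing under the hood anyway):

# Equivalent without the deprecated level kwarg
df.groupby('State')['Population'].nlargest(2).groupby(level=0).sum()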

  • You should fix this by getting rid of the apply, using `df.groupby('State')['Population'].nlargest(2).sum(level=0)` – cs95 Dec 28 '17 at 18:57
  • @cᴏʟᴅsᴘᴇᴇᴅ: your proposed solution of `.nlargest(2).sum(level=0)` is actually slower than using `apply` when the DataFrame size is substantial. The `level` kwarg in `sum` performs a second groupby operation under the hood, which I'm guessing is where the extra overhead comes from. – root Dec 28 '17 at 19:05
  • That's surprising. So one groupby + apply trumps two groupbys. Learned something new, cheers! – cs95 Dec 28 '17 at 19:07
  • Will your solution work if the counties were not unique? Imagine there are 2 rows like `Alabama;e;2` and `Alabama;e;39`: would `e` be considered instead of `c`? – Koray Tugay Sep 08 '18 at 02:39

Using agg, the grouping logic looks like:

df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})

This results in another DataFrame, which you could query to find the most populous states, etc.

           Population
State
Alabama    150
Wyoming    330
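
For example, to pull the most populous states out of that result (a sketch, with top2 as my own name for the aggregated frame):

top2 = df.groupby('State').agg({'Population': lambda x: x.nlargest(2).sum()})

top2.nlargest(1, 'Population')    # the state with the largest top-2 county total
top2[top2['Population'] > 100]    # states whose top two counties together exceed 100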