I am having a hard time using a groupby + where to apply a sum to a broader range.

Given this code:

from io import StringIO
import numpy as np
import pandas as pd

f = pd.read_csv(StringIO("""
fund_id,l_s,val
fund1,L,10
fund1,L,20
fund1,S,30
fund2,L,15
fund2,L,25
fund2,L,35
"""))


# fund total - works as expected
f['fund_total'] = f.groupby('fund_id')['val'].transform(np.sum)
# fund L total - applied only to L rows. 
f['fund_total_l'] = f[f['l_s'] == "L"].groupby('fund_id')['val'].transform(np.sum)
f

This code gets me close:

  fund_id l_s  val  fund_total  fund_total_l
0   fund1   L   10          60          30.0
1   fund1   L   20          60          30.0
2   fund1   S   30          60           NaN
3   fund2   L   15          75          75.0
4   fund2   L   25          75          75.0
5   fund2   L   35          75          75.0

The numbers are correct, but I would like the fund_total_l column to show 30 for all rows of fund1 (not just the L rows). I want a fund-level summary, with the sum filtered by the l_s column.

I know I can do this in multiple steps, but it needs to be a single operation. I can use a separate generic function if that helps.

playground: https://repl.it/repls/UnusualImpeccableDaemons

Sonic Soul

2 Answers


Use Series.where to create NaN values; these will be ignored in your sum:

f['val_temp'] = f['val'].where(f['l_s'] == "L")
f['fund_total_l'] = f.groupby('fund_id')['val_temp'].transform('sum')
f = f.drop(columns='val_temp')

Or in one line using assign:

f['fund_total_l'] = (
    f.assign(val_temp=f['val'].where(f['l_s'] == "L"))
    .groupby('fund_id')['val_temp'].transform('sum')
)

Another way would be to partly reuse your solution, then use DataFrame.reindex to get the original index back, and finally use ffill and bfill to fill up the NaN:

f['fund_total_l'] = (
    f[f['l_s'] == "L"]
    .groupby('fund_id')['val']
    .transform('sum')
    .reindex(f.index)
    .ffill()
    .bfill()
)

  fund_id l_s  val  fund_total_l
0   fund1   L   10          30.0
1   fund1   L   20          30.0
2   fund1   S   30          30.0
3   fund2   L   15          75.0
4   fund2   L   25          75.0
5   fund2   L   35          75.0
Erfan
  • I can't use a temporary column. For my needs it has to be a generic operation that returns a new series. Custom external functions are allowed. I guess I could plug it inline? – Sonic Soul Dec 27 '19 at 22:04
  • Did you check my second solution? Custom external functions can be really slow. Read [here](https://stackoverflow.com/a/54432584/9081267) – Erfan Dec 27 '19 at 22:04
  • Added a third solution which also does not create a temp column. – Erfan Dec 27 '19 at 22:10
  • I like that 2nd one, thanks! Just wondering if it could be simplified using something like `np.sum`, since it has a `where` parameter? Or maybe put a part of it in a generic func? – Sonic Soul Dec 27 '19 at 22:17
  • Hmm, not sure if that would work with `GroupBy.transform` – Erfan Dec 27 '19 at 22:27
  • You can try something like: `f.assign(boolcol=f['l_s'].eq('L')).groupby('fund_id').transform(lambda x: np.sum(x['val'].to_numpy(), where=x['boolcol']))` – Erfan Dec 27 '19 at 22:27
  • It's not terribly different, though it's possible to group by a Series that shares the same index. So you can `where` one Series, and group by another unmasked Series to ensure you have non-masked group keys for every row: `f['val'].where(f.l_s.eq('L')).groupby(f.fund_id).transform('sum')` (runnable sketch below) – ALollz Dec 27 '19 at 23:03
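
A minimal runnable sketch of that last comment's approach, assuming the same sample frame f as in the question:

from io import StringIO
import pandas as pd

f = pd.read_csv(StringIO("""
fund_id,l_s,val
fund1,L,10
fund1,L,20
fund1,S,30
fund2,L,15
fund2,L,25
fund2,L,35
"""))

# Mask non-L values to NaN, but group by the unmasked fund_id Series,
# so every row keeps its group key and receives the broadcast sum.
f['fund_total_l'] = f['val'].where(f.l_s.eq('L')).groupby(f.fund_id).transform('sum')
print(f)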

I think there is a more elegant solution, but I'm not able to broadcast the results back to the individual rows.

Essentially, with a boolean mask of all the "L" rows

f.groupby("fund_id").apply(lambda g:sum(g["val"]*(g["l_s"]=="L")))

you obtain

fund_id
fund1    30
fund2    75
dtype: int64

Now we can just merge, after using reset_index, to obtain

pd.merge(f, f.groupby("fund_id").apply(lambda g: sum(g["val"] * (g["l_s"] == "L"))).reset_index(), on="fund_id")

to obtain

  fund_id l_s  val   0
0   fund1   L   10  30
1   fund1   L   20  30
2   fund1   S   30  30
3   fund2   L   15  75
4   fund2   L   25  75
5   fund2   L   35  75
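One wrinkle: the merged column is named 0 because the aggregated Series is unnamed. A hedged variant that names the Series before the merge (the fund_total_l name here is my choice, not from the original answer):

summed = (
    f.groupby("fund_id")
     .apply(lambda g: sum(g["val"] * (g["l_s"] == "L")))
     .rename("fund_total_l")  # name the Series so the merged column gets a header
     .reset_index()
)
pd.merge(f, summed, on="fund_id")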

However, I'd guess that the merging is not necessary and the broadcast can be obtained directly in apply.
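
One sketch of that idea, assuming a reasonably recent pandas: if the lambda returns a Series aligned to each group's index, apply broadcasts the masked sum back to the individual rows without a merge.

f["fund_total_l"] = (
    f.groupby("fund_id", group_keys=False)
     # Broadcast each group's masked sum to every row of that group.
     .apply(lambda g: pd.Series((g["val"] * (g["l_s"] == "L")).sum(), index=g.index))
)

group_keys=False keeps the original row index on the concatenated result, so the assignment aligns row by row.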

Quickbeam2k1
  • Hmm, when I run this in my local notebook I get `ValueError: Wrong number of items passed 2, placement implies 1`, but when I run it in the repl I provided, there are no errors but also no new column. – Sonic Soul Dec 29 '19 at 18:08
  • If I simplify to remove the merge, I see a series with 2 rows.. which is good! Just need it to return a full series.. will try to figure out why my join fails with that error – Sonic Soul Dec 29 '19 at 18:09
  • In your recent code (on repl.it) just run `pd.merge(f, x, on="fund_id")`, which gives the desired output. I just ran it there; no error is raised – Quickbeam2k1 Jan 01 '20 at 20:55