Pandas Group By Filtering Simplification

Question

I've got a data frame that has been binned into age groups ('AgeGroups' column), then filtered to those below the poverty level (100). I'm wondering if there is a simple way to calculate the count of those below poverty divided by the total amount of people, or poverty rate. This works but doesn't seem very pythonic.

The "PWGTP" column is the weight used to sum in this scenario.

pov_rate = df[df['POV'] <= 100].groupby('AgeGroups').sum()['PWGTP'] /df.groupby('AgeGroups').sum()['PWGTP']

Thank you

Please include some sample data and sample expected output. Please see: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — David Erickson, Sep 26 '20 at 03:20

score 1 · Accepted Answer · answered Sep 26 '20 at 04:38

1

It's not clear from your description why you need a groupby. The data is already binned. Why not simply create a poverty rate column?

df['pov_rate']=(df['POV']<100)*df['PWGTP']/df['PWGTP'].sum()

answered Sep 26 '20 at 04:38

LevB

925
6
10

jezrael · Answer 2 · 2020-09-26T05:14:37.600

Some another solutions:

Filtered only column PWGTP for aggregate sum, very important if more numeric columns:

pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
            df.groupby('AgeGroups')['PWGTP'].sum())
print (pov_rate)

Only one groupby with helper column filt:

pov_rate = (df.assign(filt = df['PWGTP'].where(df['POV'] <= 100))
              .groupby('AgeGroups')[['filt','PWGTP']].sum()
              .eval('filt / PWGTP'))

print (pov_rate)

Performance depends of number of groups, number of matched rows, number of numeric columns and length of Dataframe, so in real data should be different.

np.random.seed(2020)

N = 1000000
df = pd.DataFrame({'AgeGroups':np.random.randint(10000,size=N),
                   'POV': np.random.randint(50, 500, size=N),
                   'PWGTP':np.random.randint(100,size=N),
                   'a':np.random.randint(100,size=N),
                   'b':np.random.randint(100,size=N),
                   'c':np.random.randint(100,size=N)})
# print (df)

In [13]: %%timeit
    ...: pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups').sum()['PWGTP'] /
    ...:             df.groupby('AgeGroups').sum()['PWGTP'])
    ...:             
209 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]: %%timeit 
    ...: pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
    ...:             df.groupby('AgeGroups')['PWGTP'].sum())
    ...:             
85.8 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [15]: %%timeit
    ...: pov_rate = (df.assign(filt = df['PWGTP'].where(df['POV'] <= 100))
    ...:               .groupby('AgeGroups')[['filt','PWGTP']].sum()
    ...:               .eval('filt / PWGTP'))
    ...:               
122 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas Group By Filtering Simplification

2 Answers2