Some alternative solutions:
Filter only the PWGTP column before the aggregate sum — this matters a lot when the DataFrame has more numeric columns, because it avoids summing columns you never use:
pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
            df.groupby('AgeGroups')['PWGTP'].sum())
print(pov_rate)
Only one groupby, using a helper column filt:
pov_rate = (df.assign(filt=df['PWGTP'].where(df['POV'] <= 100))
              .groupby('AgeGroups')[['filt', 'PWGTP']].sum()
              .eval('filt / PWGTP'))
print(pov_rate)
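A quick sanity check on toy data (the values below are made up for illustration) showing that both approaches compute the same weighted rate per group — the share of each group's PWGTP weight belonging to rows with POV <= 100:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'AgeGroups': ['a', 'a', 'b', 'b'],
                   'POV': [50, 200, 80, 90],
                   'PWGTP': [10, 30, 20, 20]})

# Approach 1: filter, then two groupby sums
r1 = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
      df.groupby('AgeGroups')['PWGTP'].sum())

# Approach 2: single groupby over a masked helper column
r2 = (df.assign(filt=df['PWGTP'].where(df['POV'] <= 100))
        .groupby('AgeGroups')[['filt', 'PWGTP']].sum()
        .eval('filt / PWGTP'))

print(r1)  # a: 10/40 = 0.25, b: 40/40 = 1.0
```

One subtle difference to be aware of: for a group with no matching rows, approach 1 yields NaN (the group is missing from the numerator), while approach 2 yields 0.0 (the sum of an all-NaN filt column is 0 under pandas' default skipna behavior).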
Performance depends on the number of groups, the number of matched rows, the number of numeric columns and the length of the DataFrame, so results on real data may differ.
import numpy as np
import pandas as pd

np.random.seed(2020)
N = 1000000
df = pd.DataFrame({'AgeGroups': np.random.randint(10000, size=N),
                   'POV': np.random.randint(50, 500, size=N),
                   'PWGTP': np.random.randint(100, size=N),
                   'a': np.random.randint(100, size=N),
                   'b': np.random.randint(100, size=N),
                   'c': np.random.randint(100, size=N)})
# print (df)
In [13]: %%timeit
...: pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups').sum()['PWGTP'] /
...: df.groupby('AgeGroups').sum()['PWGTP'])
...:
209 ms ± 7.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [14]: %%timeit
...: pov_rate = (df[df['POV'] <= 100].groupby('AgeGroups')['PWGTP'].sum() /
...: df.groupby('AgeGroups')['PWGTP'].sum())
...:
85.8 ms ± 332 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [15]: %%timeit
...: pov_rate = (df.assign(filt = df['PWGTP'].where(df['POV'] <= 100))
...: .groupby('AgeGroups')[['filt','PWGTP']].sum()
...: .eval('filt / PWGTP'))
...:
122 ms ± 388 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)