For each row, I need to get (a) the count and (b) the mean of the scores of all rows with the same member_ID whose date falls within the 28 days prior to that row's date (excluding the row's own date).
df:
        member_ID     score      date    past28
0              93  0.341937  20090515  20090417
1             496  0.075530  20090515  20090417
2             742  0.072468  20090515  20090417
3             668       NaN  20090515  20090417
4             351  0.581490  20090515  20090417
...           ...       ...       ...       ...
799995        792       NaN  20230225  20230128
799996        180  0.251769  20230225  20230128
799997        413  0.195291  20230225  20230128
799998        786  0.389303  20230225  20230128
799999        235  0.545911  20230225  20230128
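For reference, a synthetic frame of the same shape can be built roughly like this (made-up IDs and scores; the real data has about 800,000 rows, dates stored as YYYYMMDD integers, and past28 precomputed as the date minus 28 days):

import numpy as np
import pandas as pd

# Synthetic stand-in for the real frame (hypothetical values).
rng = np.random.default_rng(0)
all_dates = pd.date_range('2009-05-15', '2023-02-25', freq='D')
n = 800_000

df = pd.DataFrame({
    'member_ID': rng.integers(0, 1000, size=n),
    'score': np.where(rng.random(n) < 0.1, np.nan, rng.random(n)),
    'date': rng.choice(all_dates, size=n),
}).sort_values('date', kind='stable', ignore_index=True)

# past28 is just the row's date minus 28 days; both are stored as YYYYMMDD integers
df['past28'] = (df['date'] - pd.Timedelta(days=28)).dt.strftime('%Y%m%d').astype(int)
df['date'] = df['date'].dt.strftime('%Y%m%d').astype(int)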
I want rows with a NaN score to still be handled by the main loop (i.e. to get their own qualifying rows), but I exclude NaN-score rows from qualifying_rows so that score_counts and score_averages only reflect valid scores.
import pandas as pd

score_counts = pd.Series([0] * len(df))
score_averages = pd.Series([None] * len(df))

for member_id, id_rows in df.groupby('member_ID'):
    for date, date_rows in id_rows.groupby('date'):
        # There can be multiple rows for the same member_ID on the same date, so I use
        # the first index of date_rows to exclude the current date from past_member_rows
        # (this relies on the default integer index being sorted by date)
        past_member_rows = id_rows.loc[:date_rows.index[0] - 1]
        # For any given row, its past28 value is its date value minus 28 days,
        # so this keeps the rows from the last 28 days that have a valid score
        qualifying_rows = past_member_rows.loc[
            (past_member_rows['date'] >= date_rows['past28'].iloc[0])
            & (~past_member_rows['score'].isnull())
        ]
        score_counts[date_rows.index] = len(qualifying_rows)
        score_averages[date_rows.index] = qualifying_rows['score'].mean()
With this many .loc calls it obviously takes a very long time, and I'd love to hear recommendations for how to rethink it. I usually rely on vectorization, but with this relatively complex row-by-row lookup I haven't worked out how to apply it here.
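One direction I've been eyeing is a per-member time-based rolling window, roughly like the sketch below (untested against the real data; it assumes the integer dates parse cleanly and that closed='left' reproduces my "on or after past28 but strictly before the current date" rule). Is something along these lines the right way to go, or is there a better pattern?

import pandas as pd

# Sketch only: parse the integer dates, then use a 28-day rolling window per member.
# closed='left' is meant to cover [date - 28 days, date), i.e. exclude the current date.
df['date_dt'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
df = df.sort_values(['member_ID', 'date_dt'])  # rolling time windows need sorted dates per group

roll = df.groupby('member_ID').rolling('28D', on='date_dt', closed='left')['score']

# count() and mean() both skip NaN scores, which matches the filtering in my loop;
# droplevel(0) strips the member_ID level so the results align back on the original index
df['score_count'] = roll.count().droplevel(0)
df['score_average'] = roll.mean().droplevel(0)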