group-by with specific conditions and for each row

Question

I've been pulling my hair out over this question for an hour.

I have a table with person-id, application-date, event-date, and event-type. For each combination of person-id and application-date, I would like to count how many events of a specific type preceded the application-date. What makes this hard for me is that the constraint on application-date is 'fixed': the cutoff comes from the combination being evaluated and is applied to all of that person's rows, rather than being taken from each row's own application-date. E.g. if the person-id is A and the application-date is 2022-01-01, I want to count the events in all of A's rows that are dated before 2022-01-01, and I don't want the condition to change to another date (e.g. 2021-01-01) depending on the row.

I wrote a loop that is horribly inefficient and doesn't scale well.

Example data:

import pandas as pd
df = pd.DataFrame({'person_id': ['A', 'A', 'A','B'], 
               'applied_on': pd.to_datetime(["2020-01-01", "2020-01-01","2021-01-01","2021-01-01"]),
               'event_date': pd.to_datetime(["2020-03-01", "2020-09-01","2021-03-01","2020-03-01"]),
               'event_indicator': [1, 0, 0, 1]}) #event_indicator indicates whether an event is a certain type

My implementation:

unique = df[['person_id','applied_on']].drop_duplicates().reset_index()

result =[]
for index, row in unique[['person_id','applied_on']].iterrows():
    person = row['person_id']
    date = row['applied_on']
    loc = (df['person_id'] == person) & (df.event_date < date)
    numEvent = df.event_indicator[loc].sum()
    result.append([index, person, date, numEvent])

Appreciate any help here for making it better!


Refrain from showing your dataframe as an image. Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. — itprorh66, Aug 23 '22 at 18:11

score 0 · Answer 1 · answered Aug 24 '22 at 09:16

I think the best approach, for both simplicity and performance, is to precompute the cumulative number of events per person up to each event date using groupby and cumsum, and then to look up each applied_on date in that structure using timestamp-based indexing.

## change the sample data so the counts aren't all 1's and 0's
import pandas as pd

df = pd.DataFrame(
    {
        "person_id": ["A", "A", "A", "B"],
        "applied_on": pd.to_datetime(["2020-01-01", "2020-01-01", "2021-01-01", "2021-01-01"]),
        "event_date": pd.to_datetime(["2020-03-01", "2020-09-01", "2021-03-01", "2020-03-01"]),
        "event_indicator": [1, 1, 0, 1],
    }
)  # event_indicator indicates whether an event is a certain type


events_by = df.groupby("person_id").apply(
    lambda person_data: person_data.set_index("event_date").sort_index()["event_indicator"].cumsum()
)
print(events_by)
## person_id  event_date
## A          2020-03-01    1
##            2020-09-01    2
##            2021-03-01    2
## B          2020-03-01    1
## Name: event_indicator, dtype: int64

# then the lookup:
results = []
for (person_id, applied_on), _ in df.groupby(["person_id", "applied_on"]):
    # label slicing is inclusive, so an event dated exactly on applied_on is counted too
    prior = events_by[person_id].loc[:applied_on]
    event_count = prior.iloc[-1] if not prior.empty else 0
    results.append([person_id, applied_on, event_count])
print(results)
## [['A', Timestamp('2020-01-01 00:00:00'), 0],
##  ['A', Timestamp('2021-01-01 00:00:00'), 2],
##  ['B', Timestamp('2021-01-01 00:00:00'), 1]]

This uses the performant built-in groupby, sort_index and cumsum functions in the first stage and pandas' timestamp-based label slicing in the second stage. The second stage is still a Python loop, but it runs only once per unique (person_id, applied_on) pair, so I imagine such an approach would scale well.
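If the remaining per-pair loop ever becomes the bottleneck, one possible alternative (a different technique from the cumsum lookup in this answer, not a refinement of it) is a merge followed by a filter. The sketch below assumes the modified sample data from this answer; the names pairs, candidates, prior, counts and num_events are made up for illustration. It uses a strict `<` comparison, as in the question's original loop.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "person_id": ["A", "A", "A", "B"],
        "applied_on": pd.to_datetime(["2020-01-01", "2020-01-01", "2021-01-01", "2021-01-01"]),
        "event_date": pd.to_datetime(["2020-03-01", "2020-09-01", "2021-03-01", "2020-03-01"]),
        "event_indicator": [1, 1, 0, 1],
    }
)

# One row per (person_id, applied_on) combination.
pairs = df[["person_id", "applied_on"]].drop_duplicates()

# Pair every application with every event of the same person.
events = df[["person_id", "event_date", "event_indicator"]]
candidates = pairs.merge(events, on="person_id")

# Keep only events strictly before the application date, then sum per pair.
prior = candidates[candidates["event_date"] < candidates["applied_on"]]
counts = prior.groupby(["person_id", "applied_on"])["event_indicator"].sum()

# Pairs without any prior event are missing from counts; reindex restores them as 0.
result = (
    counts.reindex(pd.MultiIndex.from_frame(pairs), fill_value=0)
    .rename("num_events")
    .reset_index()
)
print(result)
```

Note the trade-off: the merge creates one row per (application, event) combination for each person, so memory use grows with the product of the two. For people with very many applications and events, the cumsum lookup above may be the better fit.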
