I have two dataframes.
Dataframe allocatedTimes
collects the allocated time (for a given activity) for a person identified by an ID.
ID: id of the person
startTime: allocated slot start datetime
endTime: allocated slot end datetime
weekday: calculated corresponding weekday
Dataframe events
collects the events associated to each person/ID.
ID: id of the person
eventStartTime: event start datetime
eventEndTime: event end datetime
weekday: calculated corresponding weekday
What I want: for each ID, count the number of events between the allocated time slots.
The problem: each allocated time slot is recurring, i.e. each person sets a few hours a day for each day of the week, so the allocatedTimes
dataframe collects all of those dates. Let's say PersonA has allocated five hours each Monday from 8am to 1pm, and set it as recurrent for a year. In the allocatedTimes
dataframe we'd have all Monday Datetimes from now oct/2022 to oct/2023 as such:
ID startTime endTime weekday
personA 2022-10-24 08:00:00 2022-10-24 13:00:00 Monday
personA 2022-10-31 08:00:00 2022-10-31 13:00:00 Monday
personA 2022-11-07 08:00:00 2022-11-07 13:00:00 Monday
personA ... ... ...
and in the events
dataframe, we can find some events like:
ID eventStartTime eventEndTime weekday
personA 2022-10-24 08:15:00 2022-10-24 08:30:00 Monday
personA 2022-10-24 09:40:00 2022-10-24 10:30:00 Monday
personA 2022-10-31 09:15:00 2022-10-31 09:30:00 Monday
personA 2022-10-31 10:15:00 2022-10-31 10:30:00 Monday
personA 2022-10-31 11:15:00 2022-10-31 11:30:00 Monday
personA 2022-11-07 11:00:00 2022-11-07 12:00:00 Monday
personA 2022-11-07 12:15:00 2022-11-07 12:40:00 Monday
personA ... ... ...
The result I want is something like this:
ID startTime endTime weekday nrEvents
personA 2022-10-24 08:00:00 2022-10-24 13:00:00 Monday 2
personA 2022-10-31 08:00:00 2022-10-31 13:00:00 Monday 3
personA 2022-11-07 08:00:00 2022-11-07 13:00:00 Monday 2
personB ... ... ... ...
My very straightforward attempt:
df = []
for id in list(allocatedTimes.ID):
res = allocatedTimes[allocatedTimes.ID == id]
for index, row in res.iterrows():
mask = (events.eventStartTime >= row.startTime) & (event.eventEndTime <= row.endTime)
data = {
'ID' : id,
'startTime' : row.startTime,
'endTime' : row.endTime,
'weekday' : row.weekday,
'nrEvents' : len(events.loc[mask])
}
df.append(data)
df = pd.DataFrame(df)
But this takes a lot of time to compute even on small subsets of those dataframes. So what I'm looking for is a more efficient way to reproduce this same result. Any help or suggestion is appreciated.