My data has a few hundred thousand rows and the following columns:
- Time (yyyy-MM-dd hh:mm:ss.ffffff)
- ID (string)
- Group1 (int32)
- Group2 (int32)
I would like to count, for each event, how many prior events came from the same Group1 and Group2 within a time window of 5 minutes.
For example:
ID            Time                     Group1  Group2
61ED2269CCAC  2020-07-27 00:01:05.781  1234    100123
61C2DC4E96FA  2020-07-27 00:01:17.279  1234    100123
FAD0839C1A95  2020-07-27 00:02:38.112  1234    100124
A2750A7B6C24  2020-07-27 00:16:50.592  4321    100123
03F5DF150A3C  2020-07-27 00:17:00.246  4321    100124
How many events occurred after Timestamp('2020-07-26 23:56:17.279000') (5 minutes before the second event), before the second event itself, and with the same Group1 and Group2? The second event would therefore have a counter of 1 in this example; the rest would have a counter of 0, since their group combinations are unique.
Each event should have a counter indicating how many events came before it from the same groups.
I tried sorting the data by the groups and by Time, then running a nested loop: the outer one over all events, the inner one from the start up to the current event's index. After a few thousand rows, the process slows down significantly, making this approach infeasible.
I was wondering if there is any other elegant and efficient way of doing this.
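One vectorized direction to explore (a sketch, assuming the column names from the question): pandas' time-based rolling windows can count events in a trailing 5-minute window per (Group1, Group2) group, and since each window includes the current row, subtracting 1 leaves only the prior events:

```python
import pandas as pd

# Sample data mirroring the example above.
df = pd.DataFrame({
    'ID': ['61ED2269CCAC', '61C2DC4E96FA', 'FAD0839C1A95',
           'A2750A7B6C24', '03F5DF150A3C'],
    'Time': pd.to_datetime(['2020-07-27 00:01:05.781',
                            '2020-07-27 00:01:17.279',
                            '2020-07-27 00:02:38.112',
                            '2020-07-27 00:16:50.592',
                            '2020-07-27 00:17:00.246']),
    'Group1': [1234, 1234, 1234, 4321, 4321],
    'Group2': [100123, 100123, 100124, 100123, 100124],
})

def window_count(g):
    # Trailing 5-minute window, open on the left: (t - 5min, t].
    # The count includes the current row itself, so subtract 1.
    return g.rolling('5min', on='Time')['ID'].count() - 1

df['prior_events'] = (
    df.sort_values('Time')
      .groupby(['Group1', 'Group2'], group_keys=False)
      .apply(window_count)
      .astype(int)
)
```

On the five-row example this gives a counter of 1 for the second event and 0 everywhere else. Note that offset-based rolling windows are closed on the right by default, so an event exactly 5 minutes earlier falls outside the window, matching the "after Timestamp(...)" phrasing.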
EDIT: I was able to do this with a single for-loop instead of a nested one. For each row I took the Time and the group values, sliced the DataFrame down to the events in the same groups and within the desired time frame, and then counted the remaining events:
from datetime import timedelta

for i in tqdm(range(len(df))):
    time_stamp = df.loc[i, 'Time']
    group1 = df.loc[i, 'Group1']
    group2 = df.loc[i, 'Group2']
    # keep events within the 5 minutes preceding this one
    sub_df = df[df['Time'] > time_stamp - timedelta(minutes=5)]
    sub_df = sub_df[sub_df['Time'] < time_stamp]
    sub_df = sub_df[sub_df['Group1'] == group1]
    sub_df = sub_df[sub_df['Group2'] == group2]
    # len() counts rows; .size would count rows * columns
    df.loc[i, 'prior_events'] = len(sub_df)
Still, tqdm shows only about 18 iterations per second, which is far too slow for 100k rows.
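If the per-row slicing is the bottleneck, another sketch worth trying is a per-group NumPy `searchsorted` pass: after sorting each group by Time, the number of prior events in the window is just the row's position minus the index of the first event inside the window (column names assumed from the question; ties at the exact same timestamp count as prior events here):

```python
import numpy as np
import pandas as pd

def add_prior_events(df):
    """For each row, count earlier events from the same (Group1, Group2)
    within the preceding 5 minutes (window open on the left)."""
    df = df.sort_values(['Group1', 'Group2', 'Time']).reset_index(drop=True)
    out = np.empty(len(df), dtype=np.int64)
    pos = 0
    for _, g in df.groupby(['Group1', 'Group2'], sort=False):
        t = g['Time'].to_numpy()
        # First index whose time is strictly after t_i - 5 minutes ...
        left = np.searchsorted(t, t - np.timedelta64(5, 'm'), side='right')
        # ... so the prior-event count is each row's position minus that index.
        out[pos:pos + len(t)] = np.arange(len(t)) - left
        pos += len(t)
    df['prior_events'] = out
    return df

# Hypothetical data with the example's timestamps and groups.
sample = pd.DataFrame({
    'ID': list('ABCDE'),
    'Time': pd.to_datetime(['2020-07-27 00:01:05.781',
                            '2020-07-27 00:01:17.279',
                            '2020-07-27 00:02:38.112',
                            '2020-07-27 00:16:50.592',
                            '2020-07-27 00:17:00.246']),
    'Group1': [1234, 1234, 1234, 4321, 4321],
    'Group2': [100123, 100123, 100124, 100123, 100124],
})
res = add_prior_events(sample)
```

The searches inside each group are vectorized, so the Python-level loop runs once per (Group1, Group2) pair rather than once per row.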