
I would like to group events that happened close together into a parent event.

An example is this:

import pandas as pd

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1'],
        ['2020-01-01 10:01', '2'],
        ['2020-01-01 10:02', '3a'],
        ['2020-01-01 10:02', '3b'],
        ['2020-01-01 10:30', '4'],
        ['2020-01-01 10:50', '5'],
        ['2020-01-01 10:54', '6'],
        ['2020-01-01 10:55', '7'],
    ], columns=['event_time', 'event_id']
)
df['event_time'] = pd.to_datetime(df['event_time'])

In the above example, with a window size of 1 minute, what I want is, for each event, a list of all other events that occurred within ±1 minute of it.

So something like:

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', ['2']],
        ['2020-01-01 10:01', '2', ['1','3a','3b']],
        ['2020-01-01 10:02', '3a', ['2','3b']],
        ['2020-01-01 10:02', '3b', ['3a', '2']],
        ['2020-01-01 10:30', '4', None],
        ['2020-01-01 10:50', '5', None],
        ['2020-01-01 10:54', '6', ['7']],
        ['2020-01-01 10:55', '7', ['6']],
    ], columns=['event_time', 'event_id', 'related_event_id_list']
)

I was able to get close with pandas merge_asof:

pd.merge_asof(df, df, on="event_time", tolerance=pd.Timedelta("1m"), allow_exact_matches=False, direction='nearest')


but it seems to only merge each row to its single nearest event; there doesn't appear to be an option to merge all events within the tolerance for each row (which would naturally produce more rows in the result, so it's more like an outer join on a time range).
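For small frames, one rough sketch of that "outer join on a range" idea (not merge_asof; a self cross-join plus a tolerance filter, where how='cross' needs pandas 1.2+) could look like this:

# Cross-join the frame to itself, keep pairs within the tolerance,
# then collect the related ids into a list per event.
# The n*n intermediate frame means this only suits small inputs.
pairs = df.merge(df, how='cross', suffixes=('', '_other'))
within = (pairs['event_time'] - pairs['event_time_other']).abs() <= pd.Timedelta("1m")
not_self = pairs['event_id'] != pairs['event_id_other']
related = pairs[within & not_self].groupby('event_id')['event_id_other'].apply(list)
df['related_event_id_list'] = df['event_id'].map(related)  # NaN where no neighbours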

I was thinking that grouping time-series events like this must be fairly common, but I could not find any obvious functions in pandas to do it, so maybe I'm missing something.

As always, I'm trying to avoid writing the loops myself if I can avoid it :)

andrewm4894

1 Answer


One idea is to filter the DataFrame per row and remove the original row with DataFrame.drop:

td = pd.Timedelta("1m")
# for each timestamp y, take every event_id within +- td,
# then drop the row's own index x so an event does not match itself
f = lambda x, y: df.loc[df['event_time'].between(y - td, y + td),
                        'event_id'].drop(x).tolist()
df['related_event_id_list'] = [f(k, v) for k, v in df['event_time'].items()]
print(df)
           event_time event_id related_event_id_list
0 2020-01-01 10:00:00        1                   [2]
1 2020-01-01 10:01:00        2           [1, 3a, 3b]
2 2020-01-01 10:02:00       3a               [2, 3b]
3 2020-01-01 10:02:00       3b               [2, 3a]
4 2020-01-01 10:30:00        4                    []
5 2020-01-01 10:50:00        5                    []
6 2020-01-01 10:54:00        6                   [7]
7 2020-01-01 10:55:00        7                   [6]
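A note on scale (see the comments below): the list comprehension rebuilds a boolean mask over the whole frame for every row, so it is roughly O(n²) in time and memory. A sketch of a more scalable variant using np.searchsorted on sorted timestamps (my own assumption, not part of the original answer):

import numpy as np

s = df.sort_values('event_time').reset_index(drop=True)
times = s['event_time'].to_numpy()  # datetime64[ns], sorted
ids = s['event_id'].to_numpy()
tol = np.timedelta64(1, 'm')

# binary-search the window bounds for every row: O(n log n) overall,
# and only the per-window slices are ever materialized
lo = np.searchsorted(times, times - tol, side='left')
hi = np.searchsorted(times, times + tol, side='right')
s['related_event_id_list'] = [
    [ids[j] for j in range(a, b) if j != i]  # exclude the row itself
    for i, (a, b) in enumerate(zip(lo, hi))
]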
jezrael
  • Brilliant solution, but I'm working on 300K rows. It took ~10 hours to throw an error: MemoryError: Unable to allocate 247. KiB for an array with shape (31570,) and data type int64. Is there another way to avoid this? – DrWhat Mar 24 '23 at 06:40
  • @DrWhat - could the problem be the very large lists? – jezrael Mar 24 '23 at 06:42
  • 10K was quick, 40K a few minutes, 100K is taking over an hour, 300K threw above error. I could split the dataframe up I guess. I've just raised a more complex question: https://stackoverflow.com/questions/75831723/pandas-group-events-close-together-by-date-then-test-if-other-values-are-equal – DrWhat Mar 24 '23 at 09:07
  • @DrWhat - Do you need to check values per group by `Address`? – jezrael Mar 24 '23 at 09:12
  • I ask because if you run the solution per group it should be more efficient. But it would need testing per group in https://stackoverflow.com/questions/75831723/pandas-group-events-close-together-by-date-then-test-if-other-values-are-equal – jezrael Mar 24 '23 at 11:01
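A hypothetical sketch of that per-group idea, assuming a grouping column such as `Address` from the linked question (it is not present in the example frame here; `td` is the tolerance defined in the answer above):

# run the same window logic independently inside each Address group,
# so each row only scans its own group's rows instead of the whole frame
def relate(g):
    out = g.copy()
    out['related_event_id_list'] = [
        g.loc[g['event_time'].between(v - td, v + td), 'event_id'].drop(k).tolist()
        for k, v in g['event_time'].items()
    ]
    return out

df = df.groupby('Address', group_keys=False).apply(relate)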