
The problem: group together events that occur close to each other in time and that also share another variable. For example, given the date of disease onset and an address, find disease outbreaks that occur at the same location within a specified timeframe of each other. The dataframe is large (about 300K rows). Example data:

import pandas as pd

df = pd.DataFrame(
    [
        ['2020-01-01 10:00', '1', 'A'],
        ['2020-01-01 10:01', '2', 'A'],
        ['2020-01-01 10:02', '3a', 'A'],
        ['2020-01-01 10:02', '3b', 'A'],
        ['2020-01-02 10:03', '4', 'B'],
        ['2020-01-02 10:50', '5', 'B'],
        ['2020-01-02 10:54', '6', 'B'],
        ['2020-01-02 10:55', '7', 'B'],
    ], columns=['event_time', 'event_id', 'Address']
)

The output should have rows with the first and last event time, a list of the events, and the address:

     event_time_start       event_time_end events_and_related_event_id_list Address
0 2020-01-01 10:00:00  2020-01-01 10:02:00                       [1, 2, 3a]       A
6 2020-01-02 10:54:00  2020-01-02 10:55:00                           [6, 7]       B

EDITED - to clarify - SOLUTION

The solution by jezrael to match dates within a specified number of days before or after a date is based on a similar approach from another thread, but adds a groupby on Address. This first step works perfectly on the real data without modification. It is not changed below, except that some values are named for clarity.

The second step did not work because, unlike the example data, the real data contained non-continuous and non-sequential events. This required: sorting the first output by Address and event_time; different logic for the boolean series that groups event_times together (m / timeGroup_bool); and removing the boolean series as the dataframe filter before the GroupBy.agg.

Here is the full solution, with modifications and clarifications, based on jezrael's simply awesome response (the f1 lambda, which collects all values from the grouped lists, is best explained here):

df = pd.DataFrame(
    [
        ['1', 'A', '2020-01-01 10:00'],
        ['2', 'B', '2020-01-01 10:01'],
        ['3', 'A', '2020-01-01 10:01'],
        ['4', 'C', '2020-01-01 10:02'],
        ['5', 'D', '2020-01-01 10:03'],
        ['6', 'A', '2020-01-01 10:03'],
        ['7', 'E', '2020-01-01 10:03'],
        ['8', 'A', '2020-01-01 10:07'],
        ['9', 'A', '2020-01-01 10:09'],
        ['10', 'A', '2020-01-01 10:11'],
        ['11', 'F', '2020-01-01 10:54'],
        ['12', 'G', '2020-01-01 10:55'],
        ['13', 'F', '2020-01-01 10:56'],
    ], columns=['id', 'Address', 'event_time']
)
df['event_time'] = pd.to_datetime(df['event_time'])
df = df.sort_values(by=["Address", "event_time"])

## group by address and surrounding time
timeDiff = pd.Timedelta("2m")       # time span between related events
def idsNearDates(mDf):
    # for each event, list the ids of other events at the same Address within timeDiff
    f = lambda idx, val: mDf.loc[mDf['event_time'].between(val - timeDiff, val + timeDiff),
                                 'id'].drop(idx).tolist()
    mDf['relatedIds'] = [f(idx, val) for idx, val in mDf['event_time'].items()]
    return mDf
df_1stStep  = df.groupby('Address').apply(idsNearDates).sort_values(by=["Address", 'event_time'])

## aggregate the initial output into a single row per group of related events
# mark where event times are too far apart
timeGroup_bool = ~(df_1stStep['event_time'].between(df_1stStep['event_time'].shift(1) - timeDiff,
                                                    df_1stStep['event_time'].shift(1) + timeDiff))
# create a single list from all grouped lists
f1 = lambda x: list(dict.fromkeys([value for idList in x for value in idList]))
df_2ndstep = (df_1stStep.groupby([(timeGroup_bool).cumsum(),'Address'])
                   .agg(Date_first=('event_time','min'),
                        Date_last=('event_time','max'),
                        Ids=('relatedIds',f1))
                   .droplevel(0)
                   .reset_index())
# get rid of rows with empty lists
df_2ndstep = df_2ndstep[df_2ndstep['Ids'].str.len() > 0]
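
For readers unfamiliar with the cumsum trick behind timeGroup_bool, here is a minimal, standalone sketch (with made-up times, not the real data) of how a boolean "new group starts here" marker becomes group ids:

import pandas as pd

s = pd.Series(pd.to_datetime(['2020-01-01 10:00', '2020-01-01 10:01',
                              '2020-01-01 10:30', '2020-01-01 10:31']))
td = pd.Timedelta('2m')
# True where an event is NOT within td of the previous one, i.e. a new group starts
gap = ~s.between(s.shift(1) - td, s.shift(1) + td)
print(gap.cumsum().tolist())   # [1, 1, 2, 2] -> two groups: rows 0-1 and rows 2-3

Grouping by this cumsum together with Address is what splits the first-step output into separate clusters of related events.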
DrWhat

2 Answers


You can use numpy broadcasting to do the aggregation operation:

import numpy as np
import pandas as pd

def find_related_event(df):
    evt = df['event_time'].values
    # pairwise time differences via broadcasting; True where two events are within 1 minute
    out = np.abs(evt[:, None] - evt) <= pd.Timedelta('1m')
    out[np.diag_indices(out.shape[0])] = False   # an event does not count as related to itself
    df1 = df.loc[out.any(axis=1)]
    return pd.Series({'index': df1.index[0],
        'event_time_start': df1['event_time'].iloc[0],
        'event_time_stop': df1['event_time'].iloc[-1],
        'events_and_related_event_id_list': df1['event_id'].tolist()
    })

out = (df.groupby('Address', as_index=False).apply(find_related_event)
         .set_index('index').rename_axis(None))

Output:

>>> out
  Address    event_time_start     event_time_stop events_and_related_event_id_list
0       A 2020-01-01 10:00:00 2020-01-01 10:02:00                       [1, 2, 3a]
6       B 2020-01-01 10:54:00 2020-01-01 10:55:00                           [6, 7]
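
To see what the broadcasting step produces, here is a tiny self-contained illustration with made-up times (not part of the answer's code):

import numpy as np
import pandas as pd

evt = pd.to_datetime(['2020-01-01 10:00', '2020-01-01 10:01', '2020-01-01 10:05']).values
diff = np.abs(evt[:, None] - evt)        # 3x3 matrix of pairwise time gaps
mask = diff <= pd.Timedelta('1m')        # True where two events are within 1 minute
mask[np.diag_indices(3)] = False         # drop self-comparisons
print(mask.any(axis=1))                  # [ True  True False] -> the 10:05 event has no neighbour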

Alternative

def find_related_event(evt):
    out = np.abs(evt.values[:, None] - evt.values) <= pd.Timedelta('1m')
    out[np.diag_indices(out.shape[0])] = False
    return out.any(axis=1)

m = df.groupby('Address')['event_time'].transform(find_related_event)
out = df.loc[m].groupby('Address', as_index=False).agg(
            event_time_start=('event_time', 'first'),
            event_time_stop=('event_time', 'last'),
            events_and_related_event_id_list=('event_id', list)
      )
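
The design difference here is that transform returns a boolean mask aligned with the original rows, so the frame can be filtered once before a single agg. A minimal sketch of that pattern, with a purely hypothetical frame and condition:

import pandas as pd

d = pd.DataFrame({'Address': ['A', 'A', 'B'], 'v': [1, 100, 10]})
# stand-in condition for "has a related event": value below its group's mean
m = d.groupby('Address')['v'].transform(lambda s: s < s.mean())
print(d.loc[m])   # keeps only row 0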
Corralien
  • Yes, nice. I have to solve a small issue with the iloc reference in the first alternative, because my data is structured differently, but it certainly works. The second alternative works, but with my real data, I get different results to jezrael (his has more rows). I have to sleep for 2 days to work out what is happening. Thanks very much for this amazing code. – DrWhat Mar 24 '23 at 14:05

You can use this solution per group:

# solution matching dates within range of date in row by jezrael
df['event_time'] = pd.to_datetime(df['event_time'])

def f(g):
    td = pd.Timedelta("1m")
    f = lambda x, y: g.loc[g['event_time'].between(y - td, y + td),
                            'event_id'].drop(x).tolist()
    g['related_event_id_list'] = [f(k, v) for k, v in g['event_time'].items()]
    return g

df  = df.groupby('Address').apply(f)
print (df)
           event_time event_id Address related_event_id_list
0 2020-01-01 10:00:00        1       A                   [2]
1 2020-01-01 10:01:00        2       A               [1, 3a]
2 2020-01-01 10:02:00       3a       A                   [2]
3 2020-01-01 10:02:00       3b       B                    []
4 2020-01-01 10:30:00        4       B                    []
5 2020-01-01 10:50:00        5       B                    []
6 2020-01-01 10:54:00        6       B                   [7]
7 2020-01-01 10:55:00        7       B                   [6]

For the next step, use GroupBy.agg over groups created by consecutive non-empty list values in the related_event_id_list column:

m = df['related_event_id_list'].astype(bool)   # True where the list is non-empty

f1 = lambda x: list(dict.fromkeys([z for y in x for z in y]))   # flatten lists, drop duplicates, keep order

df = (df[m].groupby([(~m).cumsum(),'Address'])
           .agg(event_time_start=('event_time','min'),
                event_time_end=('event_time','max'),
                events_and_related_event_id_list=('related_event_id_list',f1))
           .droplevel(0)
           .reset_index())
print (df)
  Address    event_time_start      event_time_end  \
0       A 2020-01-01 10:00:00 2020-01-01 10:02:00   
1       B 2020-01-01 10:54:00 2020-01-01 10:55:00   

  events_and_related_event_id_list  
0                       [2, 1, 3a]  
1                           [7, 6]  
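
A quick illustration (with made-up lists) of what the f1 lambda does, for anyone puzzling over the nested comprehension:

f1 = lambda x: list(dict.fromkeys([z for y in x for z in y]))
print(f1([['2'], ['1', '3a'], ['2']]))   # ['2', '1', '3a'] -- flattened, de-duplicated, order preserved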
jezrael
    Awesome. Totally works with the example. Applying solution to my data, it managed 124K rows quickly, but the original Id is not added to the final list - events_and_related_event_id_list. I guess f1 is where this is happening. I might need some sleep to work it out. Thank you!!! – DrWhat Mar 24 '23 at 13:34