
I am looking to match two dataframes: times in one dataframe should fall within the time windows of the other.

Production Dataframe

Production Time      Product  Value  Worker_ID
2020-01-24 08:13:59  Prod4    5.9    402
2020-01-24 08:15:38  Prod5    5.7    402
2020-01-24 08:17:17  Prod4    5.1    402
2020-01-25 22:13:59  Prod4    5.9    402
2020-01-25 21:15:38  Prod7    5.7    402
2020-01-26 02:17:17  Prod2    5.1    402
2020-01-24 09:17:17  Prod4    5.1    403
2020-01-25 21:13:59  Prod5    5.9    403

Location Dataframe

Location  window_start         window_stop          Worker_ID
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402
Loc61     2020-01-24 05:00:00  2020-01-24 21:00:00  403
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  403

Results would look like this:

Location  window_start         window_stop          Worker_ID  Production Time      Product  Value
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:13:59  Prod4    5.9
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:15:38  Prod5    5.7
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:17:17  Prod4    5.1
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-25 22:13:59  Prod4    5.9
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-25 21:15:38  Prod7    5.7
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-26 02:17:17  Prod2    5.1
Loc61     2020-01-24 05:00:00  2020-01-24 21:00:00  403        2020-01-24 09:17:17  Prod4    5.1
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  403        2020-01-25 21:13:59  Prod5    5.9

The match is made first on Worker_ID, then on whether the production datetime falls within the location's datetime window.

This code works:

possible_matches = location_df.merge(production_df,on='Worker_ID',how='left')
build_df = possible_matches[(possible_matches['Production Time'] >= possible_matches['window_start']) & 
                            (possible_matches['Production Time'] <= possible_matches['window_stop'])]

But this does not work when there are millions of rows in the production dataframe and thousands of rows in the location dataframe: the intermediate merge on Worker_ID alone produces one row per (window, production) pair for each worker, which exhausts memory before the filter is ever applied.

I am looking for a more efficient way of doing this join that actually works on large datasets with many more workers and locations.

1 Answer

To avoid the crash, you may have to check your datetime columns before merging.
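A minimal check, as a sketch using the question's dataframe names (the conversion is a no-op if the columns are already datetime64):

import pandas as pd

# Make sure every time column really is datetime64, not strings
production_df['Production Time'] = pd.to_datetime(production_df['Production Time'])
for col in ('window_start', 'window_stop'):
    location_df[col] = pd.to_datetime(location_df[col])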

I generated two dataframes: 10,000 records for location and 5,000,000 for production.

import numpy as np
import pandas as pd

dti = pd.date_range('2020-01-01', '2021-01-01', freq='H', closed='left')

# 10,000 windows; randint(4, 17) is a scalar here, so all windows share one length
df2 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 10000)})
df2['window_start'] = np.random.choice(dti, len(df2))
df2['window_stop'] = df2['window_start'] + pd.DateOffset(hours=np.random.randint(4, 17))

# 5,000,000 production times drawn uniformly (as epoch seconds) from the
# overall span of the windows, then converted back to datetime64
df1 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 5000000)})
df1['Production Time'] = pd.to_datetime(
    1e9 * np.random.randint(df2['window_start'].min().timestamp(),
                            df2['window_stop'].max().timestamp(),
                            len(df1)))
>>> df1
         Worker_ID     Production Time
0              263 2020-12-31 11:28:31
1              194 2020-09-19 04:57:17
2              139 2020-06-14 00:27:07
3              105 2020-04-14 02:45:05
4              484 2020-12-07 22:36:56
...            ...                 ...
4999995        338 2020-05-29 18:30:39
4999996        455 2020-03-03 20:51:27
4999997        228 2020-12-19 01:43:12
4999998        197 2020-04-07 07:32:13
4999999        304 2020-07-06 14:51:39

[5000000 rows x 2 columns]

>>> df2
      Worker_ID        window_start         window_stop
0           309 2020-10-07 18:00:00 2020-10-08 08:00:00
1           486 2020-01-24 19:00:00 2020-01-25 09:00:00
2           120 2020-11-05 10:00:00 2020-11-06 00:00:00
3           224 2020-04-08 15:00:00 2020-04-09 05:00:00
4           208 2020-01-08 23:00:00 2020-01-09 13:00:00
...         ...                 ...                 ...
9995        218 2020-01-10 00:00:00 2020-01-10 14:00:00
9996        358 2020-10-12 03:00:00 2020-10-12 17:00:00
9997        474 2020-12-25 03:00:00 2020-12-25 17:00:00
9998        416 2020-10-26 20:00:00 2020-10-27 10:00:00
9999        443 2020-03-31 09:00:00 2020-03-31 23:00:00

[10000 rows x 3 columns]
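One caveat before running the loop below: the positional lookups ptime[i.index] and wtime[j.index] assume both frames still carry their default RangeIndex (one commenter below had to reindex for exactly this reason). A quick guard, as a sketch:

# Reset to a default 0..n-1 index so .index can double as array positions
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)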
# from tqdm import tqdm

# Convert the datetime columns to int64 (ns since epoch) arrays so the
# window comparisons below are cheap NumPy operations
ptime = df1['Production Time'].astype(int).values
wtime = df2[['window_start', 'window_stop']].astype(int).values

data = []
# for wid in tqdm(df2['Worker_ID'].unique()):
for wid in df2['Worker_ID'].unique():
    # Restrict both frames to the current worker
    i = df1.loc[df1['Worker_ID'] == wid]
    j = df2.loc[df2['Worker_ID'] == wid]

    # For each production time p (position x within this worker's rows),
    # mark every window of this worker that contains p with x, else -1
    m = [np.where((wtime[j.index, 0] <= p) & (p <= wtime[j.index, 1]), x, -1)
             for x, p in enumerate(ptime[i.index])]
    # m[0]: positions into i (production rows); m[1]: positions into j (windows)
    m = np.where(np.array(m) >= 0)

    # Stitch matching window rows and production rows side by side
    df = pd.concat([j.iloc[m[1]].reset_index(drop=True),
                    i.iloc[m[0]].reset_index(drop=True)], axis='columns')

    data.append(df)
df = pd.concat(data)
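To spot-check the chunked result against the question's naive merge without exhausting memory, you can restrict both frames to a single worker first. A sketch, assuming data[0] still holds the matches for the first worker from the loop above:

wid = df2['Worker_ID'].unique()[0]
naive = df2[df2['Worker_ID'] == wid].merge(df1[df1['Worker_ID'] == wid],
                                           on='Worker_ID', how='left')
naive = naive[(naive['Production Time'] >= naive['window_start'])
              & (naive['Production Time'] <= naive['window_stop'])]
assert len(naive) == len(data[0])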

Old answer: Create an IntervalIndex to bind each production time to the corresponding window, then merge on Worker_ID and the interval:

ii = pd.IntervalIndex.from_tuples(list(zip(dfl['window_start'], dfl['window_stop'])),
                                  closed='left')  # left means >= and <

# Bin each production time into the (unique) window interval that contains it
dfp['interval'] = pd.cut(dfp['Production Time'], bins=ii)
dfl['interval'] = ii
>>> pd.merge(dfl, dfp, on=['Worker_ID', 'interval'], how='left') \
      .drop(columns='interval')

  Location        window_start         window_stop  Worker_ID     Production Time Product  Value
0    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:13:59   Prod4    5.9
1    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:15:38   Prod5    5.7
2    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:17:17   Prod4    5.1
3    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-25 22:13:59   Prod4    5.9
4    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-25 21:15:38   Prod7    5.7
5    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-26 02:17:17   Prod2    5.1
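As the comments below point out, this old answer breaks down as soon as location windows overlap (e.g. two workers sharing a shift), because pd.cut rejects an overlapping IntervalIndex as bins. A minimal sketch of the failure:

ii = pd.IntervalIndex.from_tuples(
    [(pd.Timestamp('2020-01-24 05:00'), pd.Timestamp('2020-01-24 21:00')),
     (pd.Timestamp('2020-01-24 05:00'), pd.Timestamp('2020-01-24 21:00'))],
    closed='left')
# Raises ValueError: overlapping IntervalIndex is not accepted
pd.cut(pd.Series([pd.Timestamp('2020-01-24 08:13:59')]), bins=ii)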
– Corralien
  • I think the OP wants >= and <=; you probably should set closed to `both` – sammywemmy Jun 26 '21 at 22:36
  • This is great! My example however, left out an important aspect of the real data. The location log has overlapping windows (many workers) so, the interval index is unhappy. – lowjumpingfrog Jun 27 '21 at 23:08
  • Kindly share more data with the overlapping windows – sammywemmy Jun 29 '21 at 13:19
  • I updated the tables above. Essentially it is adding another worker that is working a shift with the same time windows. – lowjumpingfrog Jun 30 '21 at 22:29
  • Wow that is fast!! Yes, that works. I think you meant to include the datetime index line: dti = pd.date_range("2020-01-01", periods=10000, freq="H"), and the filter for df1 and df2 in the worker_id loop for i & j would be == wid. Thanks, this is great! – lowjumpingfrog Jul 02 '21 at 00:19
  • Yes, and your comment about checking the data frames before merging is also pertinent. I needed to reindex one of the data frames. – lowjumpingfrog Jul 02 '21 at 00:25
  • I fixed my code. I had no more attention yesterday :-) – Corralien Jul 02 '21 at 04:34
  • One correction for testing the matches: m = np.where(np.array(m) > 0) Should be m = np.where(np.array(m) >= 0) Match can happen on the first element in the list. – lowjumpingfrog Jul 10 '21 at 18:55