
I am looking to match two dataframes: times in one dataframe should fall within the time windows of the other.

Production Dataframe

Production Time      Product  Value  Worker_ID
2020-01-24 08:13:59  Prod4    5.9    402
2020-01-24 08:15:38  Prod5    5.7    402
2020-01-24 08:17:17  Prod4    5.1    402
2020-01-25 22:13:59  Prod4    5.9    402
2020-01-25 21:15:38  Prod7    5.7    402
2020-01-26 02:17:17  Prod2    5.1    402
2020-01-24 09:17:17  Prod4    5.1    403
2020-01-25 21:13:59  Prod5    5.9    403

Location Dataframe

Location  window_start         window_stop          Worker_ID
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402
Loc61     2020-01-24 05:00:00  2020-01-24 21:00:00  403
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  403

Results would look like this:

Location  window_start         window_stop          Worker_ID  Production Time      Product  Value
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:13:59  Prod4    5.9
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:15:38  Prod5    5.7
Loc16     2020-01-24 05:00:00  2020-01-24 21:00:00  402        2020-01-24 08:17:17  Prod4    5.1
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-25 22:13:59  Prod4    5.9
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-25 21:15:38  Prod7    5.7
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  402        2020-01-26 02:17:17  Prod2    5.1
Loc61     2020-01-24 05:00:00  2020-01-24 21:00:00  403        2020-01-24 09:17:17  Prod4    5.1
Loc27     2020-01-25 21:00:00  2020-01-26 05:00:00  403        2020-01-25 21:13:59  Prod5    5.9

The match is made first on Worker_ID, then on whether the production datetime falls within the location's datetime window.

This code works:

possible_matches = location_df.merge(production_df,on='Worker_ID',how='left')
build_df = possible_matches[(possible_matches['Production Time'] >= possible_matches['window_start']) & 
                            (possible_matches['Production Time'] <= possible_matches['window_stop'])]

But this does not work when there are millions of rows in the production dataframe and thousands of rows in the location dataframe: the intermediate merge on Worker_ID alone produces one row per (window, production) pair for each worker, which exhausts memory before the filter is ever applied.

I am looking for a more efficient way of doing this join that actually works on large datasets with many more workers and locations.

1 Answer

To avoid the crash, you may have to check your datetime columns before merging.
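A minimal check, as a sketch using the question's dataframe names (the conversion is a no-op if the columns are already datetime64):

import pandas as pd

# Make sure every time column really is datetime64, not strings
production_df['Production Time'] = pd.to_datetime(production_df['Production Time'])
for col in ('window_start', 'window_stop'):
    location_df[col] = pd.to_datetime(location_df[col])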

I generated two dataframes: 10,000 records for location and 5,000,000 for production.

import numpy as np
import pandas as pd

dti = pd.date_range('2020-01-01', '2021-01-01', freq='H', closed='left')

# 10,000 windows; randint(4, 17) is a scalar here, so all windows share one length
df2 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 10000)})
df2['window_start'] = np.random.choice(dti, len(df2))
df2['window_stop'] = df2['window_start'] + pd.DateOffset(hours=np.random.randint(4, 17))

# 5,000,000 production times drawn uniformly (as epoch seconds) from the
# overall span of the windows, then converted back to datetime64
df1 = pd.DataFrame({'Worker_ID': np.random.randint(100, 500, 5000000)})
df1['Production Time'] = pd.to_datetime(
    1e9 * np.random.randint(df2['window_start'].min().timestamp(),
                            df2['window_stop'].max().timestamp(),
                            len(df1)))
>>> df1
         Worker_ID     Production Time
0              263 2020-12-31 11:28:31
1              194 2020-09-19 04:57:17
2              139 2020-06-14 00:27:07
3              105 2020-04-14 02:45:05
4              484 2020-12-07 22:36:56
...            ...                 ...
4999995        338 2020-05-29 18:30:39
4999996        455 2020-03-03 20:51:27
4999997        228 2020-12-19 01:43:12
4999998        197 2020-04-07 07:32:13
4999999        304 2020-07-06 14:51:39

[5000000 rows x 2 columns]

>>> df2
      Worker_ID        window_start         window_stop
0           309 2020-10-07 18:00:00 2020-10-08 08:00:00
1           486 2020-01-24 19:00:00 2020-01-25 09:00:00
2           120 2020-11-05 10:00:00 2020-11-06 00:00:00
3           224 2020-04-08 15:00:00 2020-04-09 05:00:00
4           208 2020-01-08 23:00:00 2020-01-09 13:00:00
...         ...                 ...                 ...
9995        218 2020-01-10 00:00:00 2020-01-10 14:00:00
9996        358 2020-10-12 03:00:00 2020-10-12 17:00:00
9997        474 2020-12-25 03:00:00 2020-12-25 17:00:00
9998        416 2020-10-26 20:00:00 2020-10-27 10:00:00
9999        443 2020-03-31 09:00:00 2020-03-31 23:00:00

[10000 rows x 3 columns]
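One caveat before running the loop below: the positional lookups ptime[i.index] and wtime[j.index] assume both frames still carry their default RangeIndex (one commenter below had to reindex for exactly this reason). A quick guard, as a sketch:

# Reset to a default 0..n-1 index so .index can double as array positions
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)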
# from tqdm import tqdm

# Convert the datetime columns to int64 (ns since epoch) arrays so the
# window comparisons below are cheap NumPy operations
ptime = df1['Production Time'].astype(int).values
wtime = df2[['window_start', 'window_stop']].astype(int).values

data = []
# for wid in tqdm(df2['Worker_ID'].unique()):
for wid in df2['Worker_ID'].unique():
    # Restrict both frames to the current worker
    i = df1.loc[df1['Worker_ID'] == wid]
    j = df2.loc[df2['Worker_ID'] == wid]

    # For each production time p (position x within this worker's rows),
    # mark every window of this worker that contains p with x, else -1
    m = [np.where((wtime[j.index, 0] <= p) & (p <= wtime[j.index, 1]), x, -1)
             for x, p in enumerate(ptime[i.index])]
    # m[0]: positions into i (production rows); m[1]: positions into j (windows)
    m = np.where(np.array(m) >= 0)

    # Stitch matching window rows and production rows side by side
    df = pd.concat([j.iloc[m[1]].reset_index(drop=True),
                    i.iloc[m[0]].reset_index(drop=True)], axis='columns')

    data.append(df)
df = pd.concat(data)
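To spot-check the chunked result against the question's naive merge without exhausting memory, you can restrict both frames to a single worker first. A sketch, assuming data[0] still holds the matches for the first worker from the loop above:

wid = df2['Worker_ID'].unique()[0]
naive = df2[df2['Worker_ID'] == wid].merge(df1[df1['Worker_ID'] == wid],
                                           on='Worker_ID', how='left')
naive = naive[(naive['Production Time'] >= naive['window_start'])
              & (naive['Production Time'] <= naive['window_stop'])]
assert len(naive) == len(data[0])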

Old answer: Create an IntervalIndex to bind each production time to the corresponding window, then merge on Worker_ID and the interval:

ii = pd.IntervalIndex.from_tuples(list(zip(dfl['window_start'], dfl['window_stop'])),
                                  closed='left')  # left means >= and <

# Bin each production time into the (unique) window interval that contains it
dfp['interval'] = pd.cut(dfp['Production Time'], bins=ii)
dfl['interval'] = ii
>>> pd.merge(dfl, dfp, on=['Worker_ID', 'interval'], how='left') \
      .drop(columns='interval')

  Location        window_start         window_stop  Worker_ID     Production Time Product  Value
0    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:13:59   Prod4    5.9
1    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:15:38   Prod5    5.7
2    Loc16 2020-01-24 05:00:00 2020-01-24 21:00:00        402 2020-01-24 08:17:17   Prod4    5.1
3    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-25 22:13:59   Prod4    5.9
4    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-25 21:15:38   Prod7    5.7
5    Loc27 2020-01-25 21:00:00 2020-01-26 05:00:00        402 2020-01-26 02:17:17   Prod2    5.1
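As the comments below point out, this old answer breaks down as soon as location windows overlap (e.g. two workers sharing a shift), because pd.cut rejects an overlapping IntervalIndex as bins. A minimal sketch of the failure:

ii = pd.IntervalIndex.from_tuples(
    [(pd.Timestamp('2020-01-24 05:00'), pd.Timestamp('2020-01-24 21:00')),
     (pd.Timestamp('2020-01-24 05:00'), pd.Timestamp('2020-01-24 21:00'))],
    closed='left')
# Raises ValueError: overlapping IntervalIndex is not accepted
pd.cut(pd.Series([pd.Timestamp('2020-01-24 08:13:59')]), bins=ii)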
– Corralien
  • I think the OP wants >= and <=; you probably should set closed to `both` – sammywemmy Jun 26 '21 at 22:36
  • This is great! My example however, left out an important aspect of the real data. The location log has overlapping windows (many workers) so, the interval index is unhappy. – lowjumpingfrog Jun 27 '21 at 23:08
  • Kindly share more data with the overlapping windows – sammywemmy Jun 29 '21 at 13:19
  • I updated the tables above. Essentially it is adding another worker that is working a shift with the same time windows. – lowjumpingfrog Jun 30 '21 at 22:29
  • Wow that is fast!! Yes, that works. I think you meant to include the datetime index line: dti = pd.date_range("2020-01-01", periods=10000, freq="H"), and the filter for df1 and df2 in the worker_id loop for i & j would be == wid. Thanks, this is great! – lowjumpingfrog Jul 02 '21 at 00:19
  • Yes, and your comment about checking the data frames before merging is also pertinent. I needed to reindex one of the data frames. – lowjumpingfrog Jul 02 '21 at 00:25
  • I fixed my code. I had no more attention yesterday :-) – Corralien Jul 02 '21 at 04:34
  • One correction for testing the matches: m = np.where(np.array(m) > 0) Should be m = np.where(np.array(m) >= 0) Match can happen on the first element in the list. – lowjumpingfrog Jul 10 '21 at 18:55