Looking to match two data frames using times in one dataframe that fall into time windows of another dataframe.
Production Dataframe
Production Time | Product | Value | Worker_ID |
---|---|---|---|
2020-01-24 08:13:59 | Prod4 | 5.9 | 402 |
2020-01-24 08:15:38 | Prod5 | 5.7 | 402 |
2020-01-24 08:17:17 | Prod4 | 5.1 | 402 |
2020-01-25 22:13:59 | Prod4 | 5.9 | 402 |
2020-01-25 21:15:38 | Prod7 | 5.7 | 402 |
2020-01-26 02:17:17 | Prod2 | 5.1 | 402 |
2020-01-24 09:17:17 | Prod4 | 5.1 | 403 |
2020-01-25 21:13:59 | Prod5 | 5.9 | 403 |
Location Dataframe
Location | window_start | window_stop | Worker_ID |
---|---|---|---|
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 |
Loc61 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 403 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 403 |
Results would look like this:
Location | window_start | window_stop | Worker_ID | Production Time | Product | Quality |
---|---|---|---|---|---|---|
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:13:59 | Prod4 | 5.9 |
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:15:38 | Prod5 | 5.7 |
Loc16 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 402 | 2020-01-24 08:17:17 | Prod4 | 5.1 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-25 22:13:59 | Prod4 | 5.9 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-25 21:15:38 | Prod7 | 5.7 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 402 | 2020-01-26 02:17:17 | Prod2 | 5.1 |
Loc61 | 2020-01-24 05:00:00 | 2020-01-24 21:00:00 | 403 | 2020-01-24 09:17:17 | Prod4 | 5.1 |
Loc27 | 2020-01-25 21:00:00 | 2020-01-26 05:00:00 | 403 | 2020-01-25 21:13:59 | Prod5 | 5.9 |
Where the match is made first on Worker_ID then where the Production datetime falls in the datetime window of the the location.
This code works:
possible_matches = location_df.merge(production_df,on='Worker_ID',how='left')
build_df = possible_matches[(possible_matches['Production Time'] >= possible_matches['window_start']) &
(possible_matches['Production Time'] <= possible_matches['window_stop'])]
But does not work when there are millions of rows in the production dataframe and thousands of rows in the location dataframe.
Looking for a more efficient way of doing this join that actually works with large datasets with more workers and locations.