Pandas Merge On < Operator

Question

I have a df with flight times from A -> B/C and flights from B/C -> Z.

I'd like to find possible one-stop routings between A->Z. These would require a stop at B/C.

A simple merge does the trick.

routes = pd.merge(df , df , left_on = 'Destination' , right_on = 'Origin')

However, my schedule df has thousands of flights from A -> B/C/D/.../Y and thousands of flights from B/C/D/.../Y->Z. Merging this table with itself results in a routes table that is billions of rows long. I can filter to a small df by filtering to routings with a layover of <24 hours.

routes = routes[routes['Time_y'] - routes['Time_x'] < 24]

To first compute all possible routings and then filter to those with a layover < 24 hours is computationally infeasible. Is there any way to perform the pd.merge simultaneously with the condition that layover be < 24 hours? The pseudocode would be:

routes = pd.merge(df , df , left_on = 'Destination' , right_on = 'Origin' , right['Time'] - left['Time'] < 24)

Here's sample data:

df = pd.DataFrame({'Origin': {1: 'A', 6: 'A', 11: 'A', 16: 'A', 21: 'B', 26: 'B', 31: 'C', 36: 'C'}, 'Destination': {1: 'B', 6: 'B', 11: 'C', 16: 'C', 21: 'Z', 26: 'Z', 31: 'Z', 36: 'Z'}, 'Dept_Time': {1: pd.Timestamp('2019-03-30 17:31:00'), 6: pd.Timestamp('2019-05-16 17:32:00'), 11: pd.Timestamp('2019-04-01 08:30:00'), 16: pd.Timestamp('2019-06-09 08:20:00'), 21: pd.Timestamp('2019-07-26 08:31:00'), 26: pd.Timestamp('2019-03-31 06:16:00'), 31: pd.Timestamp('2019-07-03 23:52:00'), 36: pd.Timestamp('2019-03-27 17:31:00')}, 'Arrv_Time': {1: pd.Timestamp('2019-03-30 23:23:00'), 6: pd.Timestamp('2019-05-16 23:22:00'), 11: pd.Timestamp('2019-04-01 14:22:00'), 16: pd.Timestamp('2019-06-09 14:18:00'), 21: pd.Timestamp('2019-07-26 14:23:00'), 26: pd.Timestamp('2019-06-18 05:00:00'), 31: pd.Timestamp('2019-07-04 05:36:00'), 36: pd.Timestamp('2019-03-27 23:23:00')}})

To the best if my knowledge, there is no such operation. You can divide the table into N parts, merge each part with each part, and then filter and merge the results. This will take N^2 merges, but, with a carefully chosen N, may be computationally feasible. — DYZ, Feb 13 '19 at 20:42
@DYZ Thanks. I suspect that this will preserve memory but slow down the overall runtime. Runtime is what I'm looking to improve. I'll test and report. — Jack, Feb 13 '19 at 20:52
Don't think there's a way to add such a condition to the merge in pandas but the top answer [here](https://stackoverflow.com/questions/23508351/how-to-do-a-conditional-join-in-python-pandas) might help — kudeh, Feb 13 '19 at 21:14
@JoergVanAken Thanks - good idea. df covers 300 days. One problem I can think of is this will create layovers of 24-48 hours... let leg_1 = df[df['time'] == '2019-01-01'] and leg_2 = df[df['time'] == '2019-01-02']. pd.merge(leg_1 , leg_2) will create an itinerary with flight 1 leaving early 2019-01-01 with flight 2 leaving late 2019-01-02 which makes the layover between 24-48 hours. — Jack, Feb 13 '19 at 21:24
What is your date time types? Does it contain `HH:MM:SS`? Please post sample data directly in question. — Parfait, Feb 13 '19 at 21:38
@Parfait Timestamps. I just posted sample data. Notice that with the sample data there are 8 possible routings when layover time is ignored. When the constraint that layover time is < 24 hours, the only viable routing is A -> B departing @ 2019-03-30 17:31:00 and then B->Z departing @ 2019-03-31 06:16:00 — Jack, Feb 13 '19 at 22:06
@Jack, interested in how you solved this question. If you can as well, kindly add your solution, as well as the expected output — sammywemmy, Nov 02 '21 at 07:16

score 0 · Answer 1 · answered Feb 13 '19 at 22:08

Consider calculating a future (1-day ahead) date and add that to the merge with Origin/Destination:

from datetime import timedelta
...

df['Future_Time'] = df['Flight_Time'] + timedelta(days=1)

merged_df = pd.merge(df, df, 
                     left_on = ['Destination', 'Future_Time'],
                     right_on = ['Origin', 'Flight_Time'],
                     suffixes=['_o','_d'])

Assuming your flight times have time components affecting the merge joins, normalize the dates to last midnight before calculated future date and merge:

df['Normalized_Time'] = df['Flight_Time'].dt.normalize()
df['Future_Time'] = df['Normalized_Time'] + timedelta(days=1)

merged_df = pd.merge(df, df, 
                     left_on = ['Destination', 'Future_Time'],
                     right_on = ['Origin', 'Normalized_Time'],
                     suffixes=['_o','_d'])

Above may lead to the 24-48 hour pairing issue. But from there you can filter according to 24-hour span on a smaller set:

merged_df[merge_df['Flight_Time_d'] <= merge_df['Flight_Time_o'] + timedelta(days=1)]

Or all together:

merged_df = (pd.merge(df, df, 
                      left_on = ['Destination', 'Future_Time'],
                      right_on = ['Origin', 'Normalized_Time'],
                      suffixes=['_o','_d'])
               .query("Flight_Time_d <= Flight_Time_o + timedelta(days=1)")
            )

Pandas Merge On < Operator

1 Answers1