I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience using these and thought it would be a good learning experience. I was able to complete the assignment using traditional loops that I know from traditional computer programming, and it ran okay over thousands of rows, but it brought my laptop down to a screeching halt once I let it process millions of rows. The assignment is outlined below.
You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):
cars
id start end
0 C1 200 215
1 C2 110 125
2 C3 240 255
...
trucks
id start end
0 T1 115 175
1 T2 200 260
2 T3 280 340
3 T4 25 85
...
The two dataframes above correspond to this:
start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.
The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:
- Back - it is in back of the car (
cars.end > trucks.end
) - Across - it is across from the car (
cars.start >= trucks.start and cars.end <= trucks.end
) - Front - it is in front of the car (
cars.start < trucks.start
)
I emphasized "up to" because if there is another car in back or front that is closer to the nearest truck, then this relationship is ignored. In the case of the illustration above, we can observe the following:
- C1: Back = T1, Across = T2, Front = none (C3 is blocking)
- C2: Back = T4, Across = none, Front = T1
- C3: Back = none (C1 is blocking), Across = T2, Front = T3
The final output needs to be appended to the cars
dataframe along with the following new columns:
- data cross-referenced from the
trucks
dataframe - for back positions, the gap distance (
cars.start - trucks.end
) - for front positions, the gap distance (
trucks.start - cars.end
)
The final cars
dataframe should look like this:
id start end back_id back_start back_end back_distance across_id across_start across_end front_id front_start front_end front_distance
0 C1 200 215 T1 115 175 25 T2 200 260
1 C2 110 125 T4 25 85 25 T1 115 175 -10
2 C3 240 255 T2 200 260 T3 280 340 25
Is pandas even the best tool for this task? If there is a better suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.