
I have an assignment that can be done using any programming language. I chose Python and pandas since I have little experience with them and thought it would be a good learning experience. I was able to complete the assignment using the traditional loops I know from conventional programming, and it ran okay over thousands of rows, but it brought my laptop to a screeching halt once I let it process millions of rows. The assignment is outlined below.

You have a two-lane road on a two-dimensional plane. One lane is for cars and the other lane is reserved for trucks. The data looks like this (spanning millions of rows for each table):

cars

   id  start  end
0  C1    200  215
1  C2    110  125
2  C3    240  255
...

trucks

   id  start  end
0  T1    115  175
1  T2    200  260
2  T3    280  340
3  T4     25   85
...

The two dataframes above correspond to this illustration:

[Figure: "Road" — the cars and trucks drawn at their positions along the two lanes]

start and end columns represent arbitrary positions on the road, where start = the back edge of the vehicle and end = the front edge of the vehicle.

The task is to identify the trucks closest to every car. A truck can have up to three different relationships to a car:

  1. Back - it is in back of the car (cars.end > trucks.end)
  2. Across - it is across from the car (cars.start >= trucks.start and cars.end <= trucks.end)
  3. Front - it is in front of the car (cars.start < trucks.start)

I emphasized "up to" because if there is another car in back or in front that is closer to the nearest truck, then that relationship is ignored. In the illustration above, we can observe the following (the three conditions are also sketched as code after this list):

  • C1: Back = T1, Across = T2, Front = none (C3 is blocking)
  • C2: Back = T4, Across = none, Front = T1
  • C3: Back = none (C1 is blocking), Across = T2, Front = T3
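
For a single car/truck pair, the three relationship conditions can be written directly; here is a minimal sketch (the `classify` helper is purely illustrative and ignores the blocking rule described above):

def classify(car_start, car_end, truck_start, truck_end):
    # Back: the truck's front edge ends behind the car's front edge
    if car_end > truck_end:
        return 'back'
    # Across: the car sits entirely within the truck's span
    if car_start >= truck_start and car_end <= truck_end:
        return 'across'
    # Front: the truck starts beyond the car's back edge
    if car_start < truck_start:
        return 'front'
    return None

# e.g. classify(200, 215, 115, 175) -> 'back' (C1 vs T1)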

The final output needs to be the cars dataframe with the following new columns appended:

  • data cross-referenced from the trucks dataframe
  • for back positions, the gap distance (cars.start - trucks.end)
  • for front positions, the gap distance (trucks.start - cars.end)

The final cars dataframe should look like this:

   id  start  end  back_id  back_start  back_end  back_distance  across_id  across_start  across_end  front_id  front_start  front_end  front_distance
0  C1    200  215       T1         115       175             25         T2           200         260
1  C2    110  125       T4          25        85             25                                             T1          115        175             -10
2  C3    240  255                                                       T2           200         260        T3          280        340              25

Is pandas even the best tool for this task? If there is a better suited tool that is efficient at cross-referencing and appending columns based on some calculation across millions of rows, then I am all ears.

  • If you already have a good `for` loop to do the job, have a look at the Numba library, which could speed up your code. In pandas, [merge_asof](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html) could help you: the three different directions you can pass as a parameter map to your 3 relationships – Ben.T Mar 30 '20 at 01:24

1 Answer


So with pandas you can use `merge_asof`; here is one way, though maybe not the most efficient with millions of rows.
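
For a runnable example, first recreate the sample tables from the question:

import pandas as pd

# sample data from the question
cars = pd.DataFrame({'id': ['C1', 'C2', 'C3'],
                     'start': [200, 110, 240],
                     'end': [215, 125, 255]})

trucks = pd.DataFrame({'id': ['T1', 'T2', 'T3', 'T4'],
                       'start': [115, 200, 280, 25],
                       'end': [175, 260, 340, 85]})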

#first sort values (merge_asof needs the join keys sorted; assuming vehicles
#in a lane don't overlap, sorting by start keeps end sorted too)
trucks = trucks.sort_values(['start'])
cars = cars.sort_values(['start'])

#create back condition: for each truck, direction='forward' matches the first
#car whose end is at or past the truck's end; the query then keeps only
#strict matches (cars.end > trucks.end)
df_back = pd.merge_asof(trucks.rename(columns={col:f'back_{col}' 
                                               for col in trucks.columns}), 
                        cars.assign(back_end=lambda x: x['end']), 
                        on='back_end', direction='forward')\
            .query('end>back_end')\
            .assign(back_distance=lambda x: x['start']-x['back_end'])

#create across condition: here note that cars is the first of the 2 dataframes,
#so direction='backward' matches, for each car, the last truck starting at or
#before the car's start; the query keeps cars fully covered by the truck
df_across = pd.merge_asof(cars.assign(across_start=lambda x: x['start']),
                          trucks.rename(columns={col:f'across_{col}' 
                                                 for col in trucks.columns}), 
                          on='across_start', direction='backward')\
              .query('end<=across_end')

#create front condition: for each truck, direction='backward' matches the last
#car starting at or before the truck's start; the query then keeps only
#strict matches (cars.start < trucks.start)
df_front = pd.merge_asof(trucks.rename(columns={col:f'front_{col}' 
                                                for col in trucks.columns}), 
                         cars.assign(front_start=lambda x: x['start']), 
                         on='front_start', direction='backward')\
             .query('start<front_start')\
             .assign(front_distance=lambda x: x['front_start']-x['end'])

# merge all back to cars (left joins on the shared id, start, end columns)
df_f = cars.merge(df_back, how='left')\
           .merge(df_across, how='left')\
           .merge(df_front, how='left')

and you get

print(df_f)
   id  start  end back_id  back_start  back_end  back_distance  across_start  \
0  C2    110  125      T4        25.0      85.0           25.0           NaN   
1  C1    200  215      T1       115.0     175.0           25.0         200.0   
2  C3    240  255     NaN         NaN       NaN            NaN         240.0   

  across_id  across_end front_id  front_start  front_end  front_distance  
0       NaN         NaN       T1        115.0      175.0           -10.0  
1        T2       260.0      NaN          NaN        NaN             NaN  
2        T2       260.0       T3        280.0      340.0            25.0  
  • Ben.T, thanks so much for this. Your solution is so much more concise than mine. My solution involved looping through the arrays, using `find_nearest()` from https://codereview.stackexchange.com/a/239946/221419 to find the nearest car and truck values for each item, then creating new columns from there. I'm finding out that loops are bad when you are dealing with millions of rows. I haven't timed it yet, but I have a feeling yours would be faster because it avoids loops. I will try to implement ideas from https://blog.paperspace.com/faster-numpy-array-processing-ndarray-cython/ to improve on your solution. – thdoan Apr 05 '20 at 01:07
  • I'm still learning about `pd.merge_asof()` -- one thing I can't figure out is why, when I print the final dataframe, the `across_id` column comes after the `across_start` column. The back and front columns are fine; it's just the across condition that is not in the expected order. – thdoan Apr 05 '20 at 01:18
  • @thdoan I would say the difference in column order is because in front and back, `trucks` is the left df and `cars` the right one in the `merge_asof`, while for across it is the opposite – Ben.T Apr 05 '20 at 01:58
  • @thdoan regarding the function `find_nearest`: with a quick look, it might be possible to vectorize it; maybe what is called `number` could also be an array, and then you could use [`ufunc.outer`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.outer.html). That said, with both dataframes having millions of rows, it won't be possible because the result would be too big for standard computer memory. – Ben.T Apr 05 '20 at 02:06
  • I don't have experience with vectorization -- would it be possible to vectorize if the `cars` and `trucks` tables are not the same? Sorry, I forgot to mention that in the real data set these two tables have different numbers of rows as well as different columns (although they do share `id`, `start`, and `end`). Different columns can be rectified by creating new dataframes with the common columns, but I'm not sure about the different number of rows (I had the impression that in order to vectorize you need the same number of rows, since it's similar to a matrix operation). – thdoan Apr 05 '20 at 20:28
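
To illustrate the `ufunc.outer` idea from the last two comments: equal lengths are not required, since the result is a len(a) x len(b) matrix. A minimal sketch using the sample data (and keeping in mind the memory caveat above):

import numpy as np

# gap between every car start and every truck end; the two arrays
# have different lengths, and the result has one row per car and
# one column per truck
car_starts = np.array([200, 110, 240])       # C1, C2, C3
truck_ends = np.array([175, 260, 340, 85])   # T1, T2, T3, T4

gaps = np.subtract.outer(car_starts, truck_ends)
print(gaps.shape)  # (3, 4)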