
I have N dataframes where two of the columns hold longitude and latitude data, tracking the motion of a car. The general trace of the car is the same for all dataframes, but since the tracking sometimes starts a bit late or ends a bit early, the lengths of the dataframes differ.

I want the dataframes to "line up", i.e. to trim the rows that correspond to non-overlapping positional data. I expect the result to be N dataframes of equal length s.t. the positional data is the same for all dataframes.

Example

Three arbitrary dataframes look like this:

time     speed     longitude      latitude
 t00       v00         19.70         48.67
 t01       v01         19.71         48.65
 t02       v02         19.72         48.64
 t03       v03         19.73         48.64
 t04       v04         19.74         48.63
 t05       v05         19.74         48.63
 t06       v06         19.75         48.64
 t07       v07         19.75         48.64
 t08       v08         19.75         48.64
 t09       v09         19.75         48.64


time     speed     longitude      latitude
 t10       v10         19.72         48.64
 t11       v11         19.73         48.64
 t12       v12         19.74         48.63
 t13       v13         19.74         48.63
 t14       v14         19.75         48.64
 t15       v15         19.75         48.64
 t16       v16         19.75         48.64



time     speed     longitude      latitude
 t20       v20         19.72         48.64
 t21       v21         19.73         48.64
 t22       v22         19.74         48.63
 t23       v23         19.74         48.63
 t24       v24         19.75         48.64
 t25       v25         19.75         48.63
 t26       v26         19.75         48.64
 t27       v27         19.75         48.64
 t28       v28         19.75         48.64

The result should be three new data frames:

time     speed     longitude      latitude
 t02       v02         19.72         48.64
 t03       v03         19.73         48.64
 t04       v04         19.74         48.63
 t05       v05         19.74         48.63
 t06       v06         19.75         48.64


time     speed     longitude      latitude
 t10       v10         19.72         48.64
 t11       v11         19.73         48.64
 t12       v12         19.74         48.63
 t13       v13         19.74         48.63
 t14       v14         19.75         48.64


time     speed     longitude      latitude
 t20       v20         19.72         48.64
 t21       v21         19.73         48.64
 t22       v22         19.74         48.63
 t23       v23         19.74         48.63
 t24       v24         19.75         48.64

In reality, the number of overlapping coordinates will be higher, but I hope this shows the gist of it.


I found this post, where the intersection of two lists is retrieved. I tried to extract the positional data from each dataframe and then keep only the rows whose coordinates occur in all dataframes, but this fails because the number of rows differs between the dataframes.

My current code is shown below:

import collections

import numpy as np

first_route = True

for route in routes:  # extract each route's coordinates
    lon = route["longitude"].values.tolist()
    lat = route["latitude"].values.tolist()
    if first_route:  # always keep the first route
        cropped_lon = lon
        cropped_lat = lat
        first_route = False
        continue
    old_lon = collections.Counter(cropped_lon)
    old_lat = collections.Counter(cropped_lat)
    new_lon = collections.Counter(lon)
    new_lat = collections.Counter(lat)
    cropped_lon = list((old_lon & new_lon).elements())
    cropped_lat = list((old_lat & new_lat).elements())

cropped_lon = np.asarray(cropped_lon)
cropped_lat = np.asarray(cropped_lat)

# THIS fails due to the length difference.
# Here I want to extract all rows which satisfy the positional restrictions.
for route in routes:
    print(route[route.longitude == cropped_lon and route.latitude == cropped_lat])

I am totally willing to discard my whole idea if anyone has a better one.


Update

The accepted answer solves the problem in the title, but I am looking for an extended solution. I hope it can be achieved in a similar way, which is why I leave it as an update.

My actual coordinate data has much higher resolution (6 decimals), but the measurements are not exact enough. As a result, the code in the accepted answer yields empty dataframes. I could take the shortest dataframe and then "slide" all other dataframes over it in order to do a least-squares fit, but I'm hoping there's a solution more similar to the one below.
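One possible workaround, sketched below under the assumption that the measurement noise is smaller than the spacing between distinct positions, is to round the coordinates to a common precision and merge on the rounded keys instead of the raw values (the frames, the `decimals` choice, and the helper `with_rounded_keys` are all made up for illustration):

```python
import pandas as pd

# Hypothetical example: two runs of the same route whose 6-decimal
# coordinates differ only by measurement noise.
a = pd.DataFrame({'time': ['t0', 't1'], 'speed': [10, 11],
                  'longitude': [19.701234, 19.711987],
                  'latitude': [48.671001, 48.650996]})
b = pd.DataFrame({'time': ['u0', 'u1'], 'speed': [12, 13],
                  'longitude': [19.701239, 19.711991],
                  'latitude': [48.671004, 48.650999]})

decimals = 4  # assumed noise scale: positions agree to ~1e-4 degrees

def with_rounded_keys(df, decimals):
    # Add rounded join keys so nearby coordinates compare equal.
    return df.assign(lon_r=df['longitude'].round(decimals),
                     lat_r=df['latitude'].round(decimals))

merged = pd.merge(with_rounded_keys(a, decimals),
                  with_rounded_keys(b, decimals),
                  on=['lon_r', 'lat_r'], how='inner')
print(len(merged))  # 2: both rows match after rounding
```

One caveat of this approach: two measurements that straddle a rounding boundary (e.g. 19.70149 vs 19.70151) still end up in different buckets, so it only absorbs noise well below the rounding step.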

GLaDER
  • Can you provide some sample data and desired output? In other words, a *[mcve]*. This will help us understand the problem and test solutions. – jpp Mar 09 '18 at 16:49
  • @jpp I have now added an example. – GLaDER Mar 10 '18 at 08:59

1 Answer


You can merge all data frames to keep only the overlapping parts. Let's start with your sample data:

cols = ['time','speed']
group_cols = ['longitude','latitude']

input_list = [[['t00','v00',19.70,48.67],
    ['t01','v01',19.71,48.65],
    ['t02','v02',19.72,48.64],
    ['t03','v03',19.73,48.64],
    ['t04','v04',19.74,48.63],
    ['t05','v05',19.74,48.63],
    ['t06','v06',19.75,48.64],
    ['t07','v07',19.75,48.64],
    ['t08','v08',19.75,48.64],
    ['t09','v09',19.75,48.64]],

    [['t10','v10',19.72,48.64],
    ['t11','v11',19.73,48.64],
    ['t12','v12',19.74,48.63],
    ['t13','v13',19.74,48.63],
    ['t14','v14',19.75,48.64],
    ['t15','v15',19.75,48.64],
    ['t16','v16',19.75,48.64]],

    [['t20','v20',19.72,48.64],
    ['t21','v21',19.73,48.64],
    ['t22','v22',19.74,48.63],
    ['t23','v23',19.74,48.63],
    ['t24','v24',19.75,48.64],
    ['t25','v25',19.75,48.63],
    ['t26','v26',19.75,48.64],
    ['t27','v27',19.75,48.64],
    ['t28','v28',19.75,48.64]]]

import pandas as pd
df_list = [pd.DataFrame(l, columns=[c + str(i) for c in cols] + group_cols)
           for i, l in enumerate(input_list)]

Now to merge them:

from functools import reduce
df = reduce(
    lambda x, y: pd.merge(x, y, on=group_cols, how='inner'), 
    df_list)

    +-----+--------+---------+------------+-----------+--------+---------+--------+--------+
    |     | time0  | speed0  | longitude  | latitude  | time1  | speed1  | time2  | speed2 |
    +-----+--------+---------+------------+-----------+--------+---------+--------+--------+
    |  0  | t02    | v02     | 19.72      | 48.64     | t10    | v10     | t20    | v20    |
    |  1  | t03    | v03     | 19.73      | 48.64     | t11    | v11     | t21    | v21    |
    |  2  | t04    | v04     | 19.74      | 48.63     | t12    | v12     | t22    | v22    |
    |  3  | t04    | v04     | 19.74      | 48.63     | t12    | v12     | t23    | v23    |
    |  4  | t04    | v04     | 19.74      | 48.63     | t13    | v13     | t22    | v22    |
    |  5  | t04    | v04     | 19.74      | 48.63     | t13    | v13     | t23    | v23    |
    |  6  | t05    | v05     | 19.74      | 48.63     | t12    | v12     | t22    | v22    |
    |  7  | t05    | v05     | 19.74      | 48.63     | t12    | v12     | t23    | v23    |
    |  8  | t05    | v05     | 19.74      | 48.63     | t13    | v13     | t22    | v22    |
    |  9  | t05    | v05     | 19.74      | 48.63     | t13    | v13     | t23    | v23    |
    | 10  | t06    | v06     | 19.75      | 48.64     | t14    | v14     | t24    | v24    |
    | 11  | t06    | v06     | 19.75      | 48.64     | t14    | v14     | t26    | v26    |
    | 12  | t06    | v06     | 19.75      | 48.64     | t14    | v14     | t27    | v27    |
    | 13  | t06    | v06     | 19.75      | 48.64     | t14    | v14     | t28    | v28    |
    | 14  | t06    | v06     | 19.75      | 48.64     | t15    | v15     | t24    | v24    |
    | 15  | t06    | v06     | 19.75      | 48.64     | t15    | v15     | t26    | v26    |
    | 16  | t06    | v06     | 19.75      | 48.64     | t15    | v15     | t27    | v27    |
    | 17  | t06    | v06     | 19.75      | 48.64     | t15    | v15     | t28    | v28    |
    | 18  | t06    | v06     | 19.75      | 48.64     | t16    | v16     | t24    | v24    |
    | 19  | t06    | v06     | 19.75      | 48.64     | t16    | v16     | t26    | v26    |
    | 20  | t06    | v06     | 19.75      | 48.64     | t16    | v16     | t27    | v27    |
    | 21  | t06    | v06     | 19.75      | 48.64     | t16    | v16     | t28    | v28    |
    | 22  | t07    | v07     | 19.75      | 48.64     | t14    | v14     | t24    | v24    |
    | 23  | t07    | v07     | 19.75      | 48.64     | t14    | v14     | t26    | v26    |
    | 24  | t07    | v07     | 19.75      | 48.64     | t14    | v14     | t27    | v27    |
    | 25  | t07    | v07     | 19.75      | 48.64     | t14    | v14     | t28    | v28    |
    | 26  | t07    | v07     | 19.75      | 48.64     | t15    | v15     | t24    | v24    |
    | 27  | t07    | v07     | 19.75      | 48.64     | t15    | v15     | t26    | v26    |
    | 28  | t07    | v07     | 19.75      | 48.64     | t15    | v15     | t27    | v27    |
    | 29  | t07    | v07     | 19.75      | 48.64     | t15    | v15     | t28    | v28    |
    | 30  | t07    | v07     | 19.75      | 48.64     | t16    | v16     | t24    | v24    |
    | 31  | t07    | v07     | 19.75      | 48.64     | t16    | v16     | t26    | v26    |
    | 32  | t07    | v07     | 19.75      | 48.64     | t16    | v16     | t27    | v27    |
    | 33  | t07    | v07     | 19.75      | 48.64     | t16    | v16     | t28    | v28    |
    | 34  | t08    | v08     | 19.75      | 48.64     | t14    | v14     | t24    | v24    |
    | 35  | t08    | v08     | 19.75      | 48.64     | t14    | v14     | t26    | v26    |
    | 36  | t08    | v08     | 19.75      | 48.64     | t14    | v14     | t27    | v27    |
    | 37  | t08    | v08     | 19.75      | 48.64     | t14    | v14     | t28    | v28    |
    | 38  | t08    | v08     | 19.75      | 48.64     | t15    | v15     | t24    | v24    |
    | 39  | t08    | v08     | 19.75      | 48.64     | t15    | v15     | t26    | v26    |
    | 40  | t08    | v08     | 19.75      | 48.64     | t15    | v15     | t27    | v27    |
    | 41  | t08    | v08     | 19.75      | 48.64     | t15    | v15     | t28    | v28    |
    | 42  | t08    | v08     | 19.75      | 48.64     | t16    | v16     | t24    | v24    |
    | 43  | t08    | v08     | 19.75      | 48.64     | t16    | v16     | t26    | v26    |
    | 44  | t08    | v08     | 19.75      | 48.64     | t16    | v16     | t27    | v27    |
    | 45  | t08    | v08     | 19.75      | 48.64     | t16    | v16     | t28    | v28    |
    | 46  | t09    | v09     | 19.75      | 48.64     | t14    | v14     | t24    | v24    |
    | 47  | t09    | v09     | 19.75      | 48.64     | t14    | v14     | t26    | v26    |
    | 48  | t09    | v09     | 19.75      | 48.64     | t14    | v14     | t27    | v27    |
    | 49  | t09    | v09     | 19.75      | 48.64     | t14    | v14     | t28    | v28    |
    | 50  | t09    | v09     | 19.75      | 48.64     | t15    | v15     | t24    | v24    |
    | 51  | t09    | v09     | 19.75      | 48.64     | t15    | v15     | t26    | v26    |
    | 52  | t09    | v09     | 19.75      | 48.64     | t15    | v15     | t27    | v27    |
    | 53  | t09    | v09     | 19.75      | 48.64     | t15    | v15     | t28    | v28    |
    | 54  | t09    | v09     | 19.75      | 48.64     | t16    | v16     | t24    | v24    |
    | 55  | t09    | v09     | 19.75      | 48.64     | t16    | v16     | t26    | v26    |
    | 56  | t09    | v09     | 19.75      | 48.64     | t16    | v16     | t27    | v27    |
    | 57  | t09    | v09     | 19.75      | 48.64     | t16    | v16     | t28    | v28    |
    +-----+--------+---------+------------+-----------+--------+---------+--------+--------+

And finally:

df_list_out = [
    df[[c + str(i) for c in cols] + group_cols].drop_duplicates() for i in range(len(input_list))]
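As a quick sanity check, the same pipeline run end-to-end on a trimmed two-frame version of the sample data (the row contents below are shortened, not the full example) produces output frames of equal length with identical coordinates:

```python
import pandas as pd
from functools import reduce

cols = ['time', 'speed']
group_cols = ['longitude', 'latitude']

# Two short routes; only the middle coordinates overlap.
input_list = [
    [['t00', 'v00', 19.70, 48.67],
     ['t01', 'v01', 19.72, 48.64],
     ['t02', 'v02', 19.73, 48.64]],
    [['t10', 'v10', 19.72, 48.64],
     ['t11', 'v11', 19.73, 48.64],
     ['t12', 'v12', 19.75, 48.64]],
]

df_list = [pd.DataFrame(l, columns=[c + str(i) for c in cols] + group_cols)
           for i, l in enumerate(input_list)]

# Inner-merge on the coordinates, then split back out per frame.
df = reduce(lambda x, y: pd.merge(x, y, on=group_cols, how='inner'), df_list)
df_list_out = [df[[c + str(i) for c in cols] + group_cols].drop_duplicates()
               for i in range(len(input_list))]

# Both output frames keep exactly the two overlapping positions.
assert all(len(d) == 2 for d in df_list_out)
```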
MaFF
  • This seems to do the trick, however, is there a way to add some margin of error to the comparison? I.e. instead of merging on `x==y` could one merge on `abs(x-y) < c`? – GLaDER Mar 11 '18 at 09:35
  • There is no `conditional join` in pandas; instead you'll have to do a cartesian product and then filter your data frame. It's hard to see how you want to implement this condition, though: is it applicable to any pair of data frames within the list? – MaFF Mar 11 '18 at 13:55
  • I opted for a more manual approach where I let the shortest df be the frame to which I try to fit all other frames such that the difference in lon/lat is minimized. Thank you again though. Your answer solves the initial problem. – GLaDER Mar 11 '18 at 13:57
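For reference, the cartesian-product-then-filter idea mentioned in the comments can be sketched for a pair of frames like this (the frames and `tol` are made up; `how='cross'` requires pandas >= 1.2, on older versions you can merge on a constant dummy key instead):

```python
import pandas as pd

# Hypothetical two-frame example with slightly noisy coordinates.
a = pd.DataFrame({'time0': ['t0', 't1'],
                  'lon0': [19.7012, 19.7120], 'lat0': [48.6710, 48.6510]})
b = pd.DataFrame({'time1': ['u0', 'u1'],
                  'lon1': [19.7013, 19.7121], 'lat1': [48.6711, 48.6509]})

tol = 1e-3  # assumed tolerance in degrees

# Cartesian product of the rows, then keep pairs within tolerance.
pairs = a.merge(b, how='cross')
close = pairs[(pairs['lon0'] - pairs['lon1']).abs().lt(tol)
              & (pairs['lat0'] - pairs['lat1']).abs().lt(tol)]
print(close[['time0', 'time1']])  # pairs (t0, u0) and (t1, u1)
```

Note that the product has len(a) * len(b) rows, so this only scales to moderately sized frames.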