comparing columns in two separate pandas dataframes

Question

I have two dataframes, both of which contain columns of latitude and longitude. For each lat/lon entry in the first dataframe, I want to evaluate each lat/lon pair in the second dataframe to determine distance.

For example:

df1:                     df2:

     lat     lon              lat     lon 
0   38.32  -100.50       0   37.65   -97.87
1   42.51   -97.39       1   33.31   -96.40
2   33.45  -103.21       2   36.22  -100.01

distance between 38.32,-100.50 and 37.65,-97.87
distance between 38.32,-100.50 and 33.31,-96.40
distance between 38.32,-100.50 and 36.22,-100.01
distance between 42.51,-97.39 and 37.65,-97.87
distance between 42.51,-97.39 and 33.31,-96.40
...and so on...

I'm not sure how to go about doing this.

Thanks for the help!

score 3 · Answer 1 · answered Apr 03 '17 at 18:54

Euclidean Distance is calculated as

$$d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$$

You can do this row by row for your two dataframes (rows are matched on their index labels, so this does not cover all combinations) like this:

((df1 - df2) ** 2).sum(1) ** .5

0    2.714001
1    9.253113
2    4.232363
dtype: float64
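
As a check on the first value above, the arithmetic for row 0 can be written out by hand. This is a minimal sketch in plain Python with illustrative variable names; the result is a difference in degrees, not a physical distance, and it should come out close to the 2.714001 shown in the output.

```python
import math

# Row 0 of df1 and row 0 of df2, taken from the question.
lat1, lon1 = 38.32, -100.50
lat2, lon2 = 37.65, -97.87

# Square the coordinate differences, add them, and take the square root.
d = math.sqrt((lat1 - lat2) ** 2 + (lon1 - lon2) ** 2)
print(round(d, 6))
```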

answered Apr 03 '17 at 18:54

piRSquared

Euclidean distance doesn't really make sense as a direct measure between lat/long coordinates. – root Apr 03 '17 at 19:29

score 3 · Answer 2 · edited May 23 '17 at 12:25

UPDATE: as noted by @root, the Euclidean metric doesn't make much sense in this case, so let's use sklearn.neighbors.DistanceMetric with the haversine metric:

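import numpy as np
import pandas as pd
from sklearn.neighbors import DistanceMetric  # newer scikit-learn: from sklearn.metrics import DistanceMetric
dist = DistanceMetric.get_metric('haversine')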

First we can build a DataFrame with all combinations (a cross join, credit to @root):

x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k', axis=1)

Then compute the haversine distances in a vectorized way. The inputs must be in radians, and 6367 is an approximate Earth radius in km:

x['dist'] = np.ravel(dist.pairwise(np.radians(df1), np.radians(df2)) * 6367)

Result:

In [86]: x
Out[86]:
    lat1    lon1   lat2    lon2         dist
0  38.32 -100.50  37.65  -97.87   242.073182
1  38.32 -100.50  33.31  -96.40   667.993048
2  38.32 -100.50  36.22 -100.01   237.350451
3  42.51  -97.39  37.65  -97.87   541.605087
4  42.51  -97.39  33.31  -96.40  1026.006744
5  42.51  -97.39  36.22 -100.01   734.219411
6  33.45 -103.21  37.65  -97.87   671.274044
7  33.45 -103.21  33.31  -96.40   632.004981
8  33.45 -103.21  36.22 -100.01   424.140594
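
On pandas 1.2 or later, the helper-key merge can likely be replaced by a direct cross join. A minimal sketch, assuming df1 and df2 hold the sample values from the question:

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

# how="cross" pairs every row of df1 with every row of df2,
# so the temporary key column "k" is not needed.
x = pd.merge(df1, df2, how="cross", suffixes=("1", "2"))
print(x)
```

The row order is the same as with the helper-key approach: all df2 rows for the first df1 row, then all df2 rows for the second, and so on.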

OLD answer:

If I understand correctly, you can compute pairwise distances with scipy.spatial.distance.pdist:

In [32]: from scipy.spatial.distance import pdist

In [43]: from itertools import combinations

In [34]: X = pd.concat([df1, df2])

In [35]: X
Out[35]:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

As a pandas Series:

In [36]: s = pd.Series(pdist(X),
                       index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))

In [37]: s
Out[37]:
0  1     5.218065
   2     5.573240
   0     2.714001
   1     6.473801
   2     2.156409
1  2    10.768287
   0     4.883646
   1     9.253113
   2     6.813846
2  0     6.793791
   1     6.811439
   2     4.232363
0  1     4.582194
   2     2.573810
1  2     4.636831
dtype: float64

As a pandas DataFrame:

In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
    df1  df2       dist
0     0    1   5.218065
1     0    2   5.573240
2     0    0   2.714001
3     0    1   6.473801
4     0    2   2.156409
5     1    2  10.768287
6     1    0   4.883646
7     1    1   9.253113
8     1    2   6.813846
9     2    0   6.793791
10    2    1   6.811439
11    2    2   4.232363
12    0    1   4.582194
13    0    2   2.573810
14    1    2   4.636831
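
Note that pdist on the concatenated frame also returns pairs taken from within the same dataframe (15 pairs in total). If only the df1-to-df2 pairs are wanted, scipy.spatial.distance.cdist is one option. A minimal sketch using the sample data; the names d and out are illustrative, and the values are still Euclidean distances in degrees:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Sample data from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

# d[i, j] is the Euclidean distance between row i of df1 and row j of df2.
d = cdist(df1[["lat", "lon"]].to_numpy(), df2[["lat", "lon"]].to_numpy())

# Flatten the 3x3 matrix to long form, keeping track of which rows were paired.
i, j = np.indices(d.shape)
out = pd.DataFrame({"df1": df1.index[i.ravel()],
                    "df2": df2.index[j.ravel()],
                    "dist": d.ravel()})
print(out)
```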

Euclidean distance doesn't really make sense as a direct measure between lat/long coordinates. — root, Apr 03 '17 at 19:30
@root, that's interesting - 'haversine' metric gives us close, but not exactly the same distances, compared to 'vincenty' method ... — MaxU - stand with Ukraine, Apr 03 '17 at 20:23
Yes, the haversine formula is based on a spherical earth, but the earth isn't really a perfect sphere; it's slightly fatter around the equator (oblate spheroid). The vincenty formula accounts for this. In most cases haversine should be very close to vincenty, especially for relatively close distances. The major differences would occur for points on opposite sides of the globe (antipodal points). — root, Apr 03 '17 at 20:40
@root, thank you for that very detailed and clear explanation! — MaxU - stand with Ukraine, Apr 03 '17 at 20:42

score 3 · Answer 3 · edited May 23 '17 at 11:54

You can perform a cross join to get all combinations of lat/lon, then compute the distance using an appropriate measure. To do so, you can use the geopy package, which supplies geopy.distance.vincenty and geopy.distance.great_circle. Both should give valid distances, with vincenty giving more accurate results, but being computationally slower.

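import pandas as pd
# Note: geopy 2.0 removed vincenty; on newer versions use geopy.distance.geodesic instead.
from geopy.distance import vincenty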

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])

    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)

I used kilometers as my units in the example, but others can be specified, e.g.:

vincenty(latlon1, latlon2).miles

The resulting output:

    lat1    lon1   lat2    lon2     distance
0  38.32 -100.50  37.65  -97.87   242.709065
1  38.32 -100.50  33.31  -96.40   667.878723
2  38.32 -100.50  36.22 -100.01   237.080141
3  42.51  -97.39  37.65  -97.87   541.184297
4  42.51  -97.39  33.31  -96.40  1024.839512
5  42.51  -97.39  36.22 -100.01   733.819732
6  33.45 -103.21  37.65  -97.87   671.766908
7  33.45 -103.21  33.31  -96.40   633.751134
8  33.45 -103.21  36.22 -100.01   424.335874

Edit

As noted by @MaxU in the comments, you can use a numpy implementation of the Haversine formula in a similar manner for extra performance. This should be equivalent to the great_circle function in geopy.
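
A numpy version of the haversine formula for all df1-to-df2 pairs might look like the sketch below. The function name haversine_pairs and the radius_km default are assumptions made for illustration (6367 km is simply the spherical Earth radius used elsewhere on this page), and this is not the code from the linked answer.

```python
import numpy as np
import pandas as pd


def haversine_pairs(a, b, radius_km=6367.0):
    """Great-circle distance in km between every row of a and every row of b.

    a and b are DataFrames with 'lat' and 'lon' columns in degrees.
    radius_km is an assumed radius for a spherical Earth.
    """
    lat1 = np.radians(a["lat"].to_numpy())[:, None]  # shape (n, 1)
    lon1 = np.radians(a["lon"].to_numpy())[:, None]
    lat2 = np.radians(b["lat"].to_numpy())[None, :]  # shape (1, m)
    lon2 = np.radians(b["lon"].to_numpy())[None, :]

    # Broadcasting (n, 1) against (1, m) yields an (n, m) matrix of pairs.
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(h))


# Sample data from the question.
df1 = pd.DataFrame({"lat": [38.32, 42.51, 33.45],
                    "lon": [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({"lat": [37.65, 33.31, 36.22],
                    "lon": [-97.87, -96.40, -100.01]})

d = haversine_pairs(df1, df2)
print(d.round(2))
```

Flattening the result with np.ravel(d) gives the distances in the same row order as the cross-joined frame, so it can be assigned directly as a new column there.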

I think you can use [vectorized haversine formula](http://stackoverflow.com/a/29546836/5741205) — MaxU - stand with Ukraine, Apr 03 '17 at 19:45
@MaxU: Thanks, I looked quickly for a haversine implementation but only found a pure python one. — root, Apr 03 '17 at 20:11
