14

I have a .csv file that contains city, latitude and longitude data in the below format:

CITY|LATITUDE|LONGITUDE
A|40.745392|-73.978364
B|42.562786|-114.460503
C|37.227928|-77.401924
D|41.245708|-75.881241
E|41.308273|-72.927887

I need to create a distance matrix in the below format (please ignore the dummy values):

         A         B         C         D         E   
A  0.000000  6.000000  5.744563  6.082763  5.656854  
B  6.000000  0.000000  6.082763  5.385165  5.477226  
C  1.744563  6.082763  0.000000  6.000000  5.385165
D  6.082763  5.385165  6.000000  0.000000  5.385165  
E  5.656854  5.477226  5.385165  5.385165  0.000000  

I have loaded the data into a pandas dataframe and have created a cross join as below:

import pandas as pd
df_A = pd.read_csv('lat_lon.csv', delimiter='|', encoding="utf-8-sig")
df_B = df_A
df_A['key'] = 1
df_B['key'] = 1 
df_C = pd.merge(df_A, df_B, on='key')  
  • Can you please help me create the above matrix structure?
  • Also, is it possible to avoid step involving cross join?
IanS
  • 15,771
  • 9
  • 60
  • 84
Abacus
  • 197
  • 1
  • 2
  • 6
  • 1
    Do you have the algorithm to calculate the Euclidean distance? – StefanS Aug 29 '16 at 11:26
  • Possible duplicate of [Distance matrix for rows in pandas dataframe](http://stackoverflow.com/questions/29723560/distance-matrix-for-rows-in-pandas-dataframe) – IanS Aug 29 '16 at 11:29
  • @IanS, IMO it's not quite the same question (and the provided answer for the Euclidean distance is not the fastest one - as it uses the `apply` method) – MaxU - stand with Ukraine Aug 29 '16 at 12:13
  • The distance calculation itself is not hard once one has the three-dimensional coordinates of the cities, but since we don't have elevation information, we either need to assume a sphere or some form of geoid. The hard part is not really a programming question, hence me asking whether the OP has the algorithm already. – StefanS Aug 29 '16 at 15:02

3 Answers3

20

You can use pdist and squareform methods from scipy.spatial.distance:

In [12]: df
Out[12]:
  CITY   LATITUDE   LONGITUDE
0    A  40.745392  -73.978364
1    B  42.562786 -114.460503
2    C  37.227928  -77.401924
3    D  41.245708  -75.881241
4    E  41.308273  -72.927887

In [13]: from scipy.spatial.distance import squareform, pdist

In [14]: pd.DataFrame(squareform(pdist(df.iloc[:, 1:])), columns=df.CITY.unique(), index=df.CITY.unique())
Out[14]:
           A          B          C          D          E
A   0.000000  40.522913   4.908494   1.967551   1.191779
B  40.522913   0.000000  37.440606  38.601738  41.551558
C   4.908494  37.440606   0.000000   4.295932   6.055264
D   1.967551  38.601738   4.295932   0.000000   2.954017
E   1.191779  41.551558   6.055264   2.954017   0.000000
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • 3
    I've never used `pdist`, but according to the documentation it uses 2d Euclidean norm, which requires coordinates to be on a plane and have distance units, both of which are not true for longitude and latitude values. – StefanS Aug 29 '16 at 15:02
  • 2
    @StefanS, OP wants to have `Euclidean Distance` - which is pretty well defined and is a default method in `pdist`, if you or OP wants another method (minkowski, cityblock, seuclidean, sqeuclidean, cosine, correlation, hamming, jaccard, chebyshev, canberra, etc. - there are altogether 22 different metrics) you can simply specify it as a `metric` argument – MaxU - stand with Ukraine Aug 29 '16 at 15:08
  • Well, only the OP can really know what he wants. But Euclidean distance is well defined. If you have latitude and longitude on a sphere/geoid, you first need actual coordinates in a measure of length, otherwise your "distance" will depend not only on the relative distance of the points, but also on the absolute position on the sphere (towards the poles the same angle-distance becomes less length-distance). In any case, this part is not a programming question. Once we know which algorithm the OP wants to translate from lon/lat to 3D-coordinates, we can build his matrix. – StefanS Aug 29 '16 at 20:07
  • would "vincenty" be an option? – Sergio Lucero Apr 29 '18 at 18:50
  • @SergioLucero, sure one could use "vincenty" or [haversine](https://stackoverflow.com/a/43577275/5741205) formula, but it wasn't the main "issue" of the question... – MaxU - stand with Ukraine Apr 29 '18 at 18:54
2

the matrix can be directly created with cdist in scipy.spatial.distance:

from scipy.spatial.distance import cdist
df_array = df[["LATITUDE", "LONGITUDE"]].to_numpy()
dist_mat = cdist(df_array, df_array)
pd.DataFrame(dist_mat, columns = df["CITY"], index = df["CITY"])
simplyPTA
  • 343
  • 4
  • 11
1
for i in df["CITY"]:
    for j in df["CITY"]:
        row = df[df["CITY"] == j][["LATITUDE", "LONGITUDE"]]
        latitude = row["LATITUDE"].tolist()[0]
        longitude = row["LONGITUDE"].tolist()[0]
        df.loc[df['CITY'] == i, j] = ((df["LATITUDE"] - latitude)**2 + (df["LONGITUDE"] - longitude)**2)**0.5

df = df.drop(["CITY", "LATITUDE", "LONGITUDE"], axis=1)

This works

Himaprasoon
  • 2,609
  • 3
  • 25
  • 46
  • No, this calculates the Euclidean norm of the longitude and latitude values (which are in degrees of angle). That is not the Euclidean distance. – StefanS Aug 29 '16 at 14:51
  • @StefanS How is it different from each other here in 2D? (norm and distance formulas ) – Himaprasoon Aug 29 '16 at 18:45