36

I am currently reading in data into a dataframe that looks like this.

City         XCord    YCord   
Boston         5        2
Phoenix        7        3
New York       8        1
.....          .        .

I want to to create a Euclidean Distance Matrix from this data showing the distance between all city pairs so I get a resulting matrix like:

             Boston    Phoenix   New York
Boston         0        2.236      3.162
Phoenix        2.236      0        2.236
New York       3.162    2.236        0

There are many more cities and coordinates in my actual data frame so i need to to be able to somehow iterate over all of the city pairs and create a distance matrix like the one I have shown above but I am not sure how to pair all of the cites together and apply the Euclidean Distance formula? Any help would be appreciated.

Jeremy
  • 779
  • 3
  • 9
  • 14
  • Do you have any code already? Please provide at least a code in which you read these distances into memory to have something like cords[boston] = (5, 2) – pkacprzak Apr 06 '15 at 23:59
  • Right now im reading a CSV file like this: Data = pd.read_csv('C:\Users\Jerry\Desktop\cities.csv') – Jeremy Apr 07 '15 at 00:25
  • [see also](https://stackoverflow.com/questions/52366421/how-to-do-n-d-distance-and-nearest-neighbor-calculations-on-numpy-arrays) – Daniel F Sep 18 '18 at 10:43

7 Answers7

52

I think you are intrested in distance_matrix.

For example:

Create data:

import pandas as pd
from scipy.spatial import distance_matrix
    
data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

Output:

          xcord ycord
Boston      5   7
Phoenix     7   3
New York    8   1

Using the distance matrix function:

 pd.DataFrame(distance_matrix(df.values, df.values), index=df.index, columns=df.index)

Results:

          Boston    Phoenix     New York
Boston    0.000000  4.472136    6.708204
Phoenix   4.472136  0.000000    2.236068
New York  6.708204  2.236068    0.000000
Seanny123
  • 8,776
  • 13
  • 68
  • 124
Andrew
  • 3,711
  • 2
  • 20
  • 17
15

if you don't want to use scipy you can exploit list comprehension in this way:

dist = lambda p1, p2: sqrt(((p1-p2)**2).sum())
dm = np.asarray([[dist(p1, p2) for p2 in xy_list] for p1 in xy_list])
francesco lc
  • 159
  • 1
  • 5
  • This answer is super flexible and allows for the use of any distance function. Thanks for the share! – emil Oct 12 '22 at 07:26
8

I will give a method in pure python.

Import a sqrt function from math module:

from math import sqrt

Let assume that you have your coordinates in cords table in the following way:

cords['Boston'] = (5, 2)

Define a function to compute Euclidean distance of two given 2d points:

def dist(a, b):
    d = [a[0] - b[0], a[1] - b[1]]
    return sqrt(d[0] * d[0] + d[1] * d[1])

Initialize the resulting matrix as a dictionary:

D = {}

for city1, cords1 in cords.items():
    D[city1] = {}
    for city2, cords2 in cords.items():
        D[city1][city2] = dist(cords1, cords2)

D is your resulting matrix

The full source is below along with printed result:

from math import sqrt

cords = {}
cords['Boston'] = (5, 2)
cords['Phoenix'] = (7, 3)
cords['New York'] = (8, 1)

def dist(a, b):
    d = [a[0] - b[0], a[1] - b[1]]
    return sqrt(d[0] * d[0] + d[1] * d[1]) 

D = {}

for city1, cords1 in cords.items():
    D[city1] = {}
    for city2, cords2 in cords.items():
        D[city1][city2] = dist(cords1, cords2)   

for city1, v in D.items():
    for city2, d in v.items():
        print city1, city2, d

Results:

Boston Boston 0.0
Boston New York 3.16227766017
Boston Phoenix 2.2360679775
New York Boston 3.16227766017
New York New York 0.0
New York Phoenix 2.2360679775
Phoenix Boston 2.2360679775
Phoenix New York 2.2360679775
Phoenix Phoenix 0.0
pkacprzak
  • 5,537
  • 1
  • 17
  • 37
2

This is a pure Python and numpy solution for generating a distance matrix.

Redundant computations can skipped (since distance is symmetric, distance(a,b) is the same as distance(b,a) and there's no need to compute the distance twice).

data = [[5, 7], [7, 3], [8, 1]]
cities = ['Boston', 'Phoenix', 'New York']

# Euclidean distance between two points
from math import sqrt
dist = lambda a,b: sqrt((a[0]-b[0])**2+(a[1]-b[1])**2)

import numpy as np
n = len(data)
dist_matrix = np.zeros((n,n))    # initialize distance matrix to a square of zeros
for i in range(n):
    for j in range(i, n):
        dist_matrix[i,j] = dist(data[i], data[j])
        dist_matrix[j,i] = dist_matrix[i,j]       # for the symmetric part, no computation

Now dist_matrix[i,j] is the distance between city[i] and city[j].

user2314737
  • 27,088
  • 20
  • 102
  • 114
1
data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)

n_df=(df.values)
n_df

(df.values).shape

matrix=np.zeros(((df.values).shape[0],(df.values).shape[0]))
matrix


for i in range((df.values).shape[0]):
    for j in range((df.values).shape[0]):
        matrix[i,j]=np.sqrt(np.sum((n_df[i]-n_df[j])**2))
        #print('i',i,'j',j)


print(matrix)
Surya Gaur
  • 11
  • 4
  • Can you describe for the OP why this approach improves upon or provides a good alternative to the already good answers to the question? – phalteman Aug 02 '19 at 23:00
  • What is a minimum reproducible example? https://stackoverflow.com/help/minimal-reproducible-example – Nikolas Rieble Aug 08 '19 at 11:17
0

There's the function in scipy: scipy.spatial.distance.cdist()

Maassa
  • 11
  • 4
0

Refer

import pandas as pd
import numpy as np

data = [[5, 7], [7, 3], [8, 1]]
ctys = ['Boston', 'Phoenix', 'New York']
df = pd.DataFrame(data, columns=['xcord', 'ycord'], index=ctys)
x, y = df.xcord.to_numpy(), df.ycord.to_numpy()
x_y = df.values
%%timeit
pd.DataFrame(
    np.hypot(
        np.subtract.outer(x, x),
        np.subtract.outer(y, y)
    ),
    index=df.index, columns=df.index
)
# 32.9 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
pd.DataFrame(distance_matrix(x_y, x_y), index=df.index, columns=df.index)
# 49.8 µs ± 330 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Also compared to normal custom written sqrt methods, hypot is more resistant to overflows and underflows

Underflow

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

Overflow

i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf

No Underflow

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

No Overflow

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200
eroot163pi
  • 1,791
  • 1
  • 11
  • 23