12

I am new to Python so this question might look trivia. However, I did not find a similar case to mine. I have a matrix of coordinates for 20 nodes. I want to compute the euclidean distance between all pairs of nodes from this set and store them in a pairwise matrix. For example, If I have 20 nodes, I want the end result to be a matrix of (20,20) with values of euclidean distance between each pairs of nodes. I tried to used a for loop to go through each element of the coordinate set and compute euclidean distance as follows:

ncoord=numpy.matrix('3225   318;2387    989;1228    2335;57      1569;2288  8138;3514   2350;7936   314;9888    4683;6901   1834;7515   8231;709   3701;1321    8881;2290   2350;5687   5034;760    9868;2378   7521;9025   5385;4819   5943;2917   9418;3928   9770')
n=20 
c=numpy.zeros((n,n))
for i in range(0,n):
    for j in range(i+1,n):
        c[i][j]=math.sqrt((ncoord[i][0]-ncoord[j][0])**2+(ncoord[i][1]-ncoord[j][1])**2)

How ever, I am getting an error of "input must be a square array ". I wonder if anybody knows what is happening here. Thanks

Juergen
  • 12,378
  • 7
  • 39
  • 55
Fairy
  • 379
  • 1
  • 5
  • 17
  • Please [edit] your question to include the definition of `ncoord`. Thanks for improving the question's reference value and making it more answerable! – Nathan Tuggy Feb 24 '15 at 02:50
  • What is your n? `for j in range(i+1,n-1)` will do `j=i+1, i+2, ..., n-2`. My guess is you want both those ranges to go up to `n`, not `n-1`. – MarkG Feb 24 '15 at 02:52
  • @MarkG yes I have 20 nodes (n=20) and I want both indices go up to n. I tried n instead of n-1 but I got the same error. I can easily code this in MATLAB but I have to use Python. Indexing in Python is different so I might be wrong. – Fairy Feb 24 '15 at 03:03
  • Then both your for loops should go up to n: `for i in range(0,n):`, and `for j in range(i+1,n):` If this isn't your bug you then will need to show more code. – MarkG Feb 24 '15 at 03:06
  • @MarkG yes this is not my bug. my code is what I mentioned in the main question. I have nothing more – Fairy Feb 24 '15 at 03:09
  • Why have you chosen to use a matrix rather than an array? Have you considered parsing the input data into the form np.array([ [x0,y0],[x1,y1] ... [xn,yn] ],dtype=([('x',' –  Feb 24 '15 at 03:12

4 Answers4

28

There are much, much faster alternatives to using nested for loops for this. I'll show you two different approaches - the first will be a more general method that will introduce you to broadcasting and vectorization, and the second uses a more convenient scipy library function.


  1. The general way, using broadcasting & vectorization

One of the first things I'd suggest doing is switching to using np.array rather than np.matrix. Arrays are preferred for a number of reasons, most importantly because they can have >2 dimensions, and they make element-wise multiplication much less awkward.

import numpy as np

ncoord = np.array(ncoord)

With an array, we can eliminate the nested for loops by inserting a new singleton dimension and broadcasting the subtraction over it:

# indexing with None (or np.newaxis) inserts a new dimension of size 1
print(ncoord[:, :, None].shape)
# (20, 2, 1)

# by making the 'inner' dimensions equal to 1, i.e. (20, 2, 1) - (1, 2, 20),
# the subtraction is 'broadcast' over every pair of rows in ncoord
xydiff = ncoord[:, :, None] - ncoord[:, :, None].T

print(xydiff.shape)
# (20, 2, 20)

This is equivalent to looping over every pair of rows using nested for loops, but much, much faster!

xydiff2 = np.zeros((20, 2, 20), dtype=xydiff.dtype)
for ii in range(20):
    for jj in range(20):
        for kk in range(2):
            xydiff[ii, kk, jj] = ncoords[ii, kk] - ncoords[jj, kk]

# check that these give the same result
print(np.all(xydiff == xydiff2))
# True

The rest we can also do using vectorized operations:

# we square the differences and sum over the 'middle' axis, equivalent to
# computing (x_i - x_j) ** 2 + (y_i - y_j) ** 2
ssdiff = (xydiff * xydiff).sum(1)

# finally we take the square root
D = np.sqrt(ssdiff)

The whole thing could be done in one line like this:

D = np.sqrt(((ncoord[:, :, None] - ncoord[:, :, None].T) ** 2).sum(1))

  1. The lazy way, using pdist

It turns out that there's already a fast and convenient function for computing all pairwise distances: scipy.spatial.distance.pdist.

from scipy.spatial.distance import pdist, squareform

d = pdist(ncoord)

# pdist just returns the upper triangle of the pairwise distance matrix. to get
# the whole (20, 20) array we can use squareform:

print(d.shape)
# (190,)

D2 = squareform(d)
print(D2.shape)
# (20, 20)

# check that the two methods are equivalent
print np.all(D == D2)
# True
Glorfindel
  • 21,988
  • 13
  • 81
  • 109
ali_m
  • 71,714
  • 23
  • 223
  • 298
  • This broadcasting is magical to me. How can I get some intuition about it? – sakimarquis Jul 26 '20 at 02:48
  • Thanks for the amazing method but it is still much slower than cross product whereas the complexity looks the same. – Jean Paul Feb 01 '21 at 14:16
  • I also had some memory issues when computing on a large matrix (1000 * 20000) with the method 1 which i did not have with the method 2 (scipy). – Jean Paul Feb 01 '21 at 17:58
5
for i in range(0, n):
    for j in range(i+1, n):
        c[i, j] = math.sqrt((ncoord[i, 0] - ncoord[j, 0])**2 
        + (ncoord[i, 1] - ncoord[j, 1])**2)

Note: ncoord[i, j] is not the same as ncoord[i][j] for a Numpy matrix. This appears to be the source of confusion. If ncoord is a Numpy array then they will give the same result.

For a Numpy matrix, ncoord[i] returns the ith row of ncoord, which itself is a Numpy matrix object with shape 1 x 2 in your case. Therefore, ncoord[i][j] actually means: take the ith row of ncoord and take the jth row of that 1 x 2 matrix. This is where your indexing problems comes about when j > 0.

Regarding your comments on assigning to c[i][j] "working", it shouldn't. At least on my build of Numpy 1.9.1 it shouldn't work if your indices i and j iterates up to n.

As an aside, remember to add the transpose of the matrix c to itself.

It is recommended to use Numpy arrays instead of matrix. See this post.

If your coordinates are stored as a Numpy array, then pairwise distance can be computed as:

from scipy.spatial.distance import pdist

pairwise_distances = pdist(ncoord, metric="euclidean", p=2)

or simply

pairwise_distances = pdist(ncoord)

since the default metric is "euclidean", and default "p" is 2.

In a comment below I mistakenly mentioned that the result of pdist is a n x n matrix. To get a n x n matrix, you will need to do the following:

from scipy.spatial.distance import pdist, squareform

pairwise_distances = squareform(pdist(ncoord))

or

from scipy.spatial.distance import cdist

pairwise_distances = cdist(ncoord, ncoord)
Community
  • 1
  • 1
lightalchemist
  • 10,031
  • 4
  • 47
  • 55
  • I actually did that but did not put it here. The last line of my code is: c[j][i]=c[i][j] – Fairy Feb 24 '15 at 03:13
  • Thank you it is working now. However I am misunderstood now. I thought when we want to call an element of a matrix in Python, we need to call it as a[][], but you are using a[,]. Why did you use the second format to read data from ncoord but saved the distance in c matrix by calling c elements like c[][]? – Fairy Feb 24 '15 at 03:17
  • Thank you so much for the complete information. I will try the other method you mentioned to see if I can get the same matrix size as the result ( I assume that pairwise_distances is a n*n matrix) – Fairy Feb 24 '15 at 03:20
  • Yes `pairwise_distances` will be a n x n matrix if n is the number of points you have. – lightalchemist Feb 24 '15 at 03:25
  • Thanks. But I still don't understand the difference between ncoord[i,j] and ncoord[i][j] – Fairy Feb 24 '15 at 03:29
  • @Fairy I've fixed the typo regarding assignment to an element of `c`. They should all use the same format `c[i, j]` rather than `c[i][j]`. – lightalchemist Feb 24 '15 at 03:30
  • c[i][j] is working ok. What is the difference between them?(c[i][j] and c[i,j]) – Fairy Feb 24 '15 at 03:32
  • @Fairy When you chain your indexing like that, what you are really doing is first grabbing the ith row of `c`, then indexing the jth element in the first dimension of that row. With an `np.array`, `c[i].shape == (n,)` (i.e. it has only one dimension that's `n` long) so indexing the jth element is no problem. However, `np.matrix` behaves differently in that `c[i].shape == (1, n)`. Since the singleton row dimension is kept, when you then try to index the jth element in the first dimension of `c[i]`, your index goes out of bounds. The take-home message is to use `np.array`, not `np.matrix`! – ali_m Feb 24 '15 at 04:21
1

What I figure you wanted to do: You said you wanted a 20 by 20 matrix... but the one you coded is triangular.

Thus I coded a complete 20x20 matrix instead.

distances = []
for i in range(len(ncoord)):
    given_i = []
    for j in range(len(ncoord)):
        d_val = math.sqrt((ncoord[i, 0]-ncoord[j,0])**2+(ncoord[i,1]-ncoord[j,1])**2)
        given_i.append(d_val)

    distances.append(given_i)

    # distances[i][j] = distance from i to j

SciPy way:

from scipy.spatial.distance import cdist
# Isn't scipy nice - can also use pdist... works in the same way but different recall method.
distances = cdist(ncoord, ncoord, 'euclidean')
Abhinav Ramakrishnan
  • 1,070
  • 12
  • 22
  • Thank you for your comment. I will try your method as well. – Fairy Feb 24 '15 at 03:28
  • 1
    Anytime you have to do a double loop through an array in numpy, you're losing the speed advantage afforded by NumPy in the firstplace. You want to broadcast whenever possible. However, for some operations, I think including this one, you can't broadcast because the values at each step depend on their neighbors. In these cases, SciPy solutions are usually optimized at the c-level (see cython), so they can still be much faster. I'd expect the cdist function to be much faster than the double loop. – Adam Hughes Feb 24 '15 at 03:51
0

Using your own custom sqrt sum sqaures is not always safe, they can overflow or underflow. Speed wise they are same

np.hypot(
    np.subtract.outer(x, x),
    np.subtract.outer(y, y)
)

Underflow

i, j = 1e-200, 1e-200
np.sqrt(i**2+j**2)
# 0.0

Overflow

i, j = 1e+200, 1e+200
np.sqrt(i**2+j**2)
# inf

No Underflow

i, j = 1e-200, 1e-200
np.hypot(i, j)
# 1.414213562373095e-200

No Overflow

i, j = 1e+200, 1e+200
np.hypot(i, j)
# 1.414213562373095e+200

Refer

eroot163pi
  • 1,791
  • 1
  • 11
  • 23