
I need to efficiently calculate the weighted Euclidean distance from every x,y point in one array to every x,y point in a second array. This is the code I have, which works as expected:

import numpy as np
import random

def rand_data(integ):
    '''
    Function that generates 'integ' random values between [0.,1.)
    '''
    rand_dat = [random.random() for _ in range(integ)]

    return rand_dat

def weighted_dist(indx, x_coo, y_coo):
    '''
    Function that calculates *weighted* euclidean distances.
    '''
    dist_point_list = []
    # Iterate through every point in array_2.
    for indx2, x_coo2 in enumerate(array_2[0]):
        y_coo2 = array_2[1][indx2]
        # Weighted distance in x.
        x_dist_weight = (x_coo-x_coo2)/w_data[0][indx] 
        # Weighted distance in y.
        y_dist_weight = (y_coo-y_coo2)/w_data[1][indx] 
        # Weighted distance between point from array_1 passed and this point
        # from array_2.
        dist = np.sqrt(x_dist_weight**2 + y_dist_weight**2)
        # Append weighted distance value to list.
        dist_point_list.append(round(dist, 8))

    return dist_point_list


# Generate random x,y data points.
array_1 = np.array([rand_data(10), rand_data(10)], dtype=float)

# Generate weights for each x,y coord for points in array_1.
w_data = np.array([rand_data(10), rand_data(10)], dtype=float)

# Generate second larger array.
array_2 = np.array([rand_data(100), rand_data(100)], dtype=float)


# Obtain *weighted* distances for every point in array_1 to every point in array_2.
dist = []
# Iterate through every point in array_1.
for indx, x_coo in enumerate(array_1[0]):
    y_coo = array_1[1][indx]
    # Call function to get weighted distances for this point to every point in
    # array_2.
    dist.append(weighted_dist(indx, x_coo, y_coo))

The final list `dist` holds one sub-list per point in the first array; each sub-list contains the weighted distances from that point to every point in the second array.

I'd like to know if there's a way to make this code more efficient, perhaps using the `cdist` function, because this process becomes quite expensive when the arrays have lots of elements (which in my case they do) and when I have to check the distances for lots of arrays (which I also do).
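For reference, `scipy.spatial.distance.cdist` computes the unweighted all-pairs distances directly; it expects one point per row, so the (2, N) arrays above need transposing. A minimal sketch of that baseline (note that `cdist`'s weighted metrics, e.g. 'seuclidean', take a single weight vector across dimensions, not a different weight per point, so it can't express the per-point weighting above on its own):

from scipy.spatial.distance import cdist

# Unweighted baseline: row i holds the distances from point i of array_1
# to every point of array_2. Result shape is (10, 100).
unweighted = cdist(array_1.T, array_2.T)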

Gabriel

4 Answers


@Evan and @Martinis Group are on the right track - to expand on Evan's answer, here's a function that uses broadcasting to quickly calculate the n-dimensional weighted euclidean distance without Python loops:

import numpy as np

def fast_wdist(A, B, W):
    """
    Compute the weighted euclidean distance between two arrays of points:

    D{i,j} =
    sqrt( ((A{0,i}-B{0,j})/W{0,i})^2 + ... + ((A{k-1,i}-B{k-1,j})/W{k-1,i})^2 )

    inputs:
        A is a (k, m) array of coordinates
        B is a (k, n) array of coordinates
        W is a (k, m) array of weights

    returns:
        D is an (m, n) array of weighted euclidean distances
    """

    # compute the differences and apply the weights in one go using
    # broadcasting jujitsu. the result is (n, k, m)
    wdiff = (A[np.newaxis,...] - B[np.newaxis,...].T) / W[np.newaxis,...]

    # square and sum over the second axis, take the sqrt and transpose. the
    # result is an (m, n) array of weighted euclidean distances
    D = np.sqrt((wdiff*wdiff).sum(1)).T

    return D
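
The arrays from the question are already in this (2, N) layout, so using it there is a one-liner (a usage sketch):

d = fast_wdist(array_1, array_2, w_data)   # shape (10, 100)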

To check that this works OK, we'll compare it to a slower version that uses nested Python loops:

def slow_wdist(A, B, W):

    k,m = A.shape
    _,n = B.shape
    D = np.zeros((m, n))

    for ii in range(m):
        for jj in xrange(n):
            wdiff = (A[:,ii] - B[:,jj]) / W[:,ii]
            D[ii,jj] = np.sqrt((wdiff**2).sum())
    return D

First, let's make sure that the two functions give the same answer:

# make some random points and weights
def setup(k=2, m=100, n=300):
    return np.random.randn(k,m), np.random.randn(k,n), np.random.randn(k,m)

a, b, w = setup()
d0 = slow_wdist(a, b, w)
d1 = fast_wdist(a, b, w)

print(np.allclose(d0, d1))
# True

Needless to say, the version that uses broadcasting rather than Python loops is several orders of magnitude faster:

%%timeit a, b, w = setup()
slow_wdist(a, b, w)
# 1 loops, best of 3: 647 ms per loop

%%timeit a, b, w = setup()
fast_wdist(a, b, w)
# 1000 loops, best of 3: 620 us per loop
ali_m
  • I have no idea what 90% of this code does but it sure is fast :) Give me a couple of hours to check it and I'll come back to either mark it as accepted or comment on it. Oh and I've proposed an edit to your answer, please check it and see if it is right. Thank you! – Gabriel Oct 10 '13 at 01:24
  • Are you happy that you understand what this function does? The weights are applied to the differences in _x,y,..._ coordinates for A and B before squaring, the same as in your code. The _i,j_ th point in the output array is the weighted euclidean distance between the _i_ th point in A and the _j_ th point in B. – ali_m Oct 10 '13 at 08:28
  • I don't understand why you added a `.sum()` function to sum all the distances. My code doesn't do that, was it a slip or am I getting it all wrong? – Gabriel Oct 10 '13 at 11:20
  • My version works for the _k_ dimensional case, and the array of weighted differences called `wdiff` in my code is (_n,k,m_). To compute the euclidean norm, you take the square root of the sum of squared differences across dimensions, e.g. in the 2D case _sqrt((x0-x1)^2 + (y0-y1)^2)_. Summing over _k_ is just the generalisation of this to an arbitrary number of dimensions. So yes, you are already doing this when you do `dist = np.sqrt(x_dist_weight**2 + y_dist_weight**2)`. It might be a bit more obvious if I broke the operations down a bit more (see the step-by-step sketch after these comments), but I was aiming for speed over readability. – ali_m Oct 10 '13 at 11:27
  • Oh got it now, sorry I missed that. I'm testing the code right now, I'll come back as soon as possible to either mark it as accepted or comment further. Thanks for your patience! – Gabriel Oct 10 '13 at 11:36
  • Ok I tested it with my code and let me tell you: **WOW**. Not only is your answer ~120x faster than my original code, it's more than 2x faster than calculating the *non weighted* distances using `scipy.spatial.distance.cdist`. I definitely have to learn more about broadcasting. Thank you very much ali_m! – Gabriel Oct 10 '13 at 13:19
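
To make the broadcasting easier to follow, here is the same computation unpacked into named steps (a sketch for the 2-D case; the intermediate names are illustrative, not from the answer above):

import numpy as np

a, w = np.random.randn(2, 100), np.random.randn(2, 100)
b = np.random.randn(2, 300)
a3 = a[np.newaxis, ...]                # (1, 2, 100)
b3 = b[np.newaxis, ...].T              # (300, 2, 1): .T reverses all three axes
wdiff = (a3 - b3) / w                  # broadcasts to (300, 2, 100)
summed = (wdiff * wdiff).sum(axis=1)   # sum the squares over the k=2 axis -> (300, 100)
D = np.sqrt(summed).T                  # transpose to (100, 300), as slow_wdist returns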

You could use cdist if you don't need weighted distances. If you need weighted distances and performance, create an array of the appropriate output size, and use either an automated accelerator like Numba or Parakeet, or hand-tune the code with Cython.
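
A minimal sketch of the Numba route (assuming `numba` is installed; the function name and signature here are illustrative), keeping the same (k, m) / (k, n) array layout as the question:

import math
import numpy as np
from numba import njit

@njit
def wdist_numba(A, B, W):
    # A, W: (k, m); B: (k, n). Fills a preallocated (m, n) output array.
    k, m = A.shape
    n = B.shape[1]
    D = np.empty((m, n))
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for d in range(k):
                diff = (A[d, i] - B[d, j]) / W[d, i]
                acc += diff * diff
            D[i, j] = math.sqrt(acc)
    return D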

Aron Ahmadia

You can avoid looping by using code that looks like the following:

def compute_distances(A, B, W):
    # Assumes points are stored one per row: A is (m, 2), B is (n, 2).
    Ax = A[:,0].reshape(1, A.shape[0])
    Bx = B[:,0].reshape(B.shape[0], 1)
    dx = Bx-Ax

    # Same for dy.
    Ay = A[:,1].reshape(1, A.shape[0])
    By = B[:,1].reshape(B.shape[0], 1)
    dy = By-Ay

    dist = np.sqrt(dx**2 + dy**2) * W
    return dist

That will run a lot faster in Python than anything that loops, as long as you have enough memory for the arrays.

Evan
  • That is not exactly what I need. Note how in my code the weights are applied *inside* the square root, once for the x distance and another for the y distance. – Gabriel Oct 09 '13 at 17:03
  • That is a fairly easy change to make. The point is to use broadcasting and ufuncs to avoid looping constructs. In your case it looks like the weight arrays will have the same dimension as 'A', so you reshape them accordingly and multiply dx and dy by their individual weight vectors (see the sketch after this comment). – Martinis Group Oct 09 '13 at 22:33
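
A sketch of that change, applying the weights inside the square root as the question requires (the helper name and the one-point-per-row layout are assumptions, not from the answer above):

import numpy as np

def compute_weighted_distances(A, B, W):
    # A: (m, 2) points, B: (n, 2) points, W: (m, 2) weights for the points in A.
    dx = (B[np.newaxis, :, 0] - A[:, np.newaxis, 0]) / W[:, np.newaxis, 0]
    dy = (B[np.newaxis, :, 1] - A[:, np.newaxis, 1]) / W[:, np.newaxis, 1]
    # Weights are applied to the differences before squaring, as in the question.
    return np.sqrt(dx**2 + dy**2)   # (m, n)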

You could try removing the square root: distances are non-negative, so if a > b it follows that a squared > b squared... and square roots are comparatively expensive for computers to evaluate.
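
For example, a radius query can compare squared distances directly (a sketch with made-up array shapes, points stored one per row):

import numpy as np

a = np.random.rand(100, 2)
b = np.random.rand(200, 2)
d2 = ((a[:, None, :] - b[None, :, :])**2).sum(axis=-1)   # squared distances, (100, 200)
within = d2 < 0.1**2   # same mask as np.sqrt(d2) < 0.1, without any sqrt calls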

Mark Setchell
  • Why is this marked down please? Suppose you are looking for all points less than 10 metres apart, it is just as valid to test if the un-square-rooted distance is less than 100 (i.e. 10 squared) as it is to test if the square-rooted distance is less than 10 metres - and it saves all the square root operations. Likewise, if you want to find the 10 nearest pairs of coordinates, I can assure you that the 10 with the smallest squared distances will be exactly the same ones. Within the constraints of the original question, my answer is perfectly valid. – Mark Setchell Oct 10 '13 at 08:02
  • It depends on what you want to use the distance measure for - if you care about the relative distances between a0 and b, and a1 and b, then clearly squared distance is a very different metric to distance. Besides, _by far_ the biggest performance gains to be made are from getting rid of Python loops. – ali_m Oct 10 '13 at 09:17