Not sure I titled this well, but basically I have a reference coordinate, in the format (x, y, z), and a large list/array of coordinates in the same format. I need the Euclidean distance between the reference and each coordinate, so with numpy and scipy I should in theory be able to do an operation such as:

import numpy, scipy.spatial.distance
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)

distances = scipy.spatial.distance.euclidean(b, a)

But instead of getting an array back I get an error: `ValueError: Input vector should be 1-D.`

Not sure how to resolve this error and get what I want without resorting to loops and such, which sort of defeats the purpose of using numpy.

Long term I want to use these distances to calculate truth masks for counting distance values in bins.

I'm not sure if I'm just using the function wrong or using the wrong function, I haven't been able to find anything in the documentation that would work better.

Will
  • see this related post https://stackoverflow.com/q/27948363/2588210 – Christian K. Jul 09 '18 at 16:35
  • and have a look at the docs: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance – Christian K. Jul 09 '18 at 16:39
  • Which documentation are you reading? The docs at https://docs.scipy.org/doc/scipy-1.0.0/reference/spatial.distance.html seem pretty clear that this function takes 1D inputs. – user2699 Jul 09 '18 at 16:49
  • It is really easy to implement problems like this if you can use a compiler, e.g. outperforming cdist: https://stackoverflow.com/a/49573091/4045774 Also be precise about what you have as input (a list is very different from an array). Some examples on bincounting/digitize: https://stackoverflow.com/a/50496823/4045774 – max9111 Jul 09 '18 at 21:42
  • @ChristianK. your link doesn't work, but I reviewed the spatial library before I came here, and have in fact been using `cdist` and `pdist` quite a bit. @max9111 I don't have a list, as I explicitly converted them to arrays; some scipy things work with lists, some do not, so I always err on the side of going for an array when I can. – Will Jul 10 '18 at 14:18
  • @max9111 also wanted to ask, when you say compiler I assume you mean explicitly writing the code in another language, as opposed to Python? I make extensive use of matplotlib and numpy, and I'm not even sure if they've been ported to any other language, so I tend to stick with scripting when I need to use them. Any advice regarding that is very welcome though. – Will Jul 10 '18 at 14:19
  • @Will I'll add a small explanation to the compiler mentioned in max9111's post. – JE_Muc Jul 10 '18 at 14:25
  • @Will Take for example a look at clang. It translates C code to LLVM-IR, and after this step the LLVM backend produces machine code. Numba, for example, translates Python code to LLVM-IR, which is then compiled to machine code by the LLVM backend. Cython translates Python code to C, and the machine code is then created by a C compiler. Only a small subset of the Python language and a very small subset of modules is supported by Numba, but it is a nice tool to improve the performance of critical code paths. – max9111 Jul 10 '18 at 14:55

4 Answers


The documentation of scipy.spatial.distance.euclidean states that only 1-D vectors are allowed as inputs. Thus you must loop over your array like:

import numpy as np
import scipy.spatial.distance

distances = np.empty(b.shape[0])
for i in range(b.shape[0]):
    distances[i] = scipy.spatial.distance.euclidean(a, b[i])

If you want a vectorized implementation, you need to write your own function. Using np.vectorize with a suitable signature will also work, but it is in fact just shorthand for a for-loop and will thus have the same performance as a plain for-loop.
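For illustration (this sketch is my addition, not part of the original answer), `np.vectorize` with a gufunc-style signature looks like this; it gives convenient call syntax but still loops in Python under the hood:

```python
import numpy as np

# '(n),(n)->()' means: take two 1-D vectors, return a scalar,
# looping over any leading (broadcast) dimensions.
vec_euc = np.vectorize(lambda u, v: np.sqrt(((u - v) ** 2).sum()),
                       signature='(n),(n)->()')

a = np.array([1, 1, 1])
b = np.random.rand(20, 3)
distances = vec_euc(b, a)  # shape (20,)
```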

As stated in my comment on hannes wittingham's solution, I'll post a one-liner which focuses on performance:

distances = ((b - a)**2).sum(axis=1)**0.5

Writing out all the calculations reduces the number of separate function calls, and thus assignments of intermediate results to new arrays. That makes it about 22% faster than hannes wittingham's solution for an array shape of b.shape == (20, 3), and about 5% faster for b.shape == (20000, 3):

a = np.array([1, 1, 1])
b = np.random.rand(20, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 5.37 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit euclidean_distances(a, b)
# 6.89 µs ± 345 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

b = np.random.rand(20000, 3)
%timeit ((b - a)**2).sum(axis=1)**0.5
# 588 µs ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit euclidean_distances(a, b)
# 616 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

But you are losing the flexibility of being able to easily switch the distance calculation routine. When using the scipy.spatial.distance module, you can change the calculation routine by simply calling another method.

To improve the calculation performance even further, you can use a jit (just in time) compiler like numba for your functions:

import numba as nb
@nb.njit
def euc(a, b):
    return ((b - a)**2).sum(axis=1)**0.5

This reduces the time needed for the calculations by about 70% for small arrays and by about 60% for large arrays. Unfortunately, the axis keyword of np.linalg.norm is not yet supported by numba.
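Following up on max9111's comment that explicit loops often compile better under numba than vectorized expressions, here is a hedged sketch (the function name `euc_loop` is mine, and the `try/except` fallback only exists so the snippet runs even without numba installed):

```python
import numpy as np

try:
    import numba as nb
    njit = nb.njit
except ImportError:  # fall back to plain Python if numba is absent
    njit = lambda f: f

@njit
def euc_loop(a, b):
    # explicit loops avoid the unsupported axis keyword entirely
    n = b.shape[0]
    out = np.empty(n)
    for i in range(n):
        s = 0.0
        for j in range(b.shape[1]):
            d = b[i, j] - a[j]
            s += d * d
        out[i] = s ** 0.5
    return out
```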

JE_Muc
  • I like this very much, and I see your point about losing flexibility. I found if I use numpy.linalg I can do what I'm looking for, as `distances = numpy.linalg(a - b, axis=1)` and this gets me what I would expect, it's not as flexible as the scipy method for sure, but it is one fairly quick line. So I think it will be situation whether I use one or the other. I'm going to post it separately for reference. – Will Jul 10 '18 at 14:16
  • I guess you mean `numpy.linalg.norm`? It is indeed a good way to do the calculations. I'm using this as well when my code is not performance critical (well, in Python it generally never should be, but saving some time is always nice :) ). Don't know why I forgot to mention it in my answer. `np.linalg.norm` should be, depending on the array size, about 5%-30% slower than `((b - a)**2).sum(axis=1)**0.5`. And thanks a lot for marking my answer, I'm glad I was able to help. – JE_Muc Jul 10 '18 at 14:24
  • @scotty Writing Numba functions in a vectorized way isn't really recommendable. Sometimes it works as fast as simple loops (if Numba can translate it to those), but many times it is slower. Also `@nb.njit(fastmath=True)` may be beneficial (often it is necessary for SIMD vectorization). – max9111 Jul 10 '18 at 15:00
  • True, but since this question is not about numba or highly optimizing the performance of code, I wanted to just post a simple and short example of what was meant with compiling. And in this case the performance gain is negligible compared to the additional effort. – JE_Muc Jul 10 '18 at 15:06
  • @Scotty1- that's a good point, 30% is a big change in calculation time, I'll have to give that a try and see if it will give me a faster run as my data set is fairly large. – Will Jul 11 '18 at 15:49
  • @max9111 what's nb.njit? I'm only peripherally aware of fastmath libraries, but if it's a place I should be looking I would love a bit more information if you don't mind. – Will Jul 11 '18 at 15:49

It's not actually too hard to write your own function to do this - here's mine, which you're welcome to use.

If you are carrying out this operation over a large number of points and speed matters, I would guess this function will beat a for-loop based solution for speed by a long way - numpy is designed to be efficient when carrying out operations on a whole matrix.

import numpy
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)

def euclidean_distances(ref_point, co_ords_array):
    # subtract, square, sum along each row, then take the square root
    diffs = co_ords_array - ref_point
    sqrd_diffs = numpy.square(diffs)
    sum_sqrd_diffs = numpy.sum(sqrd_diffs, axis=1)
    euc_dists = numpy.sqrt(sum_sqrd_diffs)
    return euc_dists
  • I fully support your solution and thus gave you an upvote, since I also prefer to write such calculations on my own. As shown in my [solution to this question](https://stackoverflow.com/a/51077781/6345518), writing out the calculations is even more efficient (and it can be condensed to a one-liner). I added this to my solution here. The reason why I used the `scipy.spatial.distance` module is one major advantage of it: **you can easily and flexibly use all kinds of spatial distance calculations by just changing the function call**. – JE_Muc Jul 10 '18 at 09:08
  • I like this a lot, though I agree with @Scotty1-, being able to use the distances library is of good value. I try to not rewrite code when possible, but sometimes it is necessary, even desirable. And it can be very satisfying depending on the work being done. Thanks. – Will Jul 10 '18 at 14:22

This code gets the Euclidean norm, which should work in many cases, and it is fairly quick and a single line. Other methods are more efficient or flexible depending on your needs, and I would favour some of the other solutions posted depending on the work being done.

import numpy
a = numpy.array([1,1,1])
b = numpy.random.rand(20,3)

distances = numpy.linalg.norm(a - b, axis = 1)
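Since the end goal mentioned in the question is counting distance values in bins, the result feeds directly into `numpy.histogram` (a sketch of my own; the choice of 5 bins is arbitrary):

```python
import numpy

a = numpy.array([1, 1, 1])
b = numpy.random.rand(20, 3)
distances = numpy.linalg.norm(a - b, axis=1)

# counts per bin and the bin edges; every point falls in exactly one bin
counts, edges = numpy.histogram(distances, bins=5)
```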
Will

Note the extra set of `[]` in the definition of `a`:

import numpy, scipy.spatial.distance
a = numpy.array([[1,1,1]])
b = numpy.random.rand(20,3)

distances = scipy.spatial.distance.cdist(b, a, metric='euclidean')
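One caveat (my addition, not part of the original answer): `cdist` returns a 2-D array, here of shape `(20, 1)`, so flattening may be needed if you want a 1-D vector of distances:

```python
import numpy
import scipy.spatial.distance

a = numpy.array([[1, 1, 1]])   # note the 2-D shape (1, 3)
b = numpy.random.rand(20, 3)

# cdist yields shape (20, 1); ravel gives a flat (20,) vector
distances = scipy.spatial.distance.cdist(b, a, metric='euclidean').ravel()
```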
elfnor