
I have searched for a way to compute distances with einsum between numpy arrays that differ in their number of rows but share the same number of columns. I have tried various combinations, but the only way I can do it successfully is with the following code. I am obviously missing something, and the literature and numerous threads haven't led me any closer to a solution. I would appreciate a general form in which the origins can be any number of points and the destination array any size. I am only working with 2D arrays and have no intention of extending this to other dimensions. I am also familiar with pdist, cdist and other ways of reaching my desired result; however, I am only interested in einsum since I want to round out my arsenal of examples. Any help would be appreciated.

import numpy as np
origs = np.array([[0.,0.],[1.,0.],[0.,1.],[1.,1.]])
dests = np.asarray([[4.,0.],[1.,1.],[2.,2.],[2.,3.],[0.,5.]])
for i in origs:
    d =np.sqrt(np.einsum("ij,ij->i", i-dests, i-dests))
    print("orig {}...dist: {}".format(i,d))

The following results are what I am looking for...

orig [ 0.  0.]...dist: [ 4.          1.41421356  2.82842712  3.60555128  5.        ]
orig [ 1.  0.]...dist: [ 3.          1.          2.23606798  3.16227766  5.09901951]
orig [ 0.  1.]...dist: [ 4.12310563  1.          2.23606798  2.82842712  4.        ]
orig [ 1.  1.]...dist: [ 3.16227766  0.          1.41421356  2.23606798  4.12310563]
  • It's worth mentioning for future readers that [`cdist`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html) will still handily beat `np.einsum` in terms of performance (by an order of magnitude in Divakar's example). – ali_m Aug 22 '15 at 15:09
  • @ali_m In my field (GIS), the total number of origins may be on the order of less than 100 and the destinations less than a thousand or so. In the end, pure python/numpy solutions remove the need to install and maintain other libraries. Although they may be attractive and speedy, they sometimes obfuscate what is actually going on behind the scenes. So when orders of magnitude only amount to seconds or even tens of seconds, it becomes a non-issue. And since I teach, I like students to see a variety of solutions rather than just the speediest. It promotes diversity in thinking. –  Aug 22 '15 at 21:37
  • Fair enough. Questions about efficiently computing Euclidean distances tend to crop up several times a day here, so my intent was just to point future readers in the direction of what's likely to be the fastest solution (especially since most people using numpy also tend to have scipy installed). – ali_m Aug 22 '15 at 21:44
  • I suppose all fields have their differences; numpy and python have a very short history of being integrated into commercial GIS packages, and it has only been very recently that SciPy, Pandas etc. have been accessible within the software. People still view the latter two as a film genre and a bear. –  Aug 22 '15 at 21:57
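
For future readers following ali_m's suggestion above, the equivalent `cdist` call would be roughly as follows (a minimal sketch; it assumes scipy is installed and reuses the `origs`/`dests` arrays from the question):

from scipy.spatial.distance import cdist

# all pairwise Euclidean distances in one call, using the same origs/dests as above;
# rows correspond to origins, columns to destinations
d_all = cdist(origs, dests)          # shape (4, 5), matches the loop output row by row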

1 Answer


If I understood the question correctly, the for-loop code that you have posted already looks generic to me when considering 2D arrays only. Now, if you are looking for a generic vectorized solution with a single call to np.einsum, you could bring broadcasting into play, like so -

d_all = np.sqrt(np.einsum('ijk->ij',(origs[:,None,:] - dests)**2))

Sample run -

In [85]: origs = np.array([[0.,0.],[1.,0.],[0.,1.],[1.,1.]])
    ...: dests = np.asarray([[4.,0.],[1.,1.],[2.,2.],[2.,3.],[0.,5.]])
    ...: 

In [86]: for i in origs:
    ...:     d =np.sqrt(np.einsum("ij,ij->i", i-dests, i-dests))
    ...:     print(d)
    ...:     
[ 4.          1.41421356  2.82842712  3.60555128  5.        ]
[ 3.          1.          2.23606798  3.16227766  5.09901951]
[ 4.12310563  1.          2.23606798  2.82842712  4.        ]
[ 3.16227766  0.          1.41421356  2.23606798  4.12310563]

In [87]: np.sqrt(np.einsum('ijk->ij',(origs[:,None,:] - dests)**2))
Out[87]: 
array([[ 4.        ,  1.41421356,  2.82842712,  3.60555128,  5.        ],
       [ 3.        ,  1.        ,  2.23606798,  3.16227766,  5.09901951],
       [ 4.12310563,  1.        ,  2.23606798,  2.82842712,  4.        ],
       [ 3.16227766,  0.        ,  1.41421356,  2.23606798,  4.12310563]])
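
To make the broadcasting in that one-liner explicit, here is a step-by-step shape walkthrough (a sketch using the same arrays; the intermediate names `diff` and `sq` are just for illustration):

diff = origs[:, None, :] - dests      # (4, 1, 2) broadcast against (5, 2) -> (4, 5, 2)

# 'ijk->ij' sums the squared component differences over the last axis (k),
# leaving one squared distance per (origin, destination) pair
sq = np.einsum('ijk->ij', diff**2)    # shape (4, 5)
d_all = np.sqrt(sq)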

As per the comments by @hpaulj, you can also perform the squaring with np.einsum itself, like so -

subts = origs[:,None,:] - dests
d_all = np.sqrt(np.einsum('ijk,ijk->ij',subts,subts))

Here's a runtime test comparing this with the previous approach, which had the squaring done outside of np.einsum -

In [7]: def all_einsum(origs,dests):
   ...:     subts = origs[:,None,:] - dests
   ...:     return np.sqrt(np.einsum('ijk,ijk->ij',subts,subts))
   ...: 
   ...: def partial_einsum(origs,dests):
   ...:     return np.sqrt(np.einsum('ijk->ij',(origs[:,None,:] - dests)**2))
   ...: 

In [8]: origs = np.random.rand(400,100)

In [9]: dests = np.random.rand(500,100)

In [10]: %timeit all_einsum(origs,dests)
10 loops, best of 3: 139 ms per loop

In [11]: %timeit partial_einsum(origs,dests)
1 loops, best of 3: 251 ms per loop
  • `einsum` could also be used to take the square. – hpaulj Aug 22 '15 at 09:55
  • @hpaulj Is that by repeating it like `np.sqrt(np.einsum('ijk,ijk->ij',origs[:,None,:] - dests,(origs[:,None,:] - dests)))`, or is there some shorter code to do so? – Divakar Aug 22 '15 at 10:05
  • Yes; though not as one line. – hpaulj Aug 22 '15 at 10:50
  • @hpaulj Nice, it seems its quite faster! Would you like to post that as an answer? Would be nice to share it that way I think. – Divakar Aug 22 '15 at 10:57
  • Brilliant! I had played with ellipses to no avail, but I hadn't considered None. I will re-examine the documentation, but you have solved the immediate quest. –  Aug 22 '15 at 12:08
  • @hpaulj Added in your suggested way into the code. Hope that's okay! :) – Divakar Aug 22 '15 at 14:10
  • @DanPatterson Added in the actual code to get squaring done with `np.einsum` itself. – Divakar Aug 22 '15 at 14:10
  • @Divakar Is there any way to avoid the explicit calculation of `subts = origs[:,None,:] - dests`? In my case `subts` will not fit into memory. Can we use `einsum` to avoid computing `subts`? Specifically, in my case, `origs` and `dests` are `10000 x 784`. – Autonomous Jun 26 '17 at 09:18
  • @ParagS.Chandakkar See if this helps out - https://stackoverflow.com/a/42994680/3293881 – Divakar Jun 26 '17 at 09:21
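
For anyone hitting the memory issue raised in the last two comments: a standard workaround (not necessarily what the linked answer does) is the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, which never materializes the (M, N, K) difference tensor. A rough sketch:

# squared row norms plus a cross term; only (M,), (N,) and (M, N) arrays are built
orig_sq = np.einsum('ij,ij->i', origs, origs)      # (M,)
dest_sq = np.einsum('ij,ij->i', dests, dests)      # (N,)
sq_d = orig_sq[:, None] + dest_sq[None, :] - 2.0 * origs.dot(dests.T)
d_all = np.sqrt(np.maximum(sq_d, 0.0))             # clip tiny negatives from round-off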