
I am converting code from MATLAB to Python in order to speed up simple operations. I have written a function containing nested loops and a conditional statement; its purpose is to return a list of indices of the nearest elements in array x when compared against array y. I am comparing on the order of 1e5 items, which takes about 30 s to run. Any help to speed this process up would be greatly appreciated! I have had partial success with NumbaPro's automatic just-in-time compiler:

from numbapro import autojit  # numba's autojit works the same way

@autojit()
def find_nearest(x, y, idx):
    # For each element of y, store in idx the index of the nearest element of x.
    idx_old = 0
    rng1 = range(y.shape[0])
    rng2 = range(x.shape[0])
    for i in rng1:
        # Seed the search with the previous match as the current best.
        prev = abs(x[idx_old] - y[i])
        for j in rng2:
            # Linear scan over all of x for the smallest absolute difference.
            if abs(x[j] - y[i]) < prev:
                prev = abs(x[j] - y[i])
                idx_old = j
        idx[i] = idx_old
    return idx
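
For example, a minimal call (using the sample data given in the comments below):

import numpy as np

x = np.array([1.1, 2.3, 5.9, 8.5])   # search domain
y = np.array([0.2, 5.5, 12.0])       # query values
idx = np.zeros(y.shape[0], dtype=np.int64)
print(find_nearest(x, y, idx))       # -> [0 2 3]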

Sorry for being such a noob; I am brand new to Python!

Chris Church
    Could you update your script to include example data inputs to `find_nearest`, just so it's clear? – JoshAdel May 20 '15 at 20:11
  • For example, x = np.array([1.1,2.3,5.9,8.5]), y = np.array([0.2, 5.5, 12]) and idx = np.zeros(np.shape(y)) should return idx = [0,2,3] (the indices of the closest items in x to the items in y). In MATLAB I use knnsearch to perform this, and on my large dataset it takes approximately 2.5 s to solve, whereas my implementation takes around 30 s. The input arrays need not be in any particular sorted order. Thank you for taking a look! – Chris Church May 21 '15 at 06:44
  • I attempted to use scikit-learn's k-nearest-neighbors implementation (see the sketch after this thread); however, it returns an estimator trained on a dataset, and training on my full dataset just isn't feasible. My search domain contains 1023848 items and I am trying to find the 12325 closest items in it. – Chris Church May 21 '15 at 06:54
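
For context, a minimal sketch of the scikit-learn approach mentioned in the last comment, assuming x and y are 1-D NumPy arrays (whether fitting on ~1e6 points is feasible depends on available memory):

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Fit on the search domain x (reshaped to a column), then query with y.
nn = NearestNeighbors(n_neighbors=1).fit(x.reshape(-1, 1))
_, idx = nn.kneighbors(y.reshape(-1, 1))
idx = idx.ravel()  # kneighbors returns an (n_queries, 1) index array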

2 Answers

There is nothing wrong with your Numba code, except that the algorithm is not as efficient as it could be. Much better is to sort the x array and do a binary search, very similar to this answer and also this answer:

import numpy as np

def find_nearest(x, y):
    # Indices that would sort x.
    indices = np.argsort(x)

    # For each element of y, its insertion point into the sorted x.
    loc = np.searchsorted(x[indices], y)

    # Original indices of the sorted neighbors on either side of each y;
    # mode='clip' keeps the positions in bounds at the array edges.
    right = indices.take(loc, mode='clip')
    left = indices.take(loc - 1, mode='clip')

    # Pick whichever neighbor is closer.
    return np.where(np.abs(y - x[left]) < np.abs(y - x[right]), left, right)

On my PC this is about 80x faster than even the KDTree approach for x and y having 1e6 and 1e5 elements respectively. About two-thirds of the time is spent argsort-ing the array, so I don't think you can gain much with Numba here.
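
For example, with the sample arrays from the question's comments:

import numpy as np

x = np.array([1.1, 2.3, 5.9, 8.5])
y = np.array([0.2, 5.5, 12.0])
print(find_nearest(x, y))  # -> [0 2 3]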

  • Thank you very much for your response; I was able to implement your code. It ran around 270x faster on my dataset (finding around 1e4 indices in a set of around 1e6 items). Could you perhaps suggest a tutorial/article/webpage that you found useful in learning to implement more efficient code in Python? I reviewed the similar answers you suggested. Are indexed operations always so much faster? – Chris Church May 25 '15 at 10:01
  • @Chris I learned to code more efficient Python mainly from following the `[numpy]` tag on this site and experimenting / trial-and-error. Could you explain a bit more what you mean by "Are indexed operations always so much faster"? –  May 25 '15 at 15:18
  • By indexed operations I am referring to functions which return boolean/index arrays for further operations down the line. I originally tried to turn my "find_nearest" function into a single line of matrix and vector operations; however, I encountered a "MemoryError" when trying to run it on data sets of a substantial size. – Chris Church May 25 '15 at 16:01
  • @Chris Ah, I see. No, it's not about indexing operations vs matrix/vector operations. Your original algorithm has complexity O(`m*n`), where `m` and `n` are the lengths of `x` and `y`. You were probably also creating an array of size `m*n` (very big; see the sketch after this thread). With my code the complexity is O(`(m+n)*log(m)`) and the additional memory usage is on the order of O(`m+n`) (I think.. not sure about the sorting). –  May 25 '15 at 22:32
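
For context, the kind of vectorized one-liner described in the last comment would look roughly like the sketch below; it builds the full |y|-by-|x| distance matrix, which for m*n on the order of 1e10 entries is what triggers the MemoryError:

import numpy as np

# Hypothetical broadcasting version: allocates an (n, m) array of
# absolute differences before taking the per-row argmin.
idx = np.abs(y[:, None] - x[None, :]).argmin(axis=1)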

I have found an interim solution to my problem. By using scipy.spatial's KDTree I was able to cut the run time from 32 s to just under 10 s. This is still four times slower than MATLAB's knnsearch, and understanding how to speed up loops with conditional statements is still important, but for the moment this revised implementation is faster:

from scipy import spatial
from numpy import matrix

# Build the tree on x as an (n, 1) column vector and query it with y.
tree = spatial.KDTree(matrix(x).T)
(_, idxx) = tree.query(matrix(y).T)  # idxx holds the nearest-x index for each y

The arrays x and y were flat 1-D arrays; the tree requires its points and queries to be in column-vector form.
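
A plain reshape gives the same column shape without np.matrix, and scipy.spatial.cKDTree (the C implementation of the same interface) is typically much faster; a minimal sketch, assuming x and y are 1-D NumPy arrays:

import numpy as np
from scipy import spatial

# cKDTree: same query API as KDTree, implemented in C.
tree = spatial.cKDTree(x.reshape(-1, 1))
_, idxx = tree.query(y.reshape(-1, 1))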

Any suggestions to improve the run time of the original implementation would be greatly appreciated!

Chris Church