
Input

known_array : numpy array; consisting of scalar values only; shape: (m, 1)

test_array : numpy array; consisting of scalar values only; shape: (n, 1)

Output

indices : numpy array; shape: (n, 1); for each value in test_array, the index of the closest value in known_array

residual : numpy array; shape: (n, 1); for each value in test_array, the difference from the closest value in known_array

Example (assuming `import numpy as np` and `import random`)

In [17]: known_array = np.array([random.randint(-30,30) for i in range(5)])

In [18]: known_array
Out[18]: array([-24, -18, -13, -30,  29])

In [19]: test_array = np.array([random.randint(-10,10) for i in range(10)])

In [20]: test_array
Out[20]: array([-6,  4, -6,  4,  8, -4,  8, -6,  2,  8])

Sample Implementation (Not fully vectorized)

def find_nearest(known_array, value):
    idx = (np.abs(known_array - value)).argmin()   # index of the closest known value
    diff = known_array[idx] - value
    return [idx, -diff]                            # -diff == value - known_array[idx]

In [22]: indices = np.zeros(len(test_array))

In [23]: residual = np.zeros(len(test_array))

In [24]: for i in range(len(test_array)):
   ....:     [indices[i], residual[i]] = find_nearest(known_array, test_array[i])
   ....:     

In [25]: indices
Out[25]: array([ 2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.,  2.])

In [26]: residual
Out[26]: array([  7.,  17.,   7.,  17.,  21.,   9.,  21.,   7.,  15.,  21.])

What is the best way to speed up this task? Cython is an option, but I would always prefer to be able to remove the for loop and keep the code pure NumPy.


NB: The following Stack Overflow questions were consulted:

  1. Python/Numpy - Quickly Find the Index in an Array Closest to Some Value
  2. Find the index of numerically closest value
  3. Find nearest value in numpy array
  4. Finding the nearest value and return the index of array in Python
  5. finding nearest items across two lists/arrays in Python

Updates

I did some small benchmarks for comparing the non-vectorized and vectorized solution (accepted answer).

In [48]: [indices1, residual1] = find_nearest_vectorized(known_array, test_array)

In [53]: [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)

In [54]: indices1==indices2
Out[54]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [55]: residual1==residual2
Out[55]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True], dtype=bool)

In [56]: %timeit [indices2, residual2] = find_nearest_non_vectorized(known_array, test_array)
10000 loops, best of 3: 173 µs per loop

In [57]: %timeit [indices1, residual1] = find_nearest_vectorized(known_array, test_array)
100000 loops, best of 3: 16.8 µs per loop

About a 10-fold speedup!

Clarification

known_array is not sorted.
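The searchsorted-based answers below handle this by sorting a copy internally: np.argsort gives the permutation needed to map positions in the sorted copy back to indices in the original array. For illustration:

import numpy as np

known_array = np.array([-24, -18, -13, -30, 29])
order = np.argsort(known_array)   # permutation that sorts known_array
print(order)                      # [3 0 1 2 4]
print(known_array[order])         # [-30 -24 -18 -13  29]
# order[k] maps an index into the sorted copy back to an index
# into the original, unsorted known_array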

I ran the benchmarks as given in the answer by @cyborg below.

Case 1: If known_array were sorted

known_array = np.arange(0,1000)
test_array = np.random.randint(0, 100, 10000)
print('Speedups:')
base_time = time_f('base')
for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
    print(func_name + ' is x%.1f faster than base.' % (base_time / time_f(func_name)))
    assert np.allclose(base(known_array, test_array), eval(func_name+'(known_array, test_array)'))

Speedups:
diffs is x0.4 faster than base.
searchsorted1 is x81.3 faster than base.
searchsorted2 is x107.6 faster than base.

First, for large arrays the diffs method is actually slower; it also eats up a lot of RAM, and my system hung when I ran it on actual data.

Case 2: When known_array is not sorted, which represents the actual scenario

known_array = np.random.randint(0,100,100)
test_array = np.random.randint(0, 100, 100)

Speedups:
diffs is x8.9 faster than base.
AssertionError                            Traceback (most recent call last)
<ipython-input-26-3170078c217a> in <module>()
      5 for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
      6     print func_name + ' is x%.1f faster than base.' % (base_time /  time_f(func_name))
----> 7     assert np.allclose(base(known_array, test_array),  eval(func_name+'(known_array, test_array)'))

AssertionError: 


searchsorted1 is x14.8 faster than base.

I must also note that the approach should be memory efficient; otherwise my 8 GB of RAM is not sufficient, though it is easily sufficient for the base case.
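For reference, the broadcasting approach can be kept memory-bounded by processing test_array in chunks, so the temporary difference matrix never exceeds m-by-chunk elements. A minimal sketch of that idea (the function name and chunk size here are mine, not from any answer):

import numpy as np

def find_nearest_chunked(known_array, test_array, chunk=1000):
    # Same results as the fully broadcast version, but peak memory is
    # m * chunk elements instead of m * n.
    indices = np.empty(len(test_array), dtype=np.intp)
    for start in range(0, len(test_array), chunk):
        block = test_array[start:start + chunk]
        diffs = block.reshape(1, -1) - known_array.reshape(-1, 1)
        indices[start:start + chunk] = np.abs(diffs).argmin(axis=0)
    residual = test_array - known_array[indices]
    return indices, residual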

– Nipun Batra
  • It does not matter if your data is sorted; the method posted by HYRY takes care of that scenario, and has linear rather than the quadratic memory performance of the diff method; his answer should be marked as the correct one – Eelco Hoogendoorn Jan 14 '14 at 21:36

4 Answers


If the array is large, you should use searchsorted:

import numpy as np
np.random.seed(0)
known_array = np.random.rand(1000)
test_array = np.random.rand(400)

%%time
differences = (test_array.reshape(1,-1) - known_array.reshape(-1,1))
indices = np.abs(differences).argmin(axis=0)
residual = np.diagonal(differences[indices,])

output:

CPU times: user 11 ms, sys: 15 ms, total: 26 ms
Wall time: 26.4 ms

searchsorted version:

%%time

index_sorted = np.argsort(known_array)
known_array_sorted = known_array[index_sorted]

idx1 = np.searchsorted(known_array_sorted, test_array)
idx1[idx1 == len(known_array_sorted)] = len(known_array_sorted) - 1  # guard: test values above the max
idx2 = np.clip(idx1 - 1, 0, len(known_array_sorted)-1)

diff1 = known_array_sorted[idx1] - test_array
diff2 = test_array - known_array_sorted[idx2]

indices2 = index_sorted[np.where(diff1 <= diff2, idx1, idx2)]
residual2 = test_array - known_array[indices2]

output:

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 311 µs

We can check that the results are the same:

assert np.all(residual == residual2)
assert np.all(indices == indices2)
– HYRY
  • I don't quite get the same results from both the methods – Nipun Batra Dec 26 '13 at 13:37
  • I have added a Clarification section towards the end of the question to explain this – Nipun Batra Dec 28 '13 at 05:16
  • The searchsorted algorithm works well for me but it fails if any value of `test_array` is larger than the max value of `known_array`. In this case, `np.searchsorted` will return an index which is 1 too large to be used as an index into `known_array_sorted`. I have edited the answer above to fix this case. Please check it works for you! – Jack Kelly Jul 08 '14 at 14:24

TL;DR: use numpy.searchsorted().

import inspect
from timeit import timeit
import numpy as np

known_array = np.arange(-10, 10)
test_array = np.random.randint(-10, 10, 1000)
number = 1000

def base(known_array, test_array):
    def find_nearest(known_array, value):
        idx = (np.abs(known_array - value)).argmin()
        return idx
    indices = np.zeros_like(test_array, dtype=known_array.dtype)
    for i in range(len(test_array)):
        indices[i] = find_nearest(known_array, test_array[i])
    return indices

def diffs(known_array, test_array):
    differences = (test_array.reshape(1,-1) - known_array.reshape(-1,1))
    indices = np.abs(differences).argmin(axis=0)
    return indices

def searchsorted1(known_array, test_array):
    index_sorted = np.argsort(known_array)
    known_array_sorted = known_array[index_sorted]
    idx1 = np.searchsorted(known_array_sorted, test_array)
    idx1[idx1 == len(known_array)] = len(known_array)-1
    idx2 = np.clip(idx1 - 1, 0, len(known_array_sorted)-1)
    diff1 = known_array_sorted[idx1] - test_array
    diff2 = test_array - known_array_sorted[idx2]
    indices2 = index_sorted[np.where(diff1 <= diff2, idx1, idx2)]
    return indices2

def searchsorted2(known_array, test_array):
    index_sorted = np.argsort(known_array)
    known_array_sorted = known_array[index_sorted]
    known_array_middles = known_array_sorted[1:] - np.diff(known_array_sorted.astype('f'))/2
    idx1 = np.searchsorted(known_array_middles, test_array)
    indices = index_sorted[idx1]
    return indices

def time_f(func_name):
    return timeit(func_name+"(known_array, test_array)",
        'from __main__ import known_array, test_array, ' + func_name, number=number)

print('Speedups:')
base_time = time_f('base')
for func_name in ['diffs', 'searchsorted1', 'searchsorted2']:
    print(func_name + ' is x%.1f faster than base.' % (base_time / time_f(func_name)))

Output:

Speedups:
diffs is x29.9 faster than base.
searchsorted1 is x37.4 faster than base.
searchsorted2 is x64.3 faster than base.
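searchsorted2 wins because, once the midpoints between consecutive sorted values are precomputed, each query needs only a single binary search and no comparison of two candidate neighbours. A small walk-through of the same idea on the question's data (my own sketch, outside the benchmark harness):

import numpy as np

known_array = np.array([-24, -18, -13, -30, 29])
test_array = np.array([-6, 4, -6, 4, 8, -4, 8, -6, 2, 8])

order = np.argsort(known_array)
known_sorted = known_array[order]               # [-30 -24 -18 -13  29]
# midpoints between consecutive sorted values: [-27. -21. -15.5  8.]
middles = known_sorted[1:] - np.diff(known_sorted.astype('f')) / 2
# the slot a test value falls into identifies its nearest sorted neighbour
indices = order[np.searchsorted(middles, test_array)]
print(indices)                            # [2 2 2 2 2 2 2 2 2 2]
print(test_array - known_array[indices])  # [ 7 17  7 17 21  9 21  7 15 21]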
– cyborg

For example, you can compute all the differences in one go with:

differences = (test_array.reshape(1,-1) - known_array.reshape(-1,1))

Then use argmin and fancy indexing along with np.diagonal to get the desired indices and differences:

indices = np.abs(differences).argmin(axis=0)
residual = np.diagonal(differences[indices,])

So for

>>> known_array = np.array([-24, -18, -13, -30,  29])
>>> test_array = np.array([-6,  4, -6,  4,  8, -4,  8, -6,  2,  8])

one gets

>>> indices
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
>>> residual
array([ 7, 17,  7, 17, 21,  9, 21,  7, 15, 21])
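To unpack the vectorization: reshaping to (1, n) and (m, 1) makes the subtraction broadcast to an (m, n) matrix of all pairwise differences, where element [i, j] is test_array[j] - known_array[i]; argmin over axis 0 then picks, for each test value, the known index with the smallest absolute difference. A sketch, with an alternative residual extraction via integer indexing (my variant, not part of the answer) that avoids the (n, n) intermediate created by differences[indices,]:

import numpy as np

known_array = np.array([-24, -18, -13, -30, 29])           # shape (m,) = (5,)
test_array = np.array([-6, 4, -6, 4, 8, -4, 8, -6, 2, 8])  # shape (n,) = (10,)

differences = test_array.reshape(1, -1) - known_array.reshape(-1, 1)
print(differences.shape)   # (5, 10): one row per known value

indices = np.abs(differences).argmin(axis=0)
# same values as np.diagonal(differences[indices,]), but without
# materialising an (n, n) intermediate array
residual = differences[indices, np.arange(test_array.size)]
print(indices)    # [2 2 2 2 2 2 2 2 2 2]
print(residual)   # [ 7 17  7 17 21  9 21  7 15 21]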
– alko
  • The solution works! I am just doing some small benchmarks to compare the performance. Would be great if you can highlight how you thought about vectorizing this task. That would be of more help! – Nipun Batra Dec 26 '13 at 06:33
  • Added the benchmarks in the Update section of the question. It may also be helpful to discuss the memory requirements by the two versions. – Nipun Batra Dec 26 '13 at 06:39

This is the fastest version I could come up with. It requires the array to be sorted; value can be a scalar or a list/array.

def find_nearest(value, array):
    # array must be sorted; value may be a scalar or an array
    idx = np.searchsorted(array, value, side="left")
    prev = np.clip(idx - 1, 0, len(array) - 1)
    idx = np.clip(idx, 0, len(array) - 1)  # guard idx == len(array): np.where evaluates both branches
    cond = np.fabs(value - array[prev]) < np.fabs(value - array[idx])
    return np.where(cond, array[prev], array[idx])
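A quick sanity check on the question's data (sorted first, since this version requires it). Note that it returns the nearest values themselves, not their indices:

import numpy as np

array = np.array([-30, -24, -18, -13, 29])          # sorted known values
print(find_nearest(-6, array))                      # -13
print(find_nearest(np.array([-6, 4, 35]), array))   # [-13 -13  29]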
– nupam