
Given that I have two different lists of ints.

a = [1, 4, 11, 20, 25] and b = [3, 10, 20]

I want to return a list of length len(b) that stores the closest number in a for each int in b.

So, this should return [4, 11, 20].

I can do this in brute force, but what is a more efficient way to do this?

EDIT: It would be great if I could do this with only the standard library, if possible.

Dawn17

6 Answers

>>> a = [1, 4, 11, 20, 25]
>>> b = [3, 10, 20]
>>> 
>>> ans = list(map(lambda y:min(a, key=lambda x:abs(x-y)),b))
>>> ans
[4, 11, 20]

It's a loop over the solution to 'get the number closest to a given value':

value = 3  # any target number
min(a, key=lambda x: abs(x - value))
Harry

A possibly more efficient solution is to use k-d trees:

import numpy as np
from scipy.spatial import cKDTree

def agn_val(a, b):
    """ Return **values** in a closest to the values in b """
    a = np.asarray(a)
    tr = cKDTree(a[:, None])
    return a[tr.query(np.atleast_2d(b).T)[1]].tolist()

def agn_idx(a, b):
    """ Return **indices of values** in a closest to the values in b """
    tr = cKDTree(np.atleast_2d(a).T)
    return tr.query(np.atleast_2d(b).T)[1].tolist()
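For example, on the lists from the question (assuming scipy is available), these give:

>>> a = [1, 4, 11, 20, 25]
>>> b = [3, 10, 20]
>>> agn_val(a, b)
[4, 11, 20]
>>> agn_idx(a, b)
[1, 2, 3]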

Timings:

Below I use a test similar to @eugenhu's, except that I increase the input list sizes (timings with small input lists are not reliable).

Also, let's define @jpp's function:

def jpp(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    return [a[np.abs(a - i).argmin()] for i in b]

NOTE: At the expense of memory usage, the following variation of @jpp's function is somewhat faster:

def jpp2(a, b):
    a = np.asarray(a)
    return a[np.argmin(np.abs(np.subtract.outer(a, b)), axis=0)]
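To see where the extra memory goes: np.subtract.outer(a, b) materialises the full len(a)-by-len(b) difference matrix before taking the argmin down each column. A quick check on the question's lists:

>>> np.subtract.outer([1, 4, 11, 20, 25], [3, 10, 20]).shape
(5, 3)
>>> jpp2([1, 4, 11, 20, 25], [3, 10, 20])
array([ 4, 11, 20])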

I have also found this solution: https://stackoverflow.com/a/45350318/8033585 that returns indices (like agn_idx()). A modified version that returns values is:

def closest_val(a, b):
    B = np.asarray(a)
    A = np.asarray(b)
    # original code from https://stackoverflow.com/a/45350318/8033585:
    L = B.size
    sidx_B = B.argsort()
    sorted_B = B[sidx_B]
    sorted_idx = np.searchsorted(sorted_B, A)
    sorted_idx[sorted_idx == L] = L - 1
    mask = ((sorted_idx > 0) &
            (np.abs(A - sorted_B[sorted_idx - 1]) < np.abs(A - sorted_B[sorted_idx])))
    return B[sidx_B[sorted_idx - mask]]
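A quick sanity check on the question's lists:

>>> closest_val([1, 4, 11, 20, 25], [3, 10, 20])
array([ 4, 11, 20])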

Then I generate a sample:

import random

random.seed(0)  # for repeatability
a = random.sample(range(1, 10000), 500)  # a contains unique values
b = [random.randint(0, 10000) for i in range(1000)]

Now the timings (f and g here are @eugenhu's functions from the answer below):

In [65]: %timeit f(a, b)
113 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [66]: %timeit g(a, b)
72.7 ms ± 1.59 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [67]: %timeit jpp(a, b)
3.15 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [68]: %timeit jpp2(a, b)
1.69 ms ± 23.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [69]: %timeit agn_val(a, b)
934 µs ± 9.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [70]: %timeit closest_val(a, b)
144 µs ± 3.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
AGN Gazer

Here's a partially vectorised (but still brute force) solution with NumPy. You should see large performance improvements versus any brute-force list-based method. Via sorting you can achieve O(n log n) time complexity, e.g. see this answer.

import numpy as np

a = np.array([1, 4, 11, 20, 25])
b = np.array([3, 10, 20])

res = [a[np.abs(a - i).argmin()] for i in b]

# [4, 11, 20]
jpp
  • Is there a way to do it without numpy? – Dawn17 Oct 14 '18 at 02:34
  • 4
    @Dawn17, Probably. But that wasn't your question: `I can do this in brute force, but what is a more efficient way to do this?` NumPy is a good way to *avoid* or *optimise* brute-force methods. – jpp Oct 14 '18 at 02:34
  • Sorry. Would be great if you can help me doing without it. – Dawn17 Oct 14 '18 at 02:35
  • _"Would be great if you can help me doing without it."_ @Dawn17 Why? – AGN Gazer Oct 14 '18 at 03:52
  • At the expense of memory usage, here is a variation that is somewhat faster: `a = np.asarray(a); return a[np.argmin(np.abs(np.subtract.outer(a, b)), axis=0)]` – AGN Gazer Oct 14 '18 at 04:50

One way could be to sort a and b first; then, for each b[i], find the closest element in a, call it a[j_i], and throw away the elements of a smaller than a[j_i] (i.e. a = a[j_i:]) before repeating for b[i+1]. Use whatever algorithm you want to find the element of a closest to a given value:

a = [1, 4, 11, 20, 25]
b = [3, 10, 20]

a_tmp = sorted(a)

# Sort `b` but keep a record of each element's original index.
b_tmp = sorted(enumerate(b), key=lambda x: x[1])

# Initialise an 'empty' output array.
out = [None]*len(b)

for i, v in b_tmp:
    # Throw away elements in `a_tmp` smaller than the "current closest element to `v`"
    # (`v` is the current element of `b` being considered) since they can't possibly
    # be closer to the next larger element in `b` than the
    # "current closest element to `v`".
    a_tmp = a_tmp[min(enumerate(a_tmp),
                      key=lambda x: abs(x[1]-v))[0]:]
    out[i] = a_tmp[0]

print(out)

The 'brute force' method list(map(lambda y:min(a, key=lambda x:abs(x-y)),b)) (from this answer) will be faster for small a, b lists (around len(a)=10, len(b)=5) since (I believe) it doesn't carry the overhead of first sorting the input lists.

Timings:

import random

param = 10000
a = [random.randint(-100*param,100*param) for i in range(param)]
b = [random.randint(-100*param,100*param) for i in range(param//100)]

def f(a,b):
    return [min(a, key=lambda x:abs(x-y)) for y in b]

def g(a,b):
    a = sorted(a)
    ib = sorted(enumerate(b), key=lambda x: x[1])
    out = [None]*len(b)

    for i, b_i in ib:
        a = a[min(enumerate(a),key=lambda ia: abs(ia[1]-b_i))[0]:]
        out[i] = a[0]

    return out

%timeit f(a,b)
%timeit g(a,b)

285 ms ± 26.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
172 ms ± 7.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using Numpy

import random
import numpy as np

param = 10000
a = [random.randint(-100*param,100*param) for i in range(param)]
b = [random.randint(-100*param,100*param) for i in range(param//100)]

# `f_np()` and `g_np()` expect regular Python lists as arguments and convert them
# to numpy arrays internally, returning the results as Python lists.

def f_np(a,b):  # from https://stackoverflow.com/a/52798995/8944057
    a = np.array(a)
    b = np.array(b)
    return [a[np.abs(a - i).argmin()] for i in b]

def g_np(a,b):
    a = np.sort(a)
    b_idx = np.argsort(b)
    out = [None]*len(b)

    for i in b_idx:
        a = a[np.abs(a - b[i]).argmin():]
        out[i] = a[0]

    return out

%timeit f_np(a,b)
%timeit g_np(a,b)

3.47 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.82 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Using NumPy is a lot faster; even the equivalent 'brute force' method is many times faster. But if you're willing to open yourself up to other libraries, then also see this answer.

eugenhu

Use binary search, assuming the lists are in order.

The brute force in this case is only O(n), so I wouldn't worry about it, just use brute force.

EDIT: yeah, it is O(len(a)*len(b)) (roughly O(n^2)), sorry, stupid mistake.

Since these aren't necessarily sorted, the fastest is still O(len(a)*len(b)) though. Sorting the lists (using timsort) would take O(n log n), then binary search O(log n), which results in O(n log^2 n) * O(n) = O(n^2 log^2 n), which is slower than just O(n^2).

Recessive
  • Brute force is O(len(a) * len(b)) – Karl Oct 14 '18 at 02:43
  • This is the only solution which suggests (in very raw terms) a way to avoid O(*mn*) time complexity. – jpp Oct 14 '18 at 02:57
  • You wouldn't multiply the time it takes to sort `a` by the time it takes to find the closest elements in `a` from `b` yeah? It should just be O(#a log #a) + O(#b log #a) since you're not sorting it each time you do the search. – eugenhu Oct 14 '18 at 04:51
  • yeh you're right. I need to stop multiplying things haha. – Recessive Oct 15 '18 at 04:34
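For illustration, here is a minimal sketch of this binary-search idea using only the standard library's bisect module (an illustrative sketch, not code from any answer here): sort a once, then for each value in b locate its insertion point and compare the two neighbouring candidates, which is O(|a| log |a| + |b| log |a|) overall.

import bisect

def closest_sorted(a, b):
    """For each value in b, return the closest value in a."""
    a_sorted = sorted(a)                      # O(|a| log |a|), done once
    out = []
    for v in b:
        i = bisect.bisect_left(a_sorted, v)   # first index with a_sorted[i] >= v
        # Candidates are the neighbours on either side of the insertion point.
        candidates = a_sorted[max(i - 1, 0):i + 1]
        out.append(min(candidates, key=lambda x: abs(x - v)))
    return out

print(closest_sorted([1, 4, 11, 20, 25], [3, 10, 20]))  # [4, 11, 20]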

The time complexity can be O(n log n) using lower_bound (the function from C++; it uses binary search to find an element in a sorted array).

I didn't find an available and suitable lower_bound function in Python, so I implemented it directly:

def lower_bound(sequence, value, compare):
    elements = len(sequence)
    offset = 0
    middle = 0
    found = len(sequence)

    while elements > 0:
        middle = elements >> 1
        if compare(value, sequence[offset + middle]) > 0:
            offset = offset + middle + 1
            elements = elements - (middle + 1)
        else:
            found = offset + middle
            elements = middle
    return found
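For example, with the comparator used below, lower_bound returns the index of the first element that is not less than the value (or len(sequence) if there is none):

>>> a = [1, 4, 11, 20, 25]
>>> lower_bound(a, 10, lambda x, y: x > y)
2
>>> lower_bound(a, 30, lambda x, y: x > y)
5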

Then invoke it:

a = [1, 4, 11, 20, 25]
b = [3, 10, 20]

a.sort()
re_a = a[::-1]

[min([a[lower_bound(a, i, lambda x, y: x > y) % len(a)],
      re_a[lower_bound(re_a, i, lambda x, y: x < y) % len(re_a)]],
     key=lambda num: abs(num - i)) for i in b]
# [4, 11, 20]

In every iteration over b, find the first element in a that is greater than or equal to the current element and the first element that is less than or equal to it, then compare the two and select the closer one.

So the time complexity will be O(n log n), better than brute force.

pwxcoo