6

Suppose I have a certain list x with numbers, and another list y with other numbers. Elements of y should be elements of x, but due to noise in measurements, they are kind of different. I want to find, for each value of y, the value of x that is the nearest to it.

I can do this with some loops and check, for each element y[i], which element x[j] minimizes abs(x[j]-y[i]), but I'm pretty sure there is a much easier, cleaner way to do this. The lists could be huge so I'm looking for efficient code here.

The code I've written so far is:

x_in = [1.1, 2.2, 3, 4, 6.2]
y_in = [0.9, 2, 1.9, 6, 5, 6, 6.2, 0.5, 0, 3.1]
desired_output = [1.1, 2.2, 2.2, 6.2, 4, 6.2, 6.2, 1.1, 1.1, 3]

y_out = []

for y in y_in:
    aux = [abs(l - y) for l in x_in]
    mn,idx = min( (aux[i],i) for i in range(len(aux)) )
    y_out.append(x_in[idx])

>>> y_out == desired_output
True

But I don't know if there is a more efficient way to do this...

EDIT:

Due to my ignorance, I forgot to clarify something that may be of relevance according to the comments I've recieved.

  • The x list is sorted.
  • x is the only list that can have a pretty big size: between 500,000 and 1,000,000 elements, in general. y will in general be really small, less than 10 elements.
smci
  • 32,567
  • 20
  • 113
  • 146
Tendero
  • 1,136
  • 2
  • 19
  • 34
  • 2
    How long are `x` and `y`? The loops and check will be polynomial complexity, which is not very good. If performance is important, you could probably get it better with an interval tree. – wim Jul 18 '18 at 20:56
  • 2
    A straight forward approach would be to sort both arrays, then step through `x` until you find an element `e` larger than the current element in `y`, then take the closer of the two (`e` or the element that proceeds it). Continue from that position in `x` until all of `y` is processed, sort of like mergesort. – Dillon Davis Jul 18 '18 at 21:04
  • 1
    @user3483203 I've added my attempt to the question. – Tendero Jul 18 '18 at 21:08
  • How huge is huge? I'd expect wim's interval tree to scale best, but it's a lot of set-up – Useless Jul 18 '18 at 21:16
  • 1
    When you say *"the lists could be huge"*, do you mean the lengths X, Y or both? Anyway, two lists are the wrong data structure to insert into. Use two trees instead (or heap). Then both structures will be sorted by default, and can trivially easily find their (predecessor and successor) neighbors. The rest is trivial. – smci Jul 18 '18 at 21:49
  • If you really want a scalable solution with a big-O estimate, please say that in the question. I tagged this [tag:big-o]. Let us call X, Y the lengths(/sizes) of the structures. Which in general do you expect to dominate: X or Y? (Presumably Y, since there is measurement noise) – smci Jul 18 '18 at 21:51
  • do orders of `x`, `y` matter? – Azat Ibrakov Jul 18 '18 at 22:10
  • @Useless I've added the desired information to the question. Sorry for not stating it before, I didn't know it would be relevant (my bad!). – Tendero Jul 18 '18 at 22:59
  • @smci Only X. I've edited the question! – Tendero Jul 18 '18 at 23:00
  • Do you know the contents of x before you start measuring the y? – smci Jul 18 '18 at 23:10
  • @smci Yes, when one value of `y` arrives, the whole list `x` is determined beforehand. – Tendero Jul 18 '18 at 23:12

6 Answers6

2

Given that x is sorted, the most efficient way to do this is using bisect to search for the closest value. Just create a list of mid points between the x values and run bisect on those:

In [69]: mid_points = [(x1+x2)/2 for x1, x2 in zip(x[1:], x[:-1])]

In [70]: mid_points
Out[70]: [1.5, 2.5, 3.5, 4.5]

In [72]: [x[bisect.bisect(mid_points, v)] for v in y]
Out[72]: [1, 1, 4, 5, 2]

This will run in O(Mlog(N)+N) time where `M=len(y), N=len(x)

(For python2 do from __future__ import division or use float(x1+x2)/2 in the mid_points calculation)

kuppern87
  • 1,125
  • 9
  • 14
  • This is really witty, but I've just tried it and I'm not getting the desired result for the example in the question (the second one). The last element should be 3, and your script is returning 4. – Tendero Jul 18 '18 at 23:06
  • I've found the mistake. Due to `x` having two integers, when you do `(3+4)/2` you get `3`, not `3.5`. If you perform the conversion you get the desired result, and your code clearly outperforms mine and the rest in the other answers. Thanks. – Tendero Jul 18 '18 at 23:28
  • 1
    @Tendero Guess you're on Python 2. You could use `from __future__ import division` to avoid that. – wim Jul 19 '18 at 01:26
  • This will be slower than 1x through the list which is possible – dawg Jul 19 '18 at 03:10
1

You can do this quickly with a lambda function and list comprehension:

[min(x, key=lambda x:abs(x-a)) for a in y]

This will work with floats, integers, etc.

dpwilson
  • 997
  • 9
  • 19
  • 1
    I have no clue. Any constructive criticism, please? – dpwilson Jul 18 '18 at 21:17
  • It's the same as what the OP already had, therefore "not useful". – wim Jul 18 '18 at 21:24
  • A clean, readable answer with actual code is not "the same" as his description of "some loops". – dpwilson Jul 18 '18 at 21:28
  • The question is explicitly asking for greater efficiency, and this answer provides the same complexity. Shorter code is nice, but somewhat missing the point IMO. – wim Jul 18 '18 at 21:30
  • Two things here: 1) Efficiency was never mentioned in OP's original text. Check the edit history. 2) "Easier" and "cleaner" were the two explicit requests. – dpwilson Jul 18 '18 at 21:31
  • OK, that's fair. But now the question was edited to clarify, perhaps you could edit (or delete) the answer. – wim Jul 18 '18 at 21:33
0

So this is something quick I made up that just gets all the differences and than sorts them from least to greatest. Takes the lowest difference, and goes from there.

x = [1, 2, 3, 4, 5]
y = [1.1, 1.2, 3.6, 6.2, 2.1]

for y_index in range(len(y)):
    value_and_index= {}
    for x_index in range(len(x)):
        difference= y[y_index]-x[x_index]
        difference= difference*-1 if difference<0 else difference
        value_and_index[difference]= x_index
    y[y_index]= x[value_and_index[sorted(value_and_index.keys())[0]]]

print y # [1, 1, 4, 5, 2]

Hope this helps, happy coding!

wowwee
  • 176
  • 1
  • 2
  • 12
0

My attempt:

First I sort the X array (if it isn't already sorted). The loop goes through each y and compute absolute value for each x, until this absolute value is higher than previous, then stop the for loop (because array X is sorted):

x = sorted([1, 2, 3, 4, 5])
y = [1.1, 1.2, 3.6, 6.2, 2.1]

out = []
while y:
    current_value = y.pop()
    current_min = float('inf')
    current_x_value = None
    for v in x:
        temp_min = abs(current_value - v)
        if temp_min < current_min:
            current_min = temp_min
            current_x_value = v
        if temp_min > current_min:  # no need to iterate further, X is sorted
            break
    out.insert(0, current_x_value)
print(out)

Outputs:

[1, 1, 4, 5, 2]
Andrej Kesely
  • 168,389
  • 15
  • 48
  • 91
  • It is better to sort *both* arrays and step through them with *two* moving iterators. – wim Jul 18 '18 at 21:27
0

If x is sorted, use bisect:

import bisect 
test_out=[]
max_x=max(x)
min_x=min(x)
for f in y:
    if f>=max_x:
        idx=-1
    elif f<=min_x:
        idx=0
    else:
        idx=bisect.bisect_left(x,f)
        if abs(x[idx-1]-f)<abs(x[idx]-f):
            idx-=1
    test_out.append(x[idx])

>>> test_out==desired_output
True
dawg
  • 98,345
  • 23
  • 131
  • 206
  • It's possible for `idx=bisect.bisect_left(x,f)` to return 0, and then the indexing on the next line unintentionally wraps around. – wim Jul 18 '18 at 22:26
  • That case is handled with `f – dawg Jul 18 '18 at 22:32
  • Nope. It would have to be `f<=min_x`. – wim Jul 18 '18 at 22:38
  • Hmm. As I read [bisect.left](https://docs.python.org/2/library/bisect.html#bisect.bisect_left) it states `all(val < x for val in a[lo:i])` for `bisect.left` and `all(val <= x for val in a[lo:i])` for [bisect.right](https://docs.python.org/2/library/bisect.html#bisect.bisect_right) (or `bisect.bisect` - same). I suppose that `elif f<=min_x:` fixes, no? Thanks for paying attention! – dawg Jul 18 '18 at 23:01
  • To put it in plain english, bisection returns the index of where to insert the new element. In case of a tie, `bisect_left` would insert it to the left of the equal element, and `bisect_right` to the right. `f<=min_x` prevents the possibility of a tie occurring at index 0, so yes, it fixes. – wim Jul 19 '18 at 01:22
0

With next assumptions:

  • order of results doesn't matter,

  • we are using Python 3.3+.

pretty straightforward solution may look like

from itertools import repeat


def evaluate(expected_values, measurements):
    if not expected_values:
        raise ValueError('Expected values should be a non-empty sequence.')
    expected_values = sorted(expected_values)
    measurements = sorted(measurements)
    expected_iter = iter(expected_values)
    left_value = next(expected_iter)
    try:
        right_value = next(expected_iter)
    except StopIteration:
        # there is only one expected value
        yield from repeat(left_value,
                          len(measurements))
        return
    for evaluated_count, measurement in enumerate(measurements):
        while measurement > right_value:
            try:
                left_value, right_value = right_value, next(expected_iter)
            except StopIteration:
                # rest of the measurements are closer to max expected value
                yield from repeat(right_value,
                                  len(measurements) - evaluated_count)
                return

        def key(expected_value):
            return abs(expected_value - measurement)

        yield min([left_value, right_value],
                  key=key)

For Python3.3- we can replace

yield from repeat(object_, times)

with for-loop like

for _ in range(times):
    yield object_

Test

>>> x_in = [1.1, 2.2, 3, 4, 6.2]
>>> y_in = [0.9, 2, 1.9, 6, 5, 6, 6.2, 0.5, 0, 3.1, 7.6, 10.4]
>>> y_out = list(evaluate(x_in, y_in))
>>> y_out
[1.1, 1.1, 1.1, 2.2, 2.2, 3, 4, 6.2, 6.2, 6.2, 6.2, 6.2]
Azat Ibrakov
  • 9,998
  • 9
  • 38
  • 50