
I have two lists. Both are sorted lists of numbers. Say:

A = [1.1, 5.2, 12.3, 12.6]
B = [2.3, 2.7, 5.2, 11.1, 12.1, 15.6, 16.6]

I would like to output in this case:

result = [[2.3], [0.4, 2.5], [5.9, 1]]

and an extra list:

remainder = [3.5, 1]

Where does this come from?

First consider the list of differences between consecutive values in B with an implicit zero added to the start.

[2.3, 0.4, 2.5, 5.9, 1, 3.5, 1]
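(Equivalently, this list is just the elementwise difference of B against a zero-prefixed copy of itself, e.g. this small sketch:)

B = [2.3, 2.7, 5.2, 11.1, 12.1, 15.6, 16.6]
diffs = [b - a for a, b in zip([0.0] + B, B)]
# diffs ≈ [2.3, 0.4, 2.5, 5.9, 1, 3.5, 1]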

We need to split this difference list up according to which value in B each value in A is closest to.

For each number in A, here is the nearest value in B and the corresponding entry in the difference list:

1.1 -> 2.3 -> 2.3
5.2 -> 5.2 -> 2.5
12.3 -> 12.1 -> 1
12.6 -> 12.1 -> 1

The rest goes into the remainder variable.
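To pin down exactly what I want, here is a brute-force (quadratic, so too slow) sketch that produces the outputs above; split_diffs is just an illustrative name:

import numpy as np

def split_diffs(A, B):
    # Differences between consecutive values of B, with the implicit zero at the start
    diffs = np.ediff1d(B, to_begin=B[0]).tolist()
    # Index of the nearest value in B for each value in A (brute force; ties go to the lower index)
    nearest = [min(range(len(B)), key=lambda j: abs(a - B[j])) for a in A]
    result, start = [], 0
    for c in sorted(set(nearest)):
        result.append(diffs[start:c + 1])
        start = c + 1
    return result, diffs[start:]

result, remainder = split_diffs(A, B)
# result ≈ [[2.3], [0.4, 2.5], [5.9, 1]], remainder ≈ [3.5, 1]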

I am looking for a fast (linear time) way to do this in Python. Any help is very much appreciated. I don't mind whether it uses numpy or not, whichever is faster.


My attempts:

I tried to solve this, but only via a convoluted route. First I make a mapping using:

import math
import numpy as np

def find_nearest(array, value):
    # array must be sorted; returns the element of array closest to value
    idx = np.searchsorted(array, value, side="left")
    if idx > 0 and (idx == len(array) or math.fabs(value - array[idx-1]) < math.fabs(value - array[idx])):
        return array[idx-1]
    else:
        return array[idx]
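For example, applying this per element of A should give the nearest values listed above:

B_arr = np.array(B)
nearest = [find_nearest(B_arr, a) for a in A]
# nearest == [2.3, 5.2, 12.1, 12.1]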

Then I use that to make:

[[2.3], [2.7, 5.2], [11.1, 12.1]] and [15.6, 16.6]

Then I make:

[[2.3], [0.4, 2.9], [5.9, 6.9]] and [3.5, 4.5]

Then finally I make [[2.3], [0.4, 2.5], [5.9, 1]] and [3.5, 1]
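In rough outline, the last two of those steps amount to something like this sketch (the variable names are just illustrative, and it assumes the grouping from the first step):

groups = [[2.3], [2.7, 5.2], [11.1, 12.1]]
rest = [15.6, 16.6]

# Subtract the last value of the previous group (0 for the first group)
offsets = [0.0] + [g[-1] for g in groups[:-1]]
shifted = [[x - off for x in g] for g, off in zip(groups, offsets)]
rest_shifted = [x - groups[-1][-1] for x in rest]

# Within each group keep the first value and take consecutive differences
result = [[g[0]] + [b - a for a, b in zip(g, g[1:])] for g in shifted]
remainder = ([rest_shifted[0]] + [b - a for a, b in zip(rest_shifted, rest_shifted[1:])]) if rest_shifted else []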

This is painful and bug prone and also not linear time overall.


Added example

A = [2.3, 2.7, 5.2, 11.1]
B = [2.3, 2.7, 5.2, 11.1]

result = [[2.3], [0.4], [2.5], [5.9]]
remainder = []
Simd

2 Answers


This can be done in a very explicit way by splitting the task into two parts: matching the closest number and building the ranges.

Firstly, the code goes through both arrays linearly and chooses the closest number in B for each number in A. Then, the code transforms the structure to the required output of ranges of adjacent numbers and filters out the ranges without any match:

import numpy as np

A = [1.1, 5.2, 12.3, 12.6]
B = [2.3, 2.7, 5.2, 11.1, 12.1, 15.6, 16.6]

# For each number in B, this will hold the numbers in A that are closest to it
matches = [[] for _ in B]

i = 0
for num in A:
    # Advance while the next number in B is closer to num than the current one
    # (this relies on both arrays being sorted)
    while i < len(B) - 1 and abs(num - B[i]) > abs(num - B[i + 1]):
        i += 1
    matches[i].append(num)

# Unite the pairs so each range has a list of matching numbers
matches = [[matches[0]]] + [l1+l2 for l1, l2 in zip(matches[1::2], matches[2::2])]

# Create a list of diffs and pair them into ranges
diffs = (np.array(B[1:]) - np.array(B[:-1])).tolist()
ranges = [[B[0]]] + list(map(list, zip(diffs[::2], diffs[1::2])))

# Output only the ranges that had at least a single match in A
ranges_with_numbers = [num_range for num_range, range_matches in zip(ranges, matches)
                       if len(range_matches) > 0]
remainder = [num_range for num_range, range_matches in zip(ranges, matches)
             if len(range_matches) == 0]
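For the example lists in the question, a quick check should give something like the following, up to floating point rounding; note that remainder here comes out as a list of the leftover ranges rather than a flat list:

print(ranges_with_numbers)  # [[2.3], [0.4, 2.5], [5.9, 1]] (approximately)
print(remainder)            # [[3.5, 1]] -- the unmatched ranges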

The complexity is O(n): the matching phase scans each array just once, and so does the transform phase.

Elisha
  • This is nice but the output is a list that mixes tuples and lists. Also, how would you get the `remainder` list? – Simd Jul 20 '19 at 09:06
  • @Anush, converting tuples to lists is trivial in Python, so I updated the code accordingly. To get the remainder, take the ranges without any match, i.e. the ones that match none of the numbers in A; I updated the code for that as well. Hope it helps. – Elisha Jul 20 '19 at 10:31
  • I think this has a bug. Take `A = [0.018585633436070337, 0.5278114398223964, 0.797663687548349, 0.8887323645715489]` `B = [0.6186489064266478, 0.8154860448456667, 0.8696618132500026]`. `ranges_with_numbers` should be `[[0.6186489064266478], [0.19683713841901884], [0.05417576840433591]]` – Simd Jul 22 '19 at 18:41
  • Instead it gives `[[0.6186489064266478], [0.19683713841901884, 0.05417576840433591]]` – Simd Jul 22 '19 at 18:42
  • I did not understand the original requirements. I thought that all segments should consist of two consecutive numbers, other than the first one. I guess that is not the case; can you please explain the segments requirement again? – Elisha Jul 23 '19 at 06:02
  • I added another example which I hope helps. – Simd Jul 23 '19 at 08:17
  • @Anush, just to make sure - it means that if a number from `B` has a single match and the matching number equals the number in `B`, then the range has a single number, right? – Elisha Jul 25 '19 at 10:28
  • Could you give an example of what you have in mind? – Simd Jul 25 '19 at 10:51

Here's one based on `np.searchsorted` -

import numpy as np

# https://stackoverflow.com/a/45350318/ Variant for already sorted B
def closest_argmin_sortedB(A, sorted_B):
    # Index of the closest value in sorted_B for each value in A
    L = len(sorted_B)
    sorted_idx = np.searchsorted(sorted_B, A)
    sorted_idx[sorted_idx==L] = L-1
    mask = (sorted_idx > 0) & \
           (np.abs(A - sorted_B[sorted_idx-1]) < np.abs(A - sorted_B[sorted_idx]))
    return sorted_idx - mask

A = np.asarray(A)
B = np.asarray(B)
d = np.ediff1d(B, to_begin=B[0])              # diffs with the implicit leading zero
idx = closest_argmin_sortedB(A, B)            # nearest index in B for each value in A
idxf = idx[np.r_[True, idx[:-1] != idx[1:]]]  # drop consecutive duplicate indices

p = np.split(d, idxf+1)                       # split the diffs after each matched index
res, remainder = p[:-1], p[-1]
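For the sample inputs from the question, a quick check should print something close to the target output:

print([r.tolist() for r in res])  # [[2.3], [0.4, 2.5], [5.9, 1]] (approximately)
print(remainder.tolist())         # [3.5, 1] (approximately)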

On bigger lists, for a performance boost, we can use zipping with slicing to split the array data instead of np.split. Hence, the last two steps could be replaced by -

s = np.r_[0,idxf+1,len(d)]
res,remainder = [d[i:j] for (i,j) in zip(s[:-2],s[1:-1])], d[s[-2]:s[-1]]
Divakar
  • This works nicely although I am not sure if it is linear time. Why did you use ediff1d instead of diff? – Simd Jul 20 '19 at 09:10
  • @Anush `ediff1d` allows us to specify the starting element. And one of the main tools used here to get those closest indices is `searchsorted`, which as far as Python is concerned is vectorized, so it should be pretty efficient for that step. Just made an update. Can you test it out on your actual dataset and let us know how it performs? – Divakar Jul 20 '19 at 09:53