
I have a performance issue while "filtering" an array according to the closest float found in another array.

This is a MWE of the problem:

import numpy as np

def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N).tolist()

# Data lists.
N1 = 1500
list1 = [random_data(N1), random_data(N1), random_data(N1)]
list2 = random_data(1000)

# Define list1's range.
min_1, max_1 = min(list1[2]), max(list1[2])

# This list will contain the "filtered" list1.
list4 = [[], [], []]

# Go through each element in list2.
for elem2 in list2:

    # If it is located within the list1 range.
    if min_1 <= elem2 <= max_1:

        # Find the closest float in sub-list list1[2] to this float
        # in list2.
        indx, elem1 = min(enumerate(list1[2]), key=lambda x: abs(x[1] - elem2))

        # Store the values in list1 that are associated with the closest float
        # found above.
        list4[0].append(list1[0][indx])
        list4[1].append(list1[1][indx])
        list4[2].append(elem1)

(note that list2 contains fewer elements than list1[2], which is the sub-list I compare it to)

This block works as expected but it is terribly inefficient. I'm certain that the answer lies in the correct application of broadcasting and numpy arrays, but I still haven't managed to get the hang of the former enough to apply it to my problem.

Since I'm after enhancing the performance of this code, any solution will do (i.e., I'm not necessarily bound to an answer that uses broadcasting).


Add

For reference, in a similar question I asked some time ago, Fast weighted euclidean distance between points in arrays, user ali_m used broadcasting to achieve an amazing improvement in performance.

The question is not exactly the same (it used euclidean distance there instead of absolute value, and the distances in that question also had to be weighted), but this question seems to me even simpler than that one.

Couldn't the broadcasting solution ali_m applied to that issue be applied to this one?
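
If it helps, this is roughly the vectorized version I have in mind, sketched from that answer and continuing from the MWE above (just a sketch, untested; it omits the range check from the loop and builds the full matrix of absolute differences, so it trades memory for speed):

arr1 = np.array(list1)   # shape (3, N1)
arr2 = np.array(list2)   # shape (1000,)

# All pairwise absolute differences via broadcasting: shape (1000, N1).
diff = np.abs(arr2[:, np.newaxis] - arr1[2])

# For each element of arr2, the index of the closest float in list1[2].
closest = diff.argmin(axis=1)
list4 = arr1[:, closest]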


Add 2

The answer given by user2357112, with the correction by Eelco Hoogendoorn, works very well for my initially defined code. I just realized that I over-simplified it: in my actual code the lists list1[2] and list2 are not necessarily defined in the same range. This would be a more accurate representation of that (it should replace the first lines in the MWE above):

def random_data(N, xi, xf):
    # Generate some random data.
    return np.random.uniform(xi, xf, N).tolist()

# Data lists.
N1 = 1500
list1 = [random_data(N1, 13., 20.), random_data(N1, -1., 4.), random_data(N1, 2., 7.)]
list2 = random_data(1000, 0., 10.)

Now the range of list1[2] is not equal to the range of list2, and thus the answer given fails to reject those points i for which list2[i] > max(list1[2]) or list2[i] < min(list1[2]).

Could the answer be modified to consider this possibility? I'm very sorry for changing the original code like this; it truly slipped by me.
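
In NumPy terms, I imagine the rejection step would be a boolean mask along these lines (again just a sketch of what I mean):

arr2 = np.array(list2)
ref = np.array(list1[2])

# Keep only the elements of list2 that fall inside the range of list1[2].
arr2_in = arr2[(arr2 >= ref.min()) & (arr2 <= ref.max())]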

– Gabriel
  • This probably belongs on Code Review ... but just my 2c: why don't you sort list1[2]? Then min_1, max_1 are `[0]` and `[-1]`, and that will let you use binary search to locate the closest value faster ... I think – Joran Beasley Jan 27 '14 at 18:06
  • @JoranBeasley Sorting is `O(NlogN)` compared to the `O(N)` time taken by `min` and `max`, so `min` and `max` are going to be faster here (at least theoretically). And binary search won't work here, as the OP is looking to find the index of the item with the minimum absolute difference. – Ashwini Chaudhary Jan 27 '14 at 18:21
  • Ok, fair points ... It still seems like he would be able to find the abs difference faster in a sorted list ... and potentially save enough time to make the sorting worth it ... maybe ... – Joran Beasley Jan 27 '14 at 18:51
  • Binary search can find the minimum absolute difference, no problem. – user2357112 Jan 27 '14 at 18:53

2 Answers


A k-d tree is really overkill here; all you need to do is sort the array and use binary search to find the closest value in the sorted array. I wrote an answer a while back about how to use searchsorted to find the closest value to a target in an array. You can use the same idea here:

import numpy as np

def find_closest(A, target):
    # A must be sorted.
    idx = A.searchsorted(target)
    # Clip so that idx - 1 and idx are both valid indices.
    idx = np.clip(idx, 1, len(A) - 1)
    left = A[idx - 1]
    right = A[idx]
    # Step back to the left neighbour wherever it is strictly closer.
    idx -= target - left < right - target
    return idx

def random_data(shape):
    # Generate some random data.
    return np.random.uniform(0., 10., shape)

def main(data, target):
    # Sort the reference row data[2] and remember the ordering.
    order = data[2, :].argsort()
    key = data[2, order]
    # Reject targets that fall outside the range covered by key.
    target = target[(target >= key[0]) & (target <= key[-1])]
    closest = find_closest(key, target)
    return data[:, order[closest]]

N1 = 1500
array1 = random_data((3, N1))
array2 = random_data(1000)
# Plant two out-of-range values to check that they get rejected.
array2[[10, 20]] = [-1., 100]

array4 = main(array1, array2)
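
As a quick sanity check, the result can be compared against a brute-force broadcast scan (a sketch; on an exact tie between two equidistant neighbours the two approaches could in principle pick different matches):

# Recompute the in-range targets exactly as main() does.
kept = array2[(array2 >= array1[2].min()) & (array2 <= array1[2].max())]

# Brute force: the full (len(kept), N1) matrix of absolute differences.
brute = array1[2][np.abs(kept[:, None] - array1[2]).argmin(axis=1)]

# The third row of array4 holds the matched values from array1[2].
assert np.allclose(brute, array4[2])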
– Bi Rico
  • The k-d tree is indeed overkill. I went for it because it offered a convenient interface and much less chance of subtle off-by-one errors than if I went for `searchsorted`; `searchsorted` or something similar was the backup plan for if the k-d tree was too slow. – user2357112 Jan 27 '14 at 20:50
  • Amazing answer, even faster than the one by user2357112! I don't understand the next-to-last line, `array2[[10, 20]] = [-1., 100]`; what do you do that for? – Gabriel Jan 28 '14 at 13:10
  • It replaces `array2[10]` and `array2[20]` with -1 and 100. They're there so I can check the result and make sure those values got rejected. If I had used the code from your second update, I wouldn't have needed this line. – Bi Rico Jan 28 '14 at 17:12
  • Great, I see now the need for that replacement. I'm selecting this answer because it gives a ~3x increase in speed over user2357112's answer. Thank you both very much, guys! – Gabriel Jan 28 '14 at 19:34

If you have SciPy, a scipy.spatial.cKDTree can do the job:

import numpy
import scipy.spatial

array1 = numpy.array(list1)
array2 = numpy.array(list2)

# A tree optimized for nearest-neighbor lookup
tree = scipy.spatial.cKDTree(array1[2, ..., numpy.newaxis])

# The distances from the elements of array2 to their nearest neighbors in
# array1, and the indices of those neighbors.
distances, indices = tree.query(array2[..., numpy.newaxis])

array4 = array1[:, indices]

k-d trees are designed for multidimensional data, so this might not be the fastest solution, but it should be pretty darn fast compared to what you have. The k-d tree expects input in the form of a 2D array of points, where data[i] is a 1D array representing the ith point, so the slicing expressions with newaxis are used to put the data into that format. If you need it to be even faster, you could probably do something with numpy.sort and numpy.searchsorted.
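
For instance, the reshaping alone looks like this (illustrative variable names; shapes shown for the sizes used above):

points = array1[2, ..., numpy.newaxis]   # shape (N1, 1): N1 one-dimensional points
queries = array2[..., numpy.newaxis]     # shape (1000, 1): one query point per row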

If you need to reject data from list2 that falls outside the range of values given by list1[2], that can be accomplished by a preprocessing step:

lowbound = array1[2].min()
highbound = array1[2].max()

# Keep only the query points that fall inside the range of array1[2].
querypoints = array2[(array2 >= lowbound) & (array2 <= highbound)]
distances, indices = tree.query(querypoints[..., numpy.newaxis])
array4 = array1[:, indices]
– user2357112
  • This returns an error: `ValueError: need more than 1 value to unpack` in the line `tree = scipy.spatial.KDTree(array1[2])`. – Gabriel Jan 27 '14 at 18:17
  • You can test it here: [Try IPython from browser](https://www.pythonanywhere.com/try-ipython/) :) – Ashwini Chaudhary Jan 27 '14 at 18:17
  • Okay, fixed and tested. Theoretically, it should work. Probably needs more explaining, though. – user2357112 Jan 27 '14 at 18:38
  • Thank you @user2357112, it works now indeed. I tested it on my system and it gives a ~2.5x increase with `N=1500`. It's still not enough for my needs, but if there's no better answer in a few days I'll mark this one as accepted. Thanks again! – Gabriel Jan 27 '14 at 18:40
  • @Gabriel: Not enough? I guess the overhead of the KD-tree implementation counteracted a lot of the benefit of the better asymptotic runtime. I'll try to write something up using `numpy.digitize` and see if that works better. – user2357112 Jan 27 '14 at 18:49
  • @user2357112 Unfortunately no. I'm running this several thousand times in a larger code and I was after a ~10x increase _at least_; otherwise my code becomes almost inapplicable :( – Gabriel Jan 27 '14 at 18:51
  • scipy.spatial.cKDTree will likely give you that 10x increase. Notice the extra 'c' there. – Eelco Hoogendoorn Jan 27 '14 at 19:00
  • @EelcoHoogendoorn: Oh, that's what the c means? Should've realized. The documentation doesn't seem to say anything about the difference between the classes. cKDTree ought to be much faster, then. – user2357112 Jan 27 '14 at 19:03
  • Actually the speed-up is ~100x! Amazing, this is what I needed, thank you so much to both of you! – Gabriel Jan 27 '14 at 19:13
  • @user2357112 Would you mind taking a look at the **Add 2** part of the question to see if your code is modifiable to that case and not the one I originally posted? If it can't be, I'll still mark your answer as accepted since it correctly solved the original issue. Thank you! – Gabriel Jan 27 '14 at 20:28
  • @Gabriel: Modification made. You should also take a look at Bi Rico's answer, if you haven't already. It's likely to run faster than mine, though I find the techniques used to be a bit more bug-prone. – user2357112 Jan 28 '14 at 03:32
  • Thank you very much @user2357112! Indeed Bi Rico's answer is ~3x faster than this one, so I'm marking that one as accepted since I was after performance. Cheers! – Gabriel Jan 28 '14 at 13:08