
I have a very large ndarray A, and a sorted list of points k (a small list, about 30 points).

For every element of A, I want to determine the closest element in the list of points k, together with the index. So something like:

>>> A = np.asarray([3, 4, 5, 6])
>>> k = np.asarray([4.1, 3])
>>> values, indices
[3, 4.1, 4.1, 4.1], [1, 0, 0, 0]

Now, the problem is that A is very large (around 5 million elements), so I can't do something inefficient like adding one dimension to A, taking the absolute difference to k, and then taking the minimum along the new axis.
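For concreteness, this is the broadcasting version I want to avoid (a sketch reusing the small A and k above; the temporary matrix has shape (len(A), len(k)), which does not fit in memory for my real A):

diff = np.abs(A[:, None] - k[None, :])  # temporary of shape (len(A), len(k))
indices = np.argmin(diff, axis=1)       # [1, 0, 0, 0]
values = k[indices]                     # [3., 4.1, 4.1, 4.1]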

For now I have been using np.searchsorted, as shown in the second answer here: Find nearest value in numpy array, but even this is too slow. This is the code I used (modified to handle multiple values in k):

def find_nearest(A, k):
    # k must be sorted for searchsorted to work
    indicesClosest = np.searchsorted(k, A)
    # clip indices that fell past the end of k
    flagToReduce = indicesClosest == k.shape[0]
    modifiedIndicesToAvoidOutOfBoundsException = indicesClosest.copy()
    modifiedIndicesToAvoidOutOfBoundsException[flagToReduce] -= 1
    # also step back whenever the left neighbour is closer
    flagToReduce = np.logical_or(flagToReduce,
                     np.abs(A - k[indicesClosest - 1]) <
                     np.abs(A - k[modifiedIndicesToAvoidOutOfBoundsException]))
    flagToReduce = np.logical_and(indicesClosest > 0, flagToReduce)
    indicesClosest[flagToReduce] -= 1
    valuesClosest = k[indicesClosest]
    return valuesClosest, indicesClosest
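Note that this requires k to be sorted (np.searchsorted assumes it). If k isn't sorted, one can sort it first and map the returned indices back; a quick sketch with the A and k from above:

>>> order = np.argsort(k)
>>> values, idx_sorted = find_nearest(A, k[order])
>>> values, order[idx_sorted]
(array([ 3. ,  4.1,  4.1,  4.1]), array([1, 0, 0, 0]))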

I then thought of using scipy.spatial.KDTree:

>>> d = scipy.spatial.KDTree(k[:, None])  # KDTree expects 2-D data, hence the reshape
>>> d.query(A[:, None])

This turns out to be much slower than the searchsorted solution.

On the other hand, the array A is always the same; only k changes. So it would be beneficial to build some auxiliary structure (like an "inverse KDTree") on A once, and then query it with the small array k.

Is there something like that?

Edit

At the moment I am using a variant of np.searchsorted that requires the array A to be sorted. We can do this in advance as a pre-processing step, but we still have to restore the original order after computing the indices. This variant is about twice as fast as the one above.

A = np.random.random(3000000)
k = np.random.random(30)

# pre-processing, done once per A
indices_sort = np.argsort(A)
sortedA = A[indices_sort]
inv_indices_sort = np.argsort(indices_sort)  # permutation that restores the original order

k.sort()


def find_nearest(sortedA, k):
    # boundaries between consecutive points of k: everything below
    # midpoints[i] is closer to k[i] than to k[i+1]
    midpoints = k[:-1] + np.diff(k)/2
    idx_aux = np.searchsorted(sortedA, midpoints)
    idx = []
    count = 0
    final_indices = np.zeros(sortedA.shape, dtype=int)
    old_obj = None
    # keep only the midpoints that actually split sortedA at a new position
    for obj in idx_aux:
        if obj != old_obj:
            idx.append((obj, count))
            old_obj = obj
        count += 1
    # fill each segment of sortedA with the index of its nearest point in k
    old_idx = 0
    for idx_A, idx_k in idx:
        final_indices[old_idx:idx_A] = idx_k
        old_idx = idx_A
    final_indices[old_idx:] = len(k)-1

    indicesClosest = final_indices[inv_indices_sort]  #<- this takes 90% of the time
    return k[indicesClosest], indicesClosest

The line that takes so much time is the line that brings the indices back to their original order.
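To illustrate, the unsorting is just an application of the inverse permutation of indices_sort; a small self-contained example (the inverse can also be built in O(n) rather than with a second argsort, although the fancy indexing itself stays the expensive part):

a = np.array([30, 10, 20])
order = np.argsort(a)                # [1, 2, 0]; a[order] is sorted
inv = np.argsort(order)              # [2, 0, 1]; the inverse permutation
assert np.array_equal(a[order][inv], a)

inv2 = np.empty_like(order)          # same inverse permutation, built in O(n)
inv2[order] = np.arange(len(order))
assert np.array_equal(inv, inv2)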

  • You have more than one `value`. So, are you looping when using `searchsorted`? Show your searchsorted attempt? Or did you use this code - https://stackoverflow.com/a/26026189/? – Divakar Oct 11 '17 at 17:04
  • Please be more specific than "very very large". Give a typical size of `A`. – Warren Weckesser Oct 11 '17 at 17:05
  • @Divakar Yes, I used that code :) I will edit it in – Ant Oct 11 '17 at 17:16
  • @WarrenWeckesser I have about 20 arrays A with an average of 5 million elements each. Some bigger, some smaller. I need to do this for every array A. – Ant Oct 11 '17 at 17:17
  • Don't think that's your attempt because it doesn't work with multiple values in `k`. – Divakar Oct 11 '17 at 17:19
  • @Divakar You're right, I edited. It is similar, I just take care of possible out of bounds exception. I don't like the copying of the array but I am not sure how to avoid it. In any case I don't think that's the bottleneck – Ant Oct 11 '17 at 17:25
  • See if these help out - https://stackoverflow.com/a/37842324/, https://stackoverflow.com/a/45350318/ to improve upon existing searchsorted attempts. – Divakar Oct 11 '17 at 17:54
  • Depending on the dimensionality of the problem, you could make a bitmap image/array covering the range of values with sufficient resolution to differentiate values of k. The bitmap contains the index of the nearest value of k (a Voronoi tessellation). You look up the values of A in the bitmap to get the nearest k. Depending on the bitmap resolution and problem, you may also have to check that the adjacent values in the bitmap are the same, and if they are not (i.e. on an edge) individually check which is the closest of k for those values. – xioxox Oct 12 '17 at 07:34
  • @Divakar Thank you! The first solution is indeed about 20% faster than the searchsorted solution I implemented. In the meanwhile I found a much faster solution on the scipy mailing list, that is 4.3x faster than what I had. I am still searching for something faster, then I will post an answer with the algorithm – Ant Oct 12 '17 at 10:56
  • @xioxox Thank you for your answer! I am not sure what you mean exactly. Something like using a Voronoi tessellation, finding the simplex each point is in, and then manually checking each vertex in the tessellation to return the one with the smallest distance? If so, are there already implemented functions that do that? Because running a for loop in Python over 5 million points will most likely be slower than searchsorted – Ant Oct 12 '17 at 10:58
  • I don't think there's anything built-in. I meant to just do an array lookup in the nearest k table, where that is tabulated with some discrete bin size. This would only work if the number of entries in the table was much less than the number of values in A. Presumably it would only work for low dimensions. – xioxox Oct 12 '17 at 13:00
  • `only k changes` Will there be *many* queries with unrelated `k`s? (Say, more than 22 if `A.len()` typically is 5 million?) Is it admissible to order the elements in each `A`? – greybeard Oct 12 '17 at 18:02
  • @greybeard Each k will typically be used only once, but successive k will not differ by a lot, i.e. the second k is not too different from the first, the third not too different from the second, etc. Overall I call this thousands of times. Yes, it's admissible to order the elements of A, but then we have to revert the order. One of the ideas I was exploring was saving the sorted array A, then using np.searchsorted(A, k) to get the required information, then "unsorting" the indices to get back to the original order. It turns out this is much faster, but still not fast enough. (to continue....) – Ant Oct 12 '17 at 18:33
  • (continued) and the surprising thing is that 90% of the time is spent unsorting the array, i.e. calling indices_computed_on_sorted_A[indices_to_bring_original_order]. This is the motivation for this question: https://stackoverflow.com/questions/46713121/slicing-repeadlty-with-the-same-slice-numpy where I ask if there is an efficient way of slicing different vectors using the same indices, since *indices_to_bring_original_order* does not change – Ant Oct 12 '17 at 18:34
  • A variant of `searchsorted()` should be possible that takes advantage of sorted `A` in `np.searchsorted(k, A)`. I'll have to think about an efficient implementation of "unsorting"/"the reverse permutation to sorting `A`". (And try to understand `calling indices_computed_on_sorted_A[indices_to_bring_original_order]`.) – greybeard Oct 12 '17 at 20:11
  • @greybeard Please see my edit :) – Ant Oct 13 '17 at 06:50
  • `[thousands of] successive k will not differ by a lot` - there may be a _huge_ win waiting here: imagine having the k₁, values₁ and indices₁ and just updating the indices in the intervals "between idx_aux₁ and idx_aux₂". – greybeard Oct 13 '17 at 08:43
  • @greybeard Uhm this could be a great idea actually, but could you clarify what you mean with "update the indices between idx_aux_1 and idx_aux_2"? – Ant Oct 13 '17 at 09:50
  • @greybeard actually, very nice suggestions. It goes twice as fast now. I posted the implementation and some speed tests below, if you're interested :) – Ant Oct 13 '17 at 14:28
  • @Divakar I have added an implementation and some tests, if you're interested :) – Ant Oct 13 '17 at 14:28
  • (I'm almost sorry for not taking the time to work out and gauge ideas myself - I claim life getting in the way.) Another "fundamental" question vexing me: how many accesses will there be to the results (`values, indices`)? And will there be any *write* access? (If the number of accesses was *huge*, I can hardly imagine the part discussed in this question to be time critical. What proportion of the results will never be read at all? What is the total/average number of accesses?) Pondering *lazy evaluation* here (and dominance of memory hierarchy effects). – greybeard Oct 14 '17 at 09:59
  • @greybeard values are used twice, indices are used once. Every element of the vectors returned will be read. There are no write accesses, only read :) I am not sure if I can use anything else though, these values are used by an external library that needs its own objects (that contain the vectors I compute here). – Ant Oct 14 '17 at 12:55
  • `Every element [returned] will be read` - no *immediate* advantage (relevant proportion of results never computed) to lazy evaluation, here. (There *might* be an off-chance that writing those "vector"s is "rate limited", and lazy evaluation via redefining `__getitem__` helping there.) – greybeard Oct 15 '17 at 07:59

2 Answers


Update:

The builtin function numpy.digitize can actually do exactly what you need. It only requires a small trick: digitize assigns values to bins, so we can convert k into bins by sorting the array and placing the bin borders exactly midway between adjacent elements.

import numpy as np

A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3, 1])  # added another value to show that sorting/binning works

ki = np.argsort(k)
ks = k[ki]

i = np.digitize(A, (ks[:-1] + ks[1:]) / 2)

indices = ki[i]
values = ks[i]

print(values, indices)
# [ 3.   4.1  4.1  4.1] [1 0 0 0]

Old answer:

I would take a brute-force approach: perform one vectorized pass over A for each element in k and update those locations where the current element improves the approximation.

import numpy as np

A = np.asarray([3, 4, 5, 6])
k = np.asarray([4.1, 3])

err = np.full(A.shape, np.inf)  # keep track of the best error over the passes

values = np.empty_like(A, dtype=k.dtype)
indices = np.empty_like(A, dtype=int)

for i, v in enumerate(k):
    d = np.abs(A - v)
    mask = d < err  # only update where v is closer to A
    values[mask] = v
    indices[mask] = i
    err[mask] = d[mask]

print(values, indices)
# [ 3.   4.1  4.1  4.1] [1 0 0 0]

This approach requires three temporary variables of the same size as A, so it will fail if not enough memory is available.

  • Thanks for your answer! The brute force solution is unfortunately way too slow. np.digitize is a good idea, but I don't think it is any different from np.searchsorted, right? They have slightly different interfaces but will run at similar speeds. Most likely the only way to improve this is to use the fact that the matrix A never changes, only k does; so pre-process A somehow and bring it into a format that makes it easier to perform the necessary computations – Ant Oct 12 '17 at 18:39
  • @Ant I think you are right. I am not familiar with `searchsorted`, so this similarity somehow went unnoticed by me. However, it may be worth giving `digitize` a try anyway. Sometimes very similar numpy functions show surprising differences in performance. – MB-F Oct 12 '17 at 18:47

So, after some work and an idea from the scipy mailing list, I think that in my case (with a constant A and slowly varying k), the best way to do this is to use the following implementation.

class SearchSorted:
    def __init__(self, tensor, use_k_optimization=True):

        '''
        use_k_optimization requires storing 4x the size of the tensor.
        If use_k_optimization is True, the class will assume that successive calls will be made with similar k.
        When this happens, we can cut the running time significantly by storing additional variables. If it won't be
        called with similar k, set the flag to False, as it would otherwise just consume more memory for no
        good reason.
        '''

        self.indices_sort = np.argsort(tensor)
        self.sorted_tensor = tensor[self.indices_sort]
        self.inv_indices_sort = np.argsort(self.indices_sort)
        self.use_k_optimization = use_k_optimization

        self.previous_indices_results = None
        self.prev_idx_A_k_pair = None

    def query(self, k):
        midpoints = k[:-1] + np.diff(k) / 2
        idx_count = np.searchsorted(self.sorted_tensor, midpoints)
        idx_A_k_pair = []
        count = 0

        old_obj = 0
        for obj in idx_count:
            if obj != old_obj:
                idx_A_k_pair.append((obj, count))
                old_obj = obj
            count += 1

        if not self.use_k_optimization or self.previous_indices_results is None:
            #creates the index matrix in the sorted case
            final_indices = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
            #and now unsort it to match the original tensor position
            indicesClosest = final_indices[self.inv_indices_sort]
            if self.use_k_optimization:
                self.prev_idx_A_k_pair = idx_A_k_pair
                self.previous_indices_results = indicesClosest
            return indicesClosest

        # indices arrays computed on the sorted tensor, for the previous and the new k
        old_indices_unsorted = self._create_indices_matrix(self.prev_idx_A_k_pair, self.sorted_tensor.shape, len(k))
        new_indices_unsorted = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
        mask = new_indices_unsorted != old_indices_unsorted  # positions whose nearest k changed

        self.prev_idx_A_k_pair = idx_A_k_pair
        self.previous_indices_results[self.indices_sort[mask]] = new_indices_unsorted[mask]
        indicesClosest = self.previous_indices_results

        return indicesClosest

    @staticmethod
    def _create_indices_matrix(idx_A_k_pair, matrix_shape, len_quant_points):
        old_idx = 0
        final_indices = np.zeros(matrix_shape, dtype=int)
        for idx_A, idx_k in idx_A_k_pair:
            final_indices[old_idx:idx_A] = idx_k
            old_idx = idx_A
        final_indices[old_idx:] = len_quant_points - 1
        return final_indices

The idea is to sort the array A beforehand, then use searchsorted on the sorted A with the midpoints of k. This gives the same information as before, in that it tells us exactly which points of A are closer to which points of k. The method _create_indices_matrix creates the full indices array from this information, and we then unsort it to recover the original order of A. To take advantage of slowly varying k, we save the last indices and determine which of them have to change; we then change only those. For slowly varying k, this gives superior performance (at a considerably higher memory cost, however).
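A minimal usage sketch (here get_next_k is just a hypothetical stand-in for wherever the successive, slowly varying k come from; query assumes k is sorted):

searcher = SearchSorted(A, use_k_optimization=True)
for _ in range(1000):
    k = np.sort(get_next_k())  # hypothetical source of the next, slightly perturbed k
    indices = searcher.query(k)
    values = k[indices]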

For a random array A of 5 million elements and k of about 30 elements, repeating the experiments 60 times, we get

Function search_sorted1; 15.72285795211792s
Function search_sorted2; 13.030786037445068s
Function query; 2.3306031227111816s <- the one with use_k_optimization = True
Function query; 4.81286096572876s   <- with use_k_optimization = False

scipy.spatial.KDTree.query is too slow, so I did not time it (it takes well over a minute). Here is the code used for the timing; it also contains the implementations of search_sorted1 and search_sorted2.

import numpy as np
import scipy
import scipy.spatial
import time


A = np.random.rand(10000*500) #5 million elements
k = np.random.rand(32)
k.sort()

#first attempt, detailed in the answer, too
def search_sorted1(A, k):
    indicesClosest = np.searchsorted(k, A)
    flagToReduce = indicesClosest == k.shape[0]
    modifiedIndicesToAvoidOutOfBoundsException = indicesClosest.copy()
    modifiedIndicesToAvoidOutOfBoundsException[flagToReduce] -= 1

    flagToReduce = np.logical_or(flagToReduce,
                        np.abs(A-k[indicesClosest-1]) <
                        np.abs(A - k[modifiedIndicesToAvoidOutOfBoundsException]))
    flagToReduce = np.logical_and(indicesClosest > 0, flagToReduce)
    indicesClosest[flagToReduce] -= 1

    return indicesClosest

#taken from @Divakar's answer linked in the comments under the question
def search_sorted2(A, k):
    indicesClosest = np.searchsorted(k, A, side="left").clip(max=k.size - 1)
    mask = (indicesClosest > 0) & \
           ((indicesClosest == len(k)) | (np.fabs(A - k[indicesClosest - 1]) < np.fabs(A - k[indicesClosest])))
    indicesClosest = indicesClosest - mask

    return indicesClosest

def kdquery1(A, k):
    # cKDTree expects 2-D data, so reshape the 1-D arrays into columns
    d = scipy.spatial.cKDTree(k[:, None], compact_nodes=False, balanced_tree=False)
    _, indices = d.query(A[:, None])
    return indices

#After an idea on the scipy mailing list
class SearchSorted:
    def __init__(self, tensor, use_k_optimization=True):

        '''
        Using this requires storing 4x the size of the tensor.
        If use_k_optimization is True, the class will assume that successive calls will be made with similar k.
        When this happens, we can cut the running time significantly by storing additional variables. If it won't be
        called with similar k, set the flag to False, as it would otherwise just consume more memory for no
        good reason.
        '''

        self.indices_sort = np.argsort(tensor)
        self.sorted_tensor = tensor[self.indices_sort]
        self.inv_indices_sort = np.argsort(self.indices_sort)
        self.use_k_optimization = use_k_optimization

        self.previous_indices_results = None
        self.prev_idx_A_k_pair = None

    def query(self, k):
        midpoints = k[:-1] + np.diff(k) / 2
        idx_count = np.searchsorted(self.sorted_tensor, midpoints)
        idx_A_k_pair = []
        count = 0

        old_obj = 0
        for obj in idx_count:
            if obj != old_obj:
                idx_A_k_pair.append((obj, count))
                old_obj = obj
            count += 1

        if not self.use_k_optimization or self.previous_indices_results is None:
            #creates the index matrix in the sorted case
            final_indices = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
            #and now unsort it to match the original tensor position
            indicesClosest = final_indices[self.inv_indices_sort]
            if self.use_k_optimization:
                self.prev_idx_A_k_pair = idx_A_k_pair
                self.previous_indices_results = indicesClosest
            return indicesClosest

        # indices arrays computed on the sorted tensor, for the previous and the new k
        old_indices_unsorted = self._create_indices_matrix(self.prev_idx_A_k_pair, self.sorted_tensor.shape, len(k))
        new_indices_unsorted = self._create_indices_matrix(idx_A_k_pair, self.sorted_tensor.shape, len(k))
        mask = new_indices_unsorted != old_indices_unsorted  # positions whose nearest k changed

        self.prev_idx_A_k_pair = idx_A_k_pair
        self.previous_indices_results[self.indices_sort[mask]] = new_indices_unsorted[mask]
        indicesClosest = self.previous_indices_results

        return indicesClosest

    @staticmethod
    def _create_indices_matrix(idx_A_k_pair, matrix_shape, len_quant_points):
        old_idx = 0
        final_indices = np.zeros(matrix_shape, dtype=int)
        for idx_A, idx_k in idx_A_k_pair:
            final_indices[old_idx:idx_A] = idx_k
            old_idx = idx_A
        final_indices[old_idx:] = len_quant_points - 1
        return final_indices

mySearchSorted = SearchSorted(A, use_k_optimization=True)
mySearchSorted2 = SearchSorted(A, use_k_optimization=False)
allFunctions = [search_sorted1, search_sorted2,
                mySearchSorted.query,
                mySearchSorted2.query]

print(np.array_equal(mySearchSorted.query(k), kdquery1(A, k)))
print(np.array_equal(mySearchSorted.query(k), search_sorted2(A, k)))
print(np.array_equal(mySearchSorted2.query(k), search_sorted2(A, k)))

if __name__ == '__main__':
    num_to_average = 3
    for func in allFunctions:
        sA = A.copy()
        if func.__name__ != 'query':
            func_to_use = lambda x: func(sA, x)
        else:
            func_to_use = func
        k_to_use = k.copy()  # copy, so the drift does not accumulate across functions
        start_time = time.time()
        for idx_average in range(num_to_average):
            for idx_repeat in range(10):
                k_to_use += (2*np.random.rand(*k.shape)-1)/100  # uniform in (-1/100, 1/100)
                k_to_use.sort()
                indices = func_to_use(k_to_use)
                val = k_to_use[indices]

        end_time = time.time()
        total_time = end_time - start_time

        print('Function {}; {}s'.format(func.__name__, total_time))

I'm sure that it is still possible to do better (I use a lot of space for the SearchSorted class, so we could probably save something). If you have any ideas for an improvement, please let me know!

  • Commenting the code looks like a good idea, including [doc strings](https://www.python.org/dev/peps/pep-0257/) (yes, I've noticed the constructor). [Revamped/retargeted some](https://codereview.stackexchange.com/help/how-to-ask), this should make an excellent question over at [Code Review](https://codereview.stackexchange.com/help/on-topic). Tag performance, numpy and maybe scipy. Use "the Python tags" at own risk. Are there any more observations in the profile? How does the effectiveness of `use_k_optimization` vary "with delta"? – greybeard Oct 15 '17 at 08:07
  • (You can ease the memory pressure some by specifying `int8` for `previous_indices_results`. Using `self.previous_indices_results[mask] = new_indices_unsorted[self.inv_indices_sort[mask]]` enhances readability, in my eyes, and `indices_sort` doesn't need to be an instance attribute any longer.) (You hit a sweet spot with `(2*np.random.rand(*k.shape)-1)/100` - `…/20` doesn't give much of a speed-up, and `…/500` doesn't improve much.) – greybeard Oct 17 '17 at 08:47
  • @greybeard Thank you for your suggestions! But will int8 be enough to store the whole vectors, given that it has more than 256 elements? I could probably use int32 instead of int64, though. For the mask, that is how I originally wrote it, but then I thought that it was wrong, because the previous_indices are in the original order of A, while the mask is computed on the sorted array A. So if I apply the mask to previous_indices, I am going to get the wrong elements, right? Interesting that ../100 seems to be optimal; thankfully it is the typical behavior in my case :) – Ant Oct 17 '17 at 09:18
  • `will int8 be enough to [index k], given that [the whole vector] has more than 256 elements?` I hope so: the example `k`s (`about 30 points`) could be indexed by "uint5". `I thought that it was wrong` - funny, I had to convince myself that using "the forward permutation" on the left side was right. And equivalent to using "the inverse permutation" on the right. – greybeard Oct 17 '17 at 11:22
  • (`I had to convince myself that using "the forward permutation" on the left side was right` I *still* feel equally uneasy about both.) – greybeard Oct 17 '17 at 12:23
  • @greybeard Ahah yes, I had the same feeling. I tested both versions using the output from one of the other functions, and the correct one is the one I posted above; still, it is confusing :/ – Ant Oct 17 '17 at 14:04
  • Two more ideas: between "queries", keep the indices, and do a linear search in `k` with each value in `A` *starting from the old index*. If there are no old indices, use a "completely unrolled"/"hard-coded" binary search. (If `k` can't be guaranteed to be smaller than 64 entries, escape to regular binary search.) – greybeard Oct 18 '17 at 09:04
  • ("No time" to try this myself, now: I've booked a tricycle, and the sun is *finally* getting the better of the fog.)(Not even time to read [An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine](http://www.cs.hiroshima-u.ac.jp/cs/_media/offline-icpp-final.pdf).) – greybeard Oct 18 '17 at 09:06
  • @greybeard Thanks! A lot of nice ideas :D The thing is that the np.searchsorted call takes about 1% of the computational time. So doing a linear search starting from the old index won't really improve the performance that much. The paper you link would probably do the trick, making sorting the indices much faster; but for now I don't see a way to easily add this to my codebase. :) – Ant Oct 18 '17 at 09:25
  • `A lot of […] ideas` with inconclusive results once tried out (the weather inviting to try out rain coats). Linear search in interpreted Python turned out too slow to mention; for a quick JIT PoP, I converted to Java - a measly 3-times improvement over "query false". It may or may not be worthwhile to try PyPy(3) or convert `query` to Java/Jython to get additional "performance points". – greybeard Oct 21 '17 at 03:11