Efficient lookup table for collection of numpy arrays

Question

I would like to know what's the most efficient way to create a lookup table for floats (and collection of floats) in Python. Since both sets and dicts need the keys to be hashable, I guess can't use some sort of closeness to check for proximity to already inserted, can I? I have seen this answer and it's not quite what I'm looking for as I don't want to give the burden of creating the right key to the user and also I need to extend it for collections of floats. For example, given the following code:

>>> import numpy as np
>>> a = {np.array([0.01, 0.005]): 1}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
>>> a = {tuple(np.array([0.01, 0.005])): 1}
>>> tuple(np.array([0.0100000000000001,0.0050002])) in a
False

I would like the last statement to return True. Coming from a C++ world, I would create a std::map and provide a compare function that can do the comparison with some user defined tolerance to check if the values have been added to the data structure. Of course this question extends naturally to the lookup tables of arrays (for example numpy arrays). So what's the most efficient way to accomplish what I'm looking for?

Are you trying to make a mapping from `float` to `list of float`? — shanmuga, Aug 19 '15 at 13:05
Can't you just provide a custom lookup function (`def look_up(dictionary, float_key)`) to the user that implements the solution you linked to? — Andy Kubiak, Aug 19 '15 at 13:07
I don't want to put any burden on the user. The idea is to have thousands of numpy arrays that represent 3D points so that the creation of new points is avoided if the point is already in the dictionary. But I need to check with respect to some tolerance value (snapping). — aaragon, Aug 19 '15 at 13:13
@Sid I see that answer, but where are you suggesting? The data structure would still need to hash keys and there's gonna have to be some rounding involved, right? — aaragon, Aug 20 '15 at 08:25
For your use case, it might be most useful to create your own class. The `__getitem__` and `__setitem__` are essentially what would do lookups and make entries respectively. You could override those to do everything you need it to - converting to tuples, rounding the values, etc. — Sid, Aug 20 '15 at 12:28

Bas Swinckels · Answer 1 · 2015-08-19T15:44:26.487

Since you are interested in 3D points, you could think about using some data-structure that is optimized for storing spatial data, such as a KD-tree. This is available in Scipy and allows the lookup of the point closest to a given coordinate. After you have looked up the this point, you could then do a check to see if you are within some tolerance to accept the new point or not.

Usage should be something like this (untested, never used it myself):

from scipy.spatial import KDTree
points = ... # points is [Nx3]
tree = KDTree(points)  
new_point = ... # array of length 3
distance, nearest_index = tree.query(new_point)
if distance > tolerance:  # add point
    points = np.vstack((points, new_point))
    tree = KDTree(points)  # generate tree from scratch

Note that a KD-tree is efficient for looking up a point in a static collection of points (cost of a lookup is O(log(N)), but they are not optimized for repeatedly adding new points. The Scipy implementation even lacks a method to add new points, so you have to generate a new tree every time you insert a new point. Since this operation is probably O(N*log(N)), it might be faster to just to do brute-force calculation of all distances, which costs O(N). Note also that there is an alternative version cKDTree, which might be implemented in C for speed, the documentation is not really clear on this.

So this is not what I need, because the idea is to add new points to the tree only if they're not present in the tree. But the tree has to be built from scratch (no points). — aaragon, Aug 19 '15 at 15:50
That is easy to achieve. The first point should be accepted by default, so you do `tree = KDTree([first_point])`. For all the next points, you use the code I showed and only add them when they are not close to existing points. — Bas Swinckels, Aug 19 '15 at 16:00
It's not efficient (as I put in the title of the post). It doesn't make sense to recreate the entire data structure every time I insert a new point, does it? There has to be a better way. The search time is O(log(N)), but the same complexity has a `std::map` because it's implemented as a black/red tree. I would be surprised if you couldn't do the same thing in Python. — aaragon, Aug 19 '15 at 17:09

Efficient lookup table for collection of numpy arrays

1 Answers1