
I'm a beginner using Python and am trying to search a dictionary for keys that are numpy arrays holding the 2-D coordinates of a point. So, what I want is: a dictionary whose keys are numpy arrays and whose values are integers. The `in` operator would then be used to compare keys using some tolerance measure (the `numpy.allclose` function). I understand that numpy arrays are not hashable, so I would have to override the `__getitem__` and `__setitem__` functions (based on what I found in How to properly subclass dict and override `__getitem__` & `__setitem__`). But how do I make the arrays hashable so they can be used as dictionary keys? And how do I override the behaviour of the `in` operator for this case?

Thanks for the help!

  • You can’t make numpy arrays hashable since you are not the author of the type. You could create a subtype that inherits from numpy arrays and is hashable though. – poke Aug 18 '15 at 11:42
  • Can you tell us more about what you want to achieve with that datastructure? I bet there is a better way to do this with numpy. – swenzel Aug 18 '15 at 11:46
  • I need to check if a key (numpy array) that I'm trying to add is already in the dictionary, and only add that dict entry if it is not. – Gonçalo Valente Aug 18 '15 at 11:55
  • Could you give a minimal example with some sample data? – Cleb Aug 18 '15 at 12:01
  • Sure, let's say that I have dict = {[0.5 0.5]: 13, [1. 1.5]: 14} and that I have a new numpy array that is [0.5 0.5]. I want to (efficiently) search this dictionary for this 'new' key and only add the entry [0.5 0.5] (with a corresponding integer value) if it doesn't exist in dict. – Gonçalo Valente Aug 18 '15 at 12:07

3 Answers


Numpy arrays are not hashable, but tuples are. So you can hash the array if you turn it into a tuple. If you also round it beforehand, you can in principle take advantage of the fast hash lookup, because you now have discrete points. But plain rounding runs into resolution problems when translating back, since rounding is done in decimal while the numbers are stored in binary. It is possible to circumvent this by scaling to integers, although that slows everything down a bit.
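
For example, a quick illustration of why plain truncation is risky (0.29 has no exact binary representation):

0.29 * 10**5              # -> 28999.999999999996, not 29000.0
int(0.29 * 10**5)         # -> 28999; truncation loses the last digit
int(round(0.29 * 10**5))  # -> 29000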

In the end you just need to write a class that translates back and forth between arrays and tuples on the fly and you're good to go.
An implementation could look like this:

import numpy as np

class PointDict(dict):

    def __init__(self, precision=5):
        super(PointDict, self).__init__()
        self._prec = 10**precision

    def decode(self, tup):
        """
        Turns a tuple that was used as index back into a numpy array.
        """
        return np.array(tup, dtype=float)/self._prec

    def encode(self, ndarray):
        """
        Rounds a numpy array and turns it into a tuple so that it can be used
        as index for this dict. Using round() before int() avoids off-by-one
        truncation errors caused by the binary float representation.
        """
        return tuple(int(round(x)) for x in ndarray*self._prec)

    def __getitem__(self, item):
        # Only the key is encoded; the stored value (an integer) is
        # returned unchanged.
        return super(PointDict, self).__getitem__(self.encode(item))

    def __setitem__(self, item, value):
        return super(PointDict, self).__setitem__(self.encode(item), value)

    def __contains__(self, item):
        return super(PointDict, self).__contains__(self.encode(item))

    def update(self, other):
        for item, value in other.items():
            self[item] = value

    def items(self):
        for item in self:
            yield (item, self[item])

    def __iter__(self):
        # Yield the stored tuple keys decoded back into numpy arrays.
        for item in super(PointDict, self).__iter__():
            yield self.decode(item)
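
A quick usage sketch (default precision of five decimal places; the rounding acts as the tolerance):

pd = PointDict()
pd[np.array([0.5, 0.5])] = 13

print(np.array([0.5, 0.5]) in pd)        # True
print(np.array([0.5000001, 0.5]) in pd)  # True: rounds to the same key
print(np.array([0.6, 0.5]) in pd)        # False
print(pd[np.array([0.5, 0.5])])          # 13

Note that the tolerance comes from snapping to a grid, not from numpy.allclose: two points closer together than the precision can still round to different keys if they straddle a rounding boundary.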

When looking up a lot of points, a pure numpy solution with vectorized batch write/lookup might be better. This solution, however, is easy to understand and to implement.

swenzel
  • Beware of any answer containing the phrase "you just need to" ... :) – holdenweb Aug 18 '15 at 14:03
  • Can you tell me what you mean by "vectorized batch write/lookup"? Where can I read something on this? Thanks! – Gonçalo Valente Aug 18 '15 at 14:34
  • @GonçaloValente What I mean is that you look up or update multiple points at the same time. Numpy is fastest when you can vectorize operations which means indexing with arrays and assigning or changing multiple entries at once. Problem here would be to design a clever way to save the items along with their values so that we can take advantage of that. You won't get O(1) lookup but O(log(n)) might be possible... – swenzel Aug 18 '15 at 19:51
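
A rough, hypothetical sketch of the vectorized idea from the comment above (PointStore is an illustrative name; this is still a linear scan, just done by numpy in C rather than in a Python loop):

import numpy as np

class PointStore(object):

    def __init__(self):
        self.keys = np.empty((0, 2))  # one row per stored 2-D point
        self.values = []

    def find(self, point, rtol=1e-05, atol=1e-08):
        """Return the index of the first stored point close to point, or -1."""
        if not self.values:
            return -1
        # Compare the query against every stored row at once.
        close = np.all(np.isclose(self.keys, point, rtol=rtol, atol=atol), axis=1)
        hits = np.nonzero(close)[0]
        return int(hits[0]) if hits.size else -1

    def add_if_absent(self, point, value):
        """Insert point only if no close point is already stored."""
        if self.find(point) == -1:
            self.keys = np.vstack([self.keys, point])
            self.values.append(value)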

Instead of a numpy array, use a 2-tuple of floats as the key. Tuples are hashable since they are immutable.

Python dictionaries use a hash-table in the background to make key lookup fast.

Writing a closeto function isn't that hard:

def closeto(a, b, limit=0.1):
    x, y = a
    p, q = b
    return (x-p)**2 + (y-q)**2 < limit**2

This can be used to find points that are close. But then you have to iterate over all the keys, because key lookup is exact. If you do this iteration in a comprehension, it is much faster than a for loop.

Testing (in IPython, with Python 3):

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:    def closeto(a, b, limit=0.1):
:        x, y = a
:        p, q = b
:        return (x-p)**2 + (y-q)**2 < limit**2
:--

In [2]: d = {(0.0, 0.0): 12, (1.02, 2.17): 32, (2.0, 4.2): 23}

In [3]: {k: v for k, v in d.items() if closeto(k, (1.0, 2.0), limit=0.5)}
Out[3]: {(1.02, 2.17): 32}
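
Building on this, a sketch of the "only add if no close key exists" logic from the question (add_point is an illustrative helper, not a library function):

def add_point(d, point, value, limit=0.1):
    # Insert point (a 2-tuple) only if no existing key is within limit.
    if not any(closeto(k, point, limit) for k in d):
        d[point] = value
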
Roland Smith
  • I need to use numpy arrays, so I can't change that. – Gonçalo Valente Aug 18 '15 at 12:25
  • It seems in order to find if an object is in this dictionary you have to go through all of them, and isn't the purpose of using a dictionary to have fast access? Isn't there a smart way to hash a numpy array? – Gonçalo Valente Aug 18 '15 at 12:38
  • @GonçaloValente The reason that Python uses hashable keys is to make checking if a key exists in a dict (the `in` operator) faster by using a [hash-table](https://en.wikipedia.org/wiki/Hash_table) in the background. So you don't have to go through all items. – Roland Smith Aug 18 '15 at 12:53

Convert the arrays to tuples, which are hashable:

In [18]: a1 = np.array([0.5, 0.5])

In [19]: a2 = np.array([1.0, 1.5])

In [20]: d = {}

In [21]: d[tuple(a1)] = 14

In [22]: d[tuple(a2)] = 15

In [23]: d
Out[23]: {(0.5, 0.5): 14, (1.0, 1.5): 15}

In [24]: a3 = np.array([0.5, 0.5])

In [25]: a3 in d
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-07c81d61b999> in <module>()
----> 1 a3 in d

TypeError: unhashable type: 'numpy.ndarray'

In [26]: tuple(a3) in d
Out[26]: True

Unfortunately, since you want to apply a tolerance to the comparison, you don't have much option but to iterate over all the keys looking for a "close" match, whether you implement this as a function or in-line.
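
If you do want numpy.allclose semantics, a minimal sketch of that linear scan (find_close_key is an illustrative helper):

import numpy as np

def find_close_key(d, arr, rtol=1e-05, atol=1e-08):
    # Compare arr against every tuple key; return the first close key, or None.
    for key in d:
        if np.allclose(np.array(key), arr, rtol=rtol, atol=atol):
            return key
    return None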

holdenweb
  • So that would work except for the fact that I need to compare for membership using the allclose numpy function, which tests arrays within a tolerance value. There must be a smarter way to do it. Would overriding the __contains__ function prevent having that TypeError that you show when asking if a3 is in d? – Gonçalo Valente Aug 18 '15 at 12:28
  • The fact remains that however you implement it you are going to have to apply `allclose` to the new key and all existing keys until you find one close enough or run off the end of the keys (if no match is found). – holdenweb Aug 18 '15 at 12:33
  • But the point of using a dictionary is to be able to find elements quickly, isn't it? I would like to find it in constant time using a hashing function instead of a linear time. – Gonçalo Valente Aug 18 '15 at 12:35
  • Of course you would, but dicts use *exact* matching of keys. So you need to define a hashing function that yields the same value for all "close" values of your points. Good luck with that - I'm not saying it's impossible, but I'm not going to try it. – holdenweb Aug 18 '15 at 12:38
  • One possible transformation that might help is to convert the key values to points on some notional grid before applying the hashing, I suppose, but that wouldn't give you the same results as `allclose`. – holdenweb Aug 18 '15 at 12:40
  • Isn't there something like a C++ `std::map` in Python? A `std::map` is a binary search tree where you can overload `operator<` for your array and get this "almost exact" functionality. I would be surprised if Python doesn't have something like that for a quick lookup of array of floating point values. – aaragon Aug 18 '15 at 12:48