What is the risk of collisions when relying on Python hash function?

Question

In my program I need to store data related to many (we talk hundreds of thousands, millions) game board states. For that I use a dict.

class BoardState(object):
    def __init__(self, ...):
        # ...
        self.board = [ [ None ] * self.cols for _ in xrange(self.rows) ]

    def __hash__(self):
        board_tuple = tuple([ tuple(row) for row in self.board ])
        return hash(board_tuple)

    # ...

self.board is a 2D list, in my main use case, with 6 rows and 7 columns.

At the beginning I indexed the dict with BoardState objects. But since I don't use BoardState objects stored in dict for other purpose than future lookup I noticed that I can save memory by indexing with hash(board_state) (this version uses 4 times less memory).

What is the chance that two different BoardState objects (with different boards inside) will result in the same value after hashing?

To clarify a bit, that's how I store and retrieve values from dict:

board_state = BoardState(...)
my_values[hash(board_state)] = { ... }
...
other_val_with_board_state = source_function()
retrieved = my_values[hash(other_val_with_board_state)]

(As I mentioned earlier, I index with result from hash() to save memory, since I don't use BoardState objects later.)

UPDATE Now I'm wondering if maybe using string representation of board_state.board as index would be a good solution to my problem.

Now I see what u are tying to do...Hard to say, there might be collisions. If you want to be more secure you should use more advanced hashing inside hashlib. Or define your own hashing which ensure unique results with regards to the board configuration. — Simone Zandara, Dec 17 '15 at 13:05
@Reti43 Yeah, it is somehow. But I don't want to waste memory space with objects I would use only for their __eq__ method. — Luke, Dec 17 '15 at 13:41
You could also just use the tuple of tuples (board_tuple) as a dictionary key, instead of its hash. No need to make a new class if the only interesting part is board_tuple. — Rob, Dec 17 '15 at 13:58
@Rob Board list / tuple is key to the state but there are also helper vars like rows and cols, and methods which operate on the state. Yes, I could also use that tuple as index but string would probably take less memory space. — Luke, Dec 17 '15 at 14:55

score 12 · Accepted Answer · answered Jun 19 '21 at 15:44

Short answer: use hashlib instead.

You shouldn't rely on hash if your program cannot handle the collisions or you want to save hash values or use multiprocessing.

Python hash function converts maps data to 64 bits (range of int). The most basic analysis of hashing is limited to viewing it as birthday problem. There is a nice SO answer and a detailed wiki page about that. Typical quote is "if you have less than billions of elements you should not worry". This is very simplistic view however.

As an anecdote: I've recently ran hash over 8.7e6 unique short strings created manually by humans. The mathematical expectation of the number of collisions for 64 bit hash is 4e-6. I got 32. Fun fact: hash(chr(9786)) == hash(chr(58)+chr(38)) ('☺' collides with ':&') (as of Python3.8.10).

Cryptographic functions from hashlib are way more resistant to collisions. Something like hashlib.sha256(pickle.dumps(my_obj,1)) might even be faster than conversion to tuples.

If the memory concern is the reason for hashing, one should first consider representing the data with lesser amount of bytes in the first place. Specifying __slots__ and reducing the number of the nested objects are first things that come to mind. However for the case of small objects this will be an uphill battle due to amount of scaffolding each python object needs.

If we take chess for example the full state can be stored in 24 bytes or more comfortably, in 32 bytes (64 cells each needing 4 bits to represent its contents). The best we can get with python is bytes which will take 65 bits (33 bytes of service info) and will require additional manipulations to push two 4-bit chunks into a single byte. The other alternative might be bitarray.frozenbitarray which will need 112 bytes to store the same amount of useful information (80 bytes of info). But hey, it still beats tuples within tuples, where there is 40 bytes of scaffolding per tuple.

score 0 · Answer 2 · answered Dec 17 '15 at 09:21

0

While I'm not sure what chance there is in getting the same value after hashing, presumably it is possible and could be problematic.

That being said, if you don't use BoardState objects stored in the dict for any purpose other than lookup, would it be feasible for you to add an id property to the BoardState class that is generated uniquely on __init__ (i.e. set to a global counter incremented by 1 after each new BoardState object is created)? Then you could use id as the key to your dictionary for future lookup and avoid any potential collision issues.

answered Dec 17 '15 at 09:21

kdawg

153
4

I can't do that, because it's possible that the same BoardState will be created in future and would't have the same id. – Luke Dec 17 '15 at 09:29
If 2 BoardState objects with same configuration are the same, why are you worrying of collisions? I do not quite get your use case. – Simone Zandara Dec 17 '15 at 09:33
@xbirkettx Am I asking if two different, let's say 6x7 boards (2D lists/tuples), could have the same hash. BTW If you don't mention me in the comment I don't get the notification since it's not my answer. – Luke Dec 17 '15 at 10:10
@Luke you have redefined your hash, but are you still using a dict for lookup? If so you don't have to worry about collisions – Simone Zandara Dec 17 '15 at 10:30
@xbirkettx But I'm doing lookup by the hash. So the values for those two states with collision would be shared between them. – Luke Dec 17 '15 at 12:50

score -1 · Answer 3 · answered Dec 17 '15 at 10:56

To know the risk of collision, we would have to take a look at the hash function implementation. The main idea is that there is a start from a space, let's say A (all the form that the variable board_tuple can take) to a other space B (result of hash function) through the hash function H.

The risk of collision come from two things :

the size of spaces : if you have 2 board_tuple possibilities and B has a size of 10⁶. Then there is really few chances that a collision will occur. In an other hand, if you can have 1000 board_tuple and H is resulting in a space B of 16, then it is almost for sure that collision will happen.
The hash function in itself. If the hash function is h(x) = 2, then there will always be collision.

However do not worry too much, hash function are well made and I am almost sure that they are handling collisions smartly with some classic strategies :

Do a rerun of the hash function until there is no collision
Store as an array of elements resulting in the same hash instead of having a collision ....

I'm afraid you don't answer the question. You just describe some facts about hashing and then conclude that we should look into Python's hash implementation to how whether and how frequently a collision can occur. Which is what the question is asking about. [This answer](http://stackoverflow.com/a/9010557/2243104) and links therein go into more detail about the specifics. — Reti43, Dec 17 '15 at 13:19

What is the risk of collisions when relying on Python hash function?

3 Answers3