4

I have a bunch of dictionaries in python, each dictionary containing user information e.g.:

NewUserDict={'name': 'John', 'age':27}

I collect all these user info dictionaries within a larger dictionary container, using the hash value of each dictionary as the key (Hashing a dictionary?).

What would be the best way to handle hash collisions, when adding new unique users to the dictionary? I was going to manually compare the dictionaries with colliding hash values, and just add some random number to the more recent hash value, e.g.:

if new_hash in larger_dictionary:
    if larger_dictionary[new_hash] != NewUserDict:
        new_hash = new_hash + somerandomnumber

What would be standard way of handling this? Alternatively, how do I know if I should be worrying about collisions in the first place?

Community
  • 1
  • 1
wenhoo
  • 352
  • 1
  • 4
  • 14

3 Answers3

4

Generally, you would use the most unique element of your user record. And this usually means that the system in general has a username or a unique ID per record (user), which is guaranteed to be unique. The username or ID would be the unique key for the record. Since this is enforced by the system itself, for example by means of an auto-increment key in a database table, you can be sure that there is no collision.

THAT unique key therefore should be the key in your map to allow you to find a user record.

However, if for some reason you don't have access to such a guranteed-to-be-unique key, you can certainly create a hash from the record (as described by you) and use any of a number of hash table algorithms to store elements with possibly colliding keys. In that case, you don't avoid the collision, but you simply deal with it.

A quick and commonly used algorithm for that goes like this: Use a hash over the record to create a key, as you already do. This key may potentially not be unique. Now store a list of records at the location indicated by the key. We call those lists 'buckets'. To store a new element, hash it and then append it to the list stored at that location (add it to the bucket). To find an element, hash it, find the entry, then sequentially search through the list/bucket at that location to find the entry you want.

Here's an example:

mymap[123] = [ {'name':'John','age':27}, {'name':'Bob','age':19} ]
mymap[678] = [ {'name':'Frank','age':29} ]

In the example, you have your hash table (implemented via a dict). You have hash key value 678, for which one entry is stored in the bucket. Then you have hash key value 123, but there is a collision: Both the 'John' and 'Bob' entry have this hash value. No matter, you find the bucket stored at mymap[123] and iterate over it to find the value.

This is a flexible and very common implementation of hash maps, which doesn't require re-allocation or other complications. It's described in many places, for example here: https://www.cs.auckland.ac.nz/~jmor159/PLDS210/hash_tables.html (in chapter 8.3.1).

Performance generally only becomes an issue when you have lots of collisions (when the lists for each bucket get really long). Something you'll avoid with a good hash function.

However: A true unique ID for your record, enforced by the database for example, is probably still the preferred approach.

jbrendel
  • 2,390
  • 3
  • 18
  • 18
  • This exactly addresses my question. I wanted to hash over the record to create the key i.e. ``dict[hash]=record``, so that when I am adding a new record, I can quickly check for duplicates with ``if hash in dict``. As described similarly in the reference provided. Your suggestion should work for me, thank you! – wenhoo Feb 08 '17 at 13:54
3

using the hash value of each dictionary as the key

You are not using the hash value of a dict. Dicts don't have hash values. From the link, it looks like you're using

hash(frozenset(my_dict.items()))

in which case you should instead just be using

frozenset(my_dict.items())

as the key directly. Hash collisions will then be handled for you by the normal dict collision handling.


In general, you should not use hashes as dict keys, as doing so defeats collision resolution. You should use whatever hashed to that hash value as the key.

user2357112
  • 260,549
  • 28
  • 431
  • 505
1

In general, collision happens when multiple keys hash to the same bucket. In that case, we need to make sure that we can distinguish between those keys.

Chaining collision resolution is one of the popular techniques which is used for collision resolution for hash tables. For example, two strings "welcome to stackoverflow" and "how to earn reputation in SO?" yield hash codes 100 and 200 respectively. Assuming the total array size is 10, both of them end up in the same bucket (100 % 10 and 200 % 10). Another approach is Open Addressing to resolve collision while hashing.

You can read this article on Python Dictionary Implementations which talks about handling collision because python dictionaries are implemented using hash tables.

Wasi Ahmad
  • 35,739
  • 32
  • 114
  • 161