
I am currently implementing a Hash Array Mapped Trie (HAMT) in Java and I've run into the following issue. When inserting new key-value pairs, there will inevitably be hash collisions. In the original paper, the author suggests:

The existing key is then inserted in the new sub-hash table and the new key added. Each time 5 more bits of the hash are used the probability of a collision reduces by a factor of 1/32. Occasionally an entire 32 bit hash may be consumed and a new one must be computed to differentiate the two keys.

... and also:

The hash function was tailored to give a 32 bit hash. The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 2^32 of a further collision.

So I tried this in Java, with strings. It is known that:

"Ea".hashCode() == "FB".hashCode(); // true

... because of the String#hashCode() algorithm. Following the suggestion in the paper and extending the string with the tree depth to yield another, non-colliding hash code sadly doesn't work:

"Ea1".hashCode() == "FB1".hashCode();  // :(  still the same!!

The above holds for any integer you concatenate to the strings: since both prefixes already have the same hash code and String#hashCode() is a base-31 polynomial, appending an identical suffix preserves the collision.
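A quick experiment confirms this, and also shows one way to realize the paper's "rehash with the trie level" idea: use a hash function that is not Java's base-31 polynomial. The level-seeded FNV-1a-style `rehash` below is my own stand-in, not the function from the paper:

```java
// Sketch: why suffixing the level fails, and a level-seeded rehash that works.
// The rehash below is a hypothetical FNV-1a variant, not the paper's function.
public class HashDemo {

    // FNV-1a over the key's characters, with the trie level mixed into the seed.
    // Unlike String#hashCode(), this is not the base-31 polynomial, so keys
    // that collide under hashCode() will (almost always) separate here.
    static int rehash(String key, int level) {
        int h = 0x811C9DC5 ^ level;   // FNV offset basis, XORed with the level
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
            h *= 0x01000193;          // FNV prime
        }
        return h;
    }

    public static void main(String[] args) {
        // Appending the same suffix can never break the collision, because
        // hash("Ea" + s) - hash("FB" + s)
        //   == (hash("Ea") - hash("FB")) * 31^s.length() == 0.
        for (int level = 0; level < 5; level++) {
            System.out.println(("Ea" + level).hashCode()
                            == ("FB" + level).hashCode()); // true every time
        }
        // A structurally different hash function separates the two keys:
        System.out.println(rehash("Ea", 1) == rehash("FB", 1)); // false
    }
}
```

The key point is that any function computed *from* the colliding hash codes (rather than from the keys themselves) would still collide; the rehash must go back to the raw key data.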

My question is: how do you solve this situation? There is this answer to a very similar question, but the discussion there does not arrive at a real solution. So how do we do this...?

Martin Häusler

1 Answer


You have to implement the equals() method to compare whether the values are equal.

hashCode() is just used to sort data into collections, and it is useful for things like binary search to work. But hashCode() is nothing without equals().
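Restating the contract (my wording, not the answerer's): equal objects must have equal hash codes, but equal hash codes say nothing about equality. The question's own keys illustrate this:

```java
// Equal hash codes do not imply equal objects; only equals() decides equality.
public class ContractDemo {
    public static void main(String[] args) {
        String a = "Ea", b = "FB";
        System.out.println(a.hashCode() == b.hashCode()); // true
        System.out.println(a.equals(b));                  // false
    }
}
```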

Paulo
    For the tree, this would mean: exhaust all 32 bits of the hash code, and if there is still a collision, store the conflicting entries in an array and scan through it linearly, checking for ´entry.key().equals(searchKey)´. This would work, but it seems to me that this is not what the paper author intended... – Martin Häusler Oct 20 '16 at 19:27
  • ... on second thought, you are right. Even if there were an alternative way of calculating the hash code, there is no way to do it for an arbitrary key object. So the solution is: follow the algorithm from the paper, and if there is a collision in the hash code, take the next 5 bits and repeat. Should the hash code be exhausted with the collision still unresolved, store the conflicting key-value pairs in one node in an array. When doing the lookup, iterate over it linearly. These are rare corner cases anyway, no need to optimize them too much. – Martin Häusler Oct 20 '16 at 21:09
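The fallback described in the last comment could be sketched like this (the class and method names are hypothetical, not taken from any particular HAMT implementation):

```java
import java.util.Arrays;

// Minimal sketch of a "collision node": used only when all hash bits are
// exhausted and distinct keys still collide. Lookups degrade to a linear
// equals() scan, which is acceptable because this case is rare.
final class CollisionNode<K, V> {
    private Object[] entries; // flat array of alternating key, value pairs

    CollisionNode(K k1, V v1, K k2, V v2) {
        entries = new Object[] { k1, v1, k2, v2 };
    }

    @SuppressWarnings("unchecked")
    V get(K key) {
        // No hash bits left to branch on: scan linearly, compare with equals().
        for (int i = 0; i < entries.length; i += 2) {
            if (entries[i].equals(key)) {
                return (V) entries[i + 1];
            }
        }
        return null;
    }

    void put(K key, V value) {
        for (int i = 0; i < entries.length; i += 2) {
            if (entries[i].equals(key)) {
                entries[i + 1] = value; // overwrite existing mapping
                return;
            }
        }
        // Unknown key: grow the array by one key-value pair.
        Object[] grown = Arrays.copyOf(entries, entries.length + 2);
        grown[entries.length] = key;
        grown[entries.length + 1] = value;
        entries = grown;
    }
}
```

In a persistent HAMT the `put` would return a new node instead of mutating the array, but the lookup logic is the same either way.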