
I don't see how this approach avoids collisions. I think that if key.hashCode() is larger than table.length, there will be a collision.

Update: I was actually referring to HashMap#hash in JDK 1.8, and was a little confused about the benefit of spreading the higher bits downward. Now, with the help of this link, I think I'm clear. The benefit is:

  • We don't need to do the % calculation; because the table length is a power of two, a faster bitwise operation can be used instead (see the sketch below).
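
A minimal sketch of that equivalence (my own illustration; the class name is hypothetical): for a power-of-two table length, h % length and h & (length - 1) agree for every non-negative hashcode, and the mask is a single cheap instruction.

public class MaskVsModulo {
    public static void main(String[] args) {
        int length = 64; // power of two, as in HashMap's table
        for (int h = 0; h < 1_000_000; h++) {
            // masking with (length - 1) computes the same bucket as %
            // for non-negative hashcodes
            if (h % length != (h & (length - 1))) {
                throw new AssertionError("mismatch at h=" + h);
            }
        }
        System.out.println("h % 64 == (h & 63) for all tested non-negative h");
    }
}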

As for collisions: if the number of keys is larger than the length of the table, there will be collisions no matter what hash method is used, by the pigeonhole principle.

caisil
  • There's not really enough context for us to understand what you're asking. But indeed, if the key space is larger than your table, you're guaranteed to get collisions sometimes. – Horia Coman Jul 16 '17 at 05:58
  • The code in the title is a newer version of `HashMap#hash` but the explanation of the dupe still works. – Tom Jul 16 '17 at 07:25
  • Please provide code so we can reproduce your issue. – Malakai Jul 16 '17 at 07:53
  • Java uses polynomial hashing everywhere. This has the unfortunate result that the lower N bits of a hash only depend on the lower N bits of all the data items being hashed. In a hashmap of even integers, for example, using a power-of-two table length and masking to generate the bucket index would leave *half the buckets in the hash table completely unused!* Such patterns in the data are common, so this is a real problem. – Matt Timmermans Jul 16 '17 at 13:54

3 Answers


Let's say you naively index into a hashtable using

int index = hashcode % table.length;

This may lead to many collisions in some common use cases. For example, assume table.length is a small power of two (like 32 or 64). In that case only the low-order bits of the hashcode determine the index. This will cause lots of collisions if your objects' hashcodes differ only in the upper bits. The bit shift allows the upper bits of the hashcode to also influence the computed index.
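
To make this concrete, here is a minimal sketch (class and method names are mine; the spread step mirrors the h ^ (h >>> 16) expression from the question). Four hashcodes that differ only in their upper 16 bits all collide in bucket 0 under naive masking, but land in distinct buckets once the upper bits are XORed downward:

public class SpreadDemo {
    // Same spreading idea as JDK 8's HashMap.hash(): fold the upper
    // 16 bits of h into the lower 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int tableLength = 64; // power of two, so index = hash & (length - 1)
        for (int i = 0; i < 4; i++) {
            int h = i << 16; // hashcodes differing only above bit 15
            int naive = h & (tableLength - 1);             // always 0: collision
            int spreadIdx = spread(h) & (tableLength - 1); // 0, 1, 2, 3
            System.out.printf("h=%#010x naive=%d spread=%d%n", h, naive, spreadIdx);
        }
    }
}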

Evan Darke
  • "This will cause lots of collisions if your object's hashcode only differ in the upper bits." This scenario is right, but what if for the scenario that lower bits are different. I think by shifting the upper 8 bit downward, the main purpose of JDK programmer is not trying to avoid collision but promote calculation speed. – caisil Jul 16 '17 at 13:28
  • 1
    @caisil As Tom pointed out, this code comes from JDK8's HashMap. The Javadoc comment states the purpose of the function is to 'spread the impact of the higher bits downward' and that the quality of spreading comes at the cost of speed. To be clear, the index is now computed as (hashcode ^ (hashcode >>> 16)) % table.length; So it is strictly slower than the naive calculation. – Evan Darke Jul 16 '17 at 17:26
  • No, I don't think so. From the `putVal` of HashMap, we can see the index is represented by `(n - 1) & hash`, where `hash` is equal to `hashcode ^ (hashcode >>> 16)`. These are all bit operations, no modulo, and [modulo is slower than bitwise operations](http://www.theeggeadventure.com/wikimedia/index.php/And_vs_Modulo_performance). – caisil Jul 17 '17 at 02:46
  • 2
    @caisil Yes, I'm aware. hash % table.length is mathematically equivalent to (n - 1) & hash where n = table.length and table.length is a power of two. But the bitshift and XOR that OP asked about has nothing to do with it. The bitshifting is a common technique used to improve the quality of poor hash functions. – Evan Darke Jul 17 '17 at 08:10

The reason for that is given in the JDK source comment:

Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.)

What that is saying, in simple words, is that the last bits of Key#hashCode (the ones we care about) can be the same for keys that are actually different, and that creates collisions, since those entries will end up in the same bucket.

Which bucket an Entry goes into is decided by the number of existing buckets, i.e. by the last n bits of the hash, as you have already seen from:

int index = (n - 1) & hash;

If HashMap did not re-hash like this, entries whose hashcodes differ only above those last bits would end up in the same bucket, and search time would be slower.

The reason that XOR is used is that it has a 50/50 distribution of 1s and 0s (as opposed to | or &, which have 75/25 or 25/75).
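
A tiny sketch (my own, not from the answer) that prints the truth tables behind that claim: XOR outputs 1 for exactly half of the input pairs, while | outputs 1 for three of four and & for only one of four.

public class BitDistribution {
    public static void main(String[] args) {
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                // a^b: two 1s of four (50/50); a|b: three (75/25); a&b: one (25/75)
                System.out.printf("a=%d b=%d  a^b=%d  a|b=%d  a&b=%d%n",
                        a, b, a ^ b, a | b, a & b);
            }
        }
    }
}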

And the & operation is used instead of % not only for speed, but also because hashcodes are ints and can be negative: in Java, modulo of a negative number is negative, meaning a negative bucket... Thus & is used, which always generates a non-negative index.
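
A short sketch showing the difference (the hashcode value and class name are hypothetical): in Java, % keeps the sign of the dividend, so a negative hashcode yields a negative remainder, while masking always stays within the table.

public class NegativeHashDemo {
    public static void main(String[] args) {
        int hash = -123456789;  // hashCode() may legitimately be negative
        int tableLength = 16;   // power of two
        System.out.println(hash % tableLength);        // prints -5: invalid bucket index
        System.out.println(hash & (tableLength - 1));  // prints 11: valid index in [0, 16)
    }
}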

Eugene
  • As you say, "the reason that XOR is used is that it has a 50/50 distribution of 1s and 0s". Since the goal is a nicely distributed hash function, why not choose a random 32-bit value like 101001....11 and XOR it with the original key.hashCode()? Just curious. By spreading the higher 16 bits downward, won't the higher bits be omitted in the JDK if table.length is larger than 2^16? – caisil Jul 17 '17 at 22:22
  • @caisil What do you mean, they will be omitted? Even if there are more than 2^16 buckets, the bits that decide the bucket are still taken into consideration, I don't get it... And about the random number: if your hashcodes all have the same last 16 bits, XOR with a random number will yield the same result (in the last 16 bits) for all those entries, and obviously you don't want that. – Eugene Jul 18 '17 at 08:12

Using h ^ (h >>> 16) shifts the higher-order bits of the hashCode to the right and spreads their effect into the lower bits via XOR, so that those bits actually participate in the index calculation and help avoid collisions.
It is clearly explained, with examples, at this link: https://jvmaware.com/hashcode-calculation/

Zied Yazidi