
In Java's HashMap implementation, after the hash code of the key object is computed, I guess that a supplemental function, hash(int h), is applied to it to further randomize the hash values generated. Since it only does right shifts, it randomizes the LSB bits more. Why don't they randomize the MSB bits too by performing left shifts as well?
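
For reference, here is the supplemental hash roughly as it appears in the JDK 6/7 java.util.HashMap source, together with the index computation that consumes it (the comments are my own summary):

    // Supplemental hash, roughly as in the JDK 6/7 java.util.HashMap source.
    // Because the table length is a power of two, only the low-order bits of
    // the result end up selecting the bucket, so this spreads the influence
    // of the higher-order bits downward.
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Bucket selection, roughly as in the same source: with a power-of-two
    // table length this is just a mask of the low-order bits.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }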

Instead of doing something like

  • h ^= (h >>> 20) ^ (h >>> 12); // The bits from position 21 (counting from the right) upward are left unchanged by this line. Consider a case where the hash values of the inputs all lie in the range 2^21 to 2^22: the values will be distributed only among the buckets with indices from 2^21 to 2^22, and the remaining buckets in the hash map go unused.

  • return h ^ (h >>> 7) ^ (h >>> 4); // which randomizes the LSB part more.

Why can't they do something like

  • h ^= (h >>> 20) ^ (h >>> 12) ^ (h << 20) ^ (h << 12); // this randomizes both the LSB and the MSB bits

  • return h ^ (h >>> 7) ^ (h >>> 4) ^ (h << 7) ^ (h << 4);

As a result we would get a value whose MSB and LSB bits are both randomized.
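
Spelled out as valid Java (using <<, since Java has no <<< operator), the variant I have in mind would be something along these lines; this is purely my own sketch, not anything from the JDK:

    // Hypothetical alternative (NOT JDK code): also fold low-order bits
    // upward with left shifts, so the high-order bits of the result get
    // mixed as well as the low-order ones.
    static int hashMixingBothEnds(int h) {
        h ^= (h >>> 20) ^ (h >>> 12) ^ (h << 20) ^ (h << 12);
        return h ^ (h >>> 7) ^ (h >>> 4) ^ (h << 7) ^ (h << 4);
    }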

Hubert Vijay
  • You're asking the wrong people. Anything you get here will be guesswork unless one of the authors responds, which isn't likely. Not constructive. – user207421 Apr 14 '13 at 00:04
  • Hash values aren't randomized (by design). Also, the internal working of HashMap is completely different from what you described above. – Powerslave Apr 14 '13 at 00:04
  • @Powerslave: The hash values are not randomized by design, as you said. But we need randomized indexes to distribute the values across the hash map evenly and avoid collisions, so the hash(int h) method is applied to the hash value we obtain to randomize it further. – Hubert Vijay Apr 14 '13 at 00:27
  • Read the comments: "Applies a supplemental hash function to a given hashCode, which defends against poor quality hash functions. This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits. Note: Null keys always map to hash 0, thus index 0." – Hot Licks Apr 14 '13 at 00:29
    And: "This function ensures that hashCodes that differ only by constant multiples at each bit position have a bounded number of collisions (approximately 8 at default load factor)." – Hot Licks Apr 14 '13 at 00:31
  • @HubertVijay Hash codes should not, in any case, be random. Hash collisions are dealt with by `HashMap` internally, by not only using `hashCode()` method, but also `equals(Object o)` on the key and keeping key-value pairs with identical key hash codes in a `List` dedicated to that hash. See [this page](http://howtodoinjava.com/2012/10/09/how-hashmap-works-in-java/) and/or [this question](http://stackoverflow.com/questions/6493605/how-does-java-hashmap-work) for the details. – Powerslave Apr 14 '13 at 00:36
  • @Powerslave: The key.equals(k) method is used to find the element with the same key after a hash collision (multiple keys falling into the same bucket) occurs. But I am curious about avoiding the collision in the first place by distributing the keys into different buckets. – Hubert Vijay Apr 14 '13 at 00:46
  • @HubertVijay That would pretty much circumvent the design of `HashMap`. If you want to do that, implement an `ExtendedHashMap` that uses `long` hashes (a new `extendedHash` method) instead of int or so. Hash codes are designed to be unique and *deterministic* for a given value, so they cannot be random. The only thing you can do is to choose a larger codespace. – Powerslave Apr 14 '13 at 00:51
  • @HotLicks: As you said, this function ensures a bounded number of collisions in each bucket. Assume a hash map with load factor 0.5, 2^20 entries and 2^21 buckets. If the hash values generated from the inputs are mostly within the range 0 to 2^20, then, since we only right-shift and XOR, we randomize the LSB bits of the index, i.e. we randomly distribute the values among the buckets from 0 to 2^20. But the buckets from 2^20 to 2^21 are not used for the distribution at all. Only values with a hash value greater than 2^20 will be put into buckets with an index greater than 2^20. – Hubert Vijay Apr 14 '13 at 00:54
  • I think the authors were working with the practical knowledge that object hash functions tend to be deficient by having poorly distributed low-order bits. They were addressing a specific "real" problem that they had encountered, vs some theoretical problem. – Hot Licks Apr 14 '13 at 01:15
  • (But one does wonder why they didn't simply take the original hash value modulo some large prime, vs doing the bit-twiddling.) – Hot Licks Apr 14 '13 at 01:16
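
To make the power-of-two point from the comments concrete, here is a small self-contained demo (my own sketch; hash and indexFor mirror the JDK 6/7 methods quoted above). Hash codes that differ only in their high-order bits would all collide in bucket 0 under a plain mask; the right shifts in the supplemental hash fold those high-order differences down into the bits the mask keeps. Left shifts move information in the opposite direction, into bits the mask throws away, which is presumably why they were not added:

    public class SupplementalHashDemo {
        // Supplemental hash, roughly as in the JDK 6/7 java.util.HashMap source.
        static int hash(int h) {
            h ^= (h >>> 20) ^ (h >>> 12);
            return h ^ (h >>> 7) ^ (h >>> 4);
        }

        // Bucket index for a power-of-two table length: a mask of the low bits.
        static int indexFor(int h, int length) {
            return h & (length - 1);
        }

        public static void main(String[] args) {
            int buckets = 16; // small power-of-two table, for illustration

            // Hypothetical hash codes that differ only in bits 24..26, i.e.
            // only in the high-order part of the int.
            for (int i = 0; i < 8; i++) {
                int rawHash = i << 24;
                int maskedOnly = indexFor(rawHash, buckets);       // always 0: every key collides
                int spread     = indexFor(hash(rawHash), buckets); // differs for each i
                System.out.printf("raw=0x%08x  mask-only bucket=%d  supplemental-hash bucket=%d%n",
                        rawHash, maskedOnly, spread);
            }
        }
    }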

0 Answers