12

In Java 8's java.util.HashMap I noticed a change from:

static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

to:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

It appears from the code that the new function is a simpler XOR of the lower 16 bits with the upper 16, leaving the upper 16 bits unchanged, as opposed to the several different shifts in the previous implementation. The comments suggest this is less effective at distributing hash codes that collide in the lower bits across different buckets, but saves CPU cycles by doing fewer operations.
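
To check my reading, here is a small standalone sketch (my own reimplementation of the two mixing steps as hash7 and hash8, not the actual JDK code) that shows what happens to hash codes differing only in their upper bits once they are reduced to a bucket index with the usual (table.length - 1) & hash mask:

public class HashMixDemo {

    // Shape of the Java 7 supplemental hash.
    static int hash7(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Shape of the Java 8 spread: XOR the upper 16 bits into the lower 16.
    static int hash8(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int buckets = 16;                      // small table: index uses only the low 4 bits
        for (int i = 0; i < 8; i++) {
            int h = i << 16;                   // hash codes that differ only above bit 15
            System.out.printf("h=%08x  raw->%d  java7->%d  java8->%d%n",
                    h,
                    h & (buckets - 1),         // no mixing: every key lands in bucket 0
                    hash7(h) & (buckets - 1),  // old mixing spreads them out
                    hash8(h) & (buckets - 1)); // new mixing also spreads them out
        }
    }
}

With no mixing every key lands in bucket 0; both mixing variants spread them out, the Java 8 version just does it with a single shift and XOR.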

The only related change I saw in the release notes was the switch from linked lists to balanced trees for storing colliding keys (which I thought might have changed how much time it makes sense to spend calculating a good hash). I am specifically interested in whether this change has any expected performance impact on large hash maps. Is there any information about this change? Does anyone with better knowledge of hash functions have an idea of the implications (if any; perhaps I just misunderstood the code), and whether there is any need to generate hash codes differently to maintain performance when moving to Java 8?

user207421
MilesHampson

3 Answers

9

As you noted, there is a significant performance improvement in HashMap in Java 8, as described in JEP-180. Basically, if a hash chain goes over a certain size, the HashMap will (where possible) replace it with a balanced binary tree. This makes the "worst case" behaviour of various operations O(log N) instead of O(N).

This doesn't directly explain the change to hash. However, I would hypothesize that the optimization in JEP-180 means that the performance hit due to a poorly distributed hash function is less important, and that the cost-benefit analysis for the hash method changes; i.e. the more complex version is less beneficial on average. (Bear in mind that when the key type's hashcode method generates high quality codes, then the gymnastics in the complex version of the hash method are a waste of time.)
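
A rough way to see why a badly distributed hashCode hurts much less on Java 8 is to force every key into one bucket and time a lookup. The sketch below is my own illustration, not a rigorous benchmark: it exploits the fact that "Aa" and "BB" have the same String hash code, so any concatenation of those two blocks collides, and it relies on String being Comparable so the treeified bucket can be ordered.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CollidingKeysDemo {

    // Build 2^n String keys that all share one hashCode(): "Aa" and "BB"
    // hash identically, so every concatenation of those blocks collides.
    static List<String> collidingKeys(int n) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < n; i++) {
            List<String> next = new ArrayList<>(keys.size() * 2);
            for (String k : keys) {
                next.add(k + "Aa");
                next.add(k + "BB");
            }
            keys = next;
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = collidingKeys(14);     // 16384 keys, one shared hash code
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            map.put(keys.get(i), i);
        }

        String probe = keys.get(keys.size() - 1);
        long start = System.nanoTime();
        for (int i = 0; i < 1_000; i++) {
            map.get(probe);                        // every probe hits the same bucket
        }
        System.out.println("1000 colliding lookups took "
                + (System.nanoTime() - start) + " ns");
    }
}

On Java 8 the single overloaded bucket is converted to a balanced tree, so these lookups stay fast; on Java 7 each one has to scan a linked list holding all 16384 entries.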

But this is only a theory. The real rationale for the hash change is most likely Oracle confidential.

Stephen C
  • When the size of the 'hash chain' goes above a limit, it shifts to 'balanced tree' from linked lists to be specific. Thereby, the worst case operations take O(n) time instead of O(n). – darkdefender27 Aug 13 '17 at 11:55
  • @darkdefender27 - your comment doesn't make sense. 1) How is O(n) better than O(n) ? 2) It actually goes to O(logn) ! 3) That's what I said in my answer ... – Stephen C Aug 13 '17 at 12:25
  • Oh I'm sorry, I meant O(log n). Your answer completely makes sense. I was just trying to state that it switches to 'Balanced' Binary Search Tree. – darkdefender27 Aug 14 '17 at 10:17
  • I've added that. Thanks. – Stephen C Aug 14 '17 at 14:46
  • I wonder how bad a hashcode should be (except a constant, of course) to gain any real profit from balanced tree? Even if all objects get into one bucket, sooner or later (due to load-factor) the underlying array would grow and the objects would get re-distributed again. To me, in one bucket, there should be 1000+ objects to make any use of tree. But in a real world app, is it the case? – Andreas Gelever Sep 07 '17 at 09:33
  • 1
    Where do you get the 1000+ figure from? Anyhow, one real-world use case is when the hash table keys are produced by a client code, and someone (a hacker) is deliberately manufacturing keys with the same hashcode in order fill a bucket and *thereby* implement a form of denial-of-service attack! – Stephen C Sep 07 '17 at 10:09
2

When I measured the difference between the two hash implementations, I saw timings in nanoseconds as below (not a big difference, but it can have some effect when the map is huge, ~1 million+ entries):

7473 ns – Java 7

3981 ns – Java 8

If we are talking about well-formed keys and a HashMap of large size (~a million entries or more), this might have some impact, and it comes from the simplified logic.
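
For what it's worth, here is a rough sketch of the kind of micro-benchmark that produces numbers like these (my own code, with hash7/hash8 as illustrative names; absolute results will vary a lot with JIT warm-up and hardware, and a serious comparison should use a harness such as JMH):

public class HashMixBenchmark {

    // Shape of the Java 7 supplemental hash.
    static int hash7(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Shape of the Java 8 spread.
    static int hash8(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int iterations = 10_000_000;
        int sink = 0;                              // keeps the JIT from removing the loops

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink ^= hash7(i);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink ^= hash8(i);
        }
        long t2 = System.nanoTime();

        System.out.println("Java 7 style: " + (t1 - t0) + " ns total (sink=" + sink + ")");
        System.out.println("Java 8 style: " + (t2 - t1) + " ns total");
    }
}

A few nanoseconds per key only start to matter once millions of keys are being hashed, which matches the caveat above.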

Nrj
0

As the Java documentation says, the idea is to handle the situation where the old linked-list implementation performs at O(n) instead of O(1). This happens when the same hash code is generated for a large set of keys being inserted into the HashMap.

This is not a normal scenario, though. To handle it, once the number of items in a hash bucket grows beyond a certain threshold, that bucket switches from using a linked list of entries to a binary tree. In the case of high hash collisions, this improves search performance from O(n) to O(log n), which is much better and solves the performance problem.
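
One practical note (my own illustration; the class names below are made up): the tree bins work best when the colliding keys implement Comparable, because the tree can then be ordered by the keys' natural ordering. The deliberately pathological key below sends every entry to one bucket, which in the Java 8 OpenJDK sources is treeified once it holds more than a handful of entries (TREEIFY_THRESHOLD is 8, provided the table itself is big enough):

import java.util.HashMap;
import java.util.Map;

public class BadKeyDemo {

    // Pathological key: every instance has the same hash code, so all of them
    // land in the same bucket. Implementing Comparable lets the Java 8 tree bin
    // order the colliding keys instead of scanning them linearly.
    static final class BadKey implements Comparable<BadKey> {
        final int id;

        BadKey(int id) { this.id = id; }

        @Override public int hashCode() { return 42; }      // constant: worst case

        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override public int compareTo(BadKey other) {
            return Integer.compare(id, other.id);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++) {
            map.put(new BadKey(i), i);          // all 100k keys collide into one bucket
        }
        // On Java 8+ this lookup is O(log n) thanks to the treeified bucket;
        // on Java 7 it would scan a 100k-element linked list.
        System.out.println(map.get(new BadKey(99_999)));
    }
}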

AmitG