30

How does Java 8's HashMap degenerate to balanced trees when many keys have the same hash code? I read that keys should implement Comparable to define an ordering. How does HashMap combine hashing and natural ordering to implement the trees? What about classes that do not implement Comparable, or when multiple, non-mutually-comparable Comparable implementations are keys in the same map?

Jeffrey Bosboom
  • 13,313
  • 16
  • 79
  • 92
Thennam
  • 309
  • 1
  • 3
  • 4

3 Answers3

34

The implementation notes comment in HashMap is a better description of HashMap's operation than I could write myself. The relevant parts for understanding the tree nodes and their ordering are:

This map usually acts as a binned (bucketed) hash table, but when bins get too large, they are transformed into bins of TreeNodes, each structured similarly to those in java.util.TreeMap. [...] Bins of TreeNodes may be traversed and used like any others, but additionally support faster lookup when overpopulated. [...]

Tree bins (i.e., bins whose elements are all TreeNodes) are ordered primarily by hashCode, but in the case of ties, if two elements are of the same "class C implements Comparable" type then their compareTo method is used for ordering. (We conservatively check generic types via reflection to validate this -- see method comparableClassFor). The added complexity of tree bins is worthwhile in providing worst-case O(log n) operations when keys either have distinct hashes or are orderable, Thus, performance degrades gracefully under accidental or malicious usages in which hashCode() methods return values that are poorly distributed, as well as those in which many keys share a hashCode, so long as they are also Comparable. (If neither of these apply, we may waste about a factor of two in time and space compared to taking no precautions. But the only known cases stem from poor user programming practices that are already so slow that this makes little difference.)

When two objects have equal hash codes but are not mutually comparable, method tieBreakOrder is invoked to break the tie, first by string comparison on getClass().getName() (!), then by comparing System.identityHashCode.

The actual tree building starts in treeifyBin, beginning when a bin reaches TREEIFY_THRESHOLD (currently 8), assuming the hash table has at least MIN_TREEIFY_CAPACITY capacity (currently 64). It's a mostly-normal red-black tree implementation (crediting CLR), with some complications to support traversal in the same way as hash bins (e.g., removeTreeNode).

Community
  • 1
  • 1
Jeffrey Bosboom
  • 13,313
  • 16
  • 79
  • 92
  • I agree with Jeffrey's answer, but interesting thing around this. identityHashCode gives the hashCode() which is irrespective of Objects overridden hashcode (make sense as hash code already collided). – Rohit Sachan Jan 06 '16 at 16:53
  • 1
    But if hashcode() is not being overridden in object then it will anyway will call System.identityHashCode() to find hashbucket. So what will happen in case key doesn't override hashcode() and tieBreakOrder been called? or will it ever be called ? – Rohit Sachan Jan 06 '16 at 17:00
  • 2
    @RohitSachan If the key type is using the default identity semantics, it's highly unlikely that any bucket will become full enough to convert to a tree bucket, because identity hash codes are well-distributed. But there's no special handling for that case, so yes, `tieBreakOrder` will still call `identityHashCode` for the ordering. – Jeffrey Bosboom Jan 06 '16 at 20:48
  • System.identityHashcode returns an int, can't be very sure that it will never collide. What happens then ? – Rohit Sachan Jan 06 '16 at 23:23
  • 1
    @Rohit Sachan: if every optimization attempt fails, it has to resort to the same linked list behavior known from versions prior to Java 8. – Holger Jan 30 '17 at 18:20
  • Why is it safe to use `System.identityHashCode`? If you make a new key object that matches a key in the map via `equals`, it will have a completely different hash code in this case. Wouldn't that make it impossible to find the key? – Nate Glenn Aug 10 '17 at 09:57
  • 1
    @NateGlenn [`find`](http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/a006fa0a9e8f/src/share/classes/java/util/HashMap.java#l1842) seems to look both ways in the non-comparable case (the last else-if and else), but I'd need to step through in a debugger to be sure. That raises the question of why an order is necessary at all in that case, though. – Jeffrey Bosboom Aug 10 '17 at 10:07
  • I see, yes that would seem to fix it. Care to speculate?: why is there a shortcut case for `String` in `comparableClassFor`, but not for any other obvious interfaces like `Number`? For that matter, `String` is controlled by the JDK and has a good `hashCode` implementation. Is this perhaps for adversarial subclasses of `String` with purposely bad `hashCode` implementations? – Nate Glenn Aug 10 '17 at 11:38
  • 1
    @NateGlenn Speculation: String is the most common JDK-controlled type. Each additional class check gives less benefit in the special cases and increases the cost in the general case, and maybe just String is the optimal tradeoff. Note that String is final, so it's not for subclass defense. It's probably to prevent DoS attacks based on hash collisions; the String hashCode has not always been the best. There was at some point a big list of strings that hashed to MIN_VALUE, including "polygenelubricants". I imagine collisions at other values are also possible. – Jeffrey Bosboom Aug 10 '17 at 20:21
  • @Jeffrey Bosboom: since strings have a practically unlimited value space (more than 65536²¹⁴⁷⁴⁸³⁶⁴⁷ different strings), hash code collisions (only 2³² different hash codes) are unavoidable. And it’s a sign of a good hash code, if the collisions are distributed over all possible hash codes as well. So yes, collisions at other values *are* also possible. The problem with `String`’s hash code is not its quality but the fact that the algorithm is part of the specification (e.g. `switch` over strings relies on it), so it can’t be changed between implementations and attackers can predict collisions. – Holger Sep 13 '17 at 07:47
2

Read the code. It is mostly a red-black tree.

It does not actually require the implementation of Comparable, but can use it if available (see for instance the find method)

Frédéric Dumont
  • 958
  • 13
  • 19
1

HashMap has it's own hash method that applies a supplemental 2 bit lenght hash to the objects inside in order to avoid this problems:

Applies a supplemental hash function to a given hashCode, which defends against poor quality hash functions. This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits. Note: Null keys always map to hash 0, thus index 0.

If you want to see how it's done, take a look is inside the source of the HashMap class.

static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}
Jordi Castilla
  • 26,609
  • 8
  • 70
  • 109
  • 2
    Actually, no. The `hash` function reduces but does not eliminate collisions, so `HashMap` still has to manage keys with the same hash value. – Frédéric Dumont May 11 '15 at 09:50
  • You have linked to an OLD version of the source code. Look at the Java 8 code and you will see that it no longer does the supplemental rehashing. That method no longer exists. – Stephen C May 11 '15 at 10:16
  • static final int hash(Object key) - It is exists. (JDK 16) – skyho Apr 08 '22 at 06:02