
As per the following linked document: Java HashMap Implementation

I'm confused by the implementation of HashMap (or rather, by an enhancement in HashMap). My queries are:

Firstly

static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;

Why and how are these constants used? I'd like some clear examples of this. How do they achieve a performance gain?

Secondly

If you see the source code of HashMap in JDK, you will find the following static inner class:

static final class TreeNode<K, V> extends LinkedHashMap.Entry<K, V> {
    TreeNode<K, V> parent;  // red-black tree links
    TreeNode<K, V> left;
    TreeNode<K, V> right;
    TreeNode<K, V> prev;    // needed to unlink next upon deletion
    boolean red;

    TreeNode(int hash, K key, V val, Node<K, V> next) {
        super(hash, key, val, next);
    }

    /**
     * Returns root of tree containing this node.
     */
    final TreeNode<K, V> root() {
        for (TreeNode<K, V> r = this, p; ; ) {
            if ((p = r.parent) == null)
                return r;
            r = p;
        }
    }
    //...
}

How is it used? I just want an explanation of the algorithm.

Andrew Tobilko
Hasnain Ali Bohra

6 Answers


HashMap contains a certain number of buckets. It uses hashCode to determine which bucket to put each entry into. For simplicity's sake, imagine it as a modulus.

If our hashCode is 123456 and we have 4 buckets, 123456 % 4 = 0, so the item goes in the first bucket, Bucket 1.
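
As a quick sketch of that mental model (a simplification; the real implementation uses a bitmask rather than %, as other answers explain, and BucketDemo is a made-up class name):

public class BucketDemo {
    public static void main(String[] args) {
        int hashCode = 123456;
        int buckets = 4;
        // Simplified model of bucket selection; HashMap really uses (n - 1) & hash
        int index = hashCode % buckets;
        System.out.println("Goes into bucket index " + index); // prints 0, i.e. Bucket 1
    }
}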

[diagram: HashMap buckets]

If our hashCode function is good, it should provide an even distribution so that all the buckets will be used somewhat equally. In this case, the bucket uses a linked list to store the values.

[diagram: linked buckets]

But you can't rely on people to implement good hash functions. People will often write poor hash functions which will result in a non-even distribution. It's also possible that we could just get unlucky with our inputs.

[diagram: badly distributed HashMap]

The less even this distribution is, the further we're moving from O(1) operations and the closer we're moving towards O(n) operations.

The implementation of HashMap tries to mitigate this by organising some buckets into trees rather than linked lists if the buckets become too large. This is what TREEIFY_THRESHOLD = 8 is for. If a bucket contains more than eight items, it should become a tree.

[diagram: tree bucket]

This tree is a Red-Black tree, presumably chosen because it offers some worst-case guarantees. It is first sorted by hash code. If the hash codes are the same, it uses the compareTo method of Comparable if the objects implement that interface, else the identity hash code.
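
Here is a sketch of that ordering logic (a simplification, not the literal JDK source; the real logic lives in HashMap.putTreeVal and HashMap.tieBreakOrder, and compareForTree is a made-up name):

static int compareForTree(Object a, Object b, int hashA, int hashB) {
    if (hashA != hashB)
        return Integer.compare(hashA, hashB);            // 1. sort by hash code
    if (a instanceof Comparable && a.getClass() == b.getClass()) {
        @SuppressWarnings("unchecked")
        int cmp = ((Comparable<Object>) a).compareTo(b); // 2. then by compareTo, if comparable
        if (cmp != 0)
            return cmp;
    }
    return Integer.compare(System.identityHashCode(a),   // 3. last resort: identity hash code
                           System.identityHashCode(b));
}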

If entries are removed from the map, the number of entries in the bucket might reduce such that this tree structure is no longer necessary. That's what the UNTREEIFY_THRESHOLD = 6 is for. If the number of elements in a bucket drops to six or fewer, we might as well go back to using a linked list.

Finally, there is the MIN_TREEIFY_CAPACITY = 64.

When a hash map grows in size, it automatically resizes itself to have more buckets. If we have a small HashMap, the likelihood of us getting very full buckets is quite high, because we don't have that many different buckets to put stuff into. It's much better to have a bigger HashMap, with more buckets that are less full. This constant basically says not to start making buckets into trees if our HashMap is very small - it should resize to be larger first instead.
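
You can see this decision at the top of the JDK's treeifyBin method (abridged here; comments mine):

final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize(); // too few buckets: grow the table instead of treeifying
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        // ... convert this bin's linked list into TreeNodes ...
    }
}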


To answer your question about the performance gain, these optimisations were added to improve the worst case. You would probably only see a noticeable performance improvement because of these optimisations if your hashCode function was not very good.

It is designed to protect against bad hashCode implementations and also provides basic protection against collision attacks, where a bad actor may attempt to slow down a system by deliberately selecting inputs which occupy the same buckets.
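
To make the attack scenario concrete, here is a sketch (CollidingKey and CollisionDemo are made-up classes) in which every key lands in the same bucket; since Java 8 the bucket treeifies, so lookups stay O(log n) instead of degrading to O(n):

import java.util.HashMap;
import java.util.Map;

final class CollidingKey implements Comparable<CollidingKey> {
    final int id;
    CollidingKey(int id) { this.id = id; }
    @Override public int hashCode() { return 42; } // deliberate collision
    @Override public boolean equals(Object o) {
        return o instanceof CollidingKey && ((CollidingKey) o).id == id;
    }
    @Override public int compareTo(CollidingKey other) {
        return Integer.compare(id, other.id); // lets the tree order the keys
    }
}

public class CollisionDemo {
    public static void main(String[] args) {
        Map<CollidingKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 100_000; i++)
            map.put(new CollidingKey(i), i);
        // Pre-Java-8, this lookup would walk a 100,000-element list (O(n));
        // with treeified bins it is a tree search (O(log n)).
        System.out.println(map.get(new CollidingKey(99_999)));
    }
}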

Michael
  • Could you please help me understand why Java chose 8 as the threshold to treeify, and not some other number like 5 or 10? Also, why use a linked list initially and only switch to a tree once the number of elements crosses 8, rather than using a tree from the start? – Vishwas Tyagi Jun 07 '22 at 12:25
  • @VishwasTyagi The chosen threshold was likely determined by trial and error. They likely tried plenty of thresholds and determined experimentally that 8 was best. The reason that trees aren't used initially is because trees incur overhead which is not worth incurring if the number of elements is low. – Michael Jun 07 '22 at 13:28
  • Another major reason a red-black tree is used is that it is a balanced tree, so skewed-tree scenarios won't occur and the worst case of fetching an element from the tree is O(log(N)). – Mandar Autade Aug 02 '22 at 18:31
  • @VishwasTyagi The reason for using a linked list is that adding an element to the head of a linked list is cheap and guaranteed to be O(1). Adding elements to other types of containers doesn't always have this performance. For example, adding an element to a full array requires resizing that array first. Adding an element to an AVL tree is O(log n). – Aetherus Jul 15 '23 at 02:10

To put it more simply (as simply as I can), plus some more details.

These constants depend on a lot of internal things that would be very cool to understand before moving to the constants directly.

TREEIFY_THRESHOLD -> when a single bucket reaches this (and the total number of buckets is at least MIN_TREEIFY_CAPACITY), it is transformed into a self-balancing red-black tree. Why? Because of search speed. Think about it in a different way:

it would take at most 32 steps to search for an Entry within a bucket/bin with Integer.MAX_VALUE entries.
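
A quick back-of-the-envelope check of that claim (a balanced binary search tree over m entries has height about log2(m); TreeDepthDemo is a made-up class name):

public class TreeDepthDemo {
    public static void main(String[] args) {
        // log2(Integer.MAX_VALUE) = log2(2^31 - 1), which is just under 31,
        // so a search needs at most roughly 31-32 comparisons.
        double depth = Math.log(Integer.MAX_VALUE) / Math.log(2);
        System.out.println(depth); // prints 30.999999...
    }
}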

Some intro for the next topic. Why is the number of bins/buckets always a power of two? There are at least two reasons: a bitmask is faster than the modulo operation, and modulo on negative numbers is negative. And you can't put an Entry into a "negative" bucket:

 int arrayIndex = hashCode % numberOfBuckets; // negative when hashCode is negative

 buckets[arrayIndex] = entry; // fails with ArrayIndexOutOfBoundsException

Instead, there is a nice trick used in place of modulo:

 (n - 1) & hash // n is the number of bins, hash is the hash of the key

When n is a power of two, that is semantically the same as the modulo operation: it keeps the lower bits. This has an interesting consequence when you do:

Map<String, String> map = new HashMap<>();

In the case above (the default capacity is 16 buckets), the decision of where an entry goes is based on the last 4 bits of your hash code only.

This is where doubling the number of buckets comes into play. Under certain conditions (it would take a lot of time to explain in exact detail), the number of buckets is doubled. Why? Because when the number of buckets doubles, one more bit comes into play.

So you have 16 buckets - the last 4 bits of the hash code decide where an entry goes. You double the buckets: with 32 buckets, the last 5 bits decide where an entry goes.
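
A minimal sketch of that masking behaviour (MaskDemo is a made-up class name):

public class MaskDemo {
    public static void main(String[] args) {
        int hash = "pauseless".hashCode();      // any hash code, possibly negative
        System.out.println((16 - 1) & hash);    // index with 16 buckets: last 4 bits
        System.out.println((32 - 1) & hash);    // index with 32 buckets: last 5 bits
        // Unlike hash % n, the mask never produces a negative index,
        // even when the hash itself is negative:
        System.out.println((16 - 1) & -123456); // still within [0, 15]
    }
}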

This process is called re-hashing, and it might get slow. That is (for people who care) why HashMap is "joked" about as: fast, fast, fast, slooow. There are other implementations - search for pauseless hashmap...

Now UNTREEIFY_THRESHOLD comes into play after re-hashing. At that point, some entries might move from one bin to others (one more bit joins the (n - 1) & hash computation, so entries might land in different buckets), and a treeified bin might shrink to UNTREEIFY_THRESHOLD entries. At that point it no longer pays off to keep the bin as a red-black tree, so it becomes a LinkedList again, like

 entry.next.next....

MIN_TREEIFY_CAPACITY is the minimum number of buckets before a certain bucket is transformed into a Tree.

nikiforovpizza
Eugene

TreeNode is an alternative way to store the entries that belong to a single bin of the HashMap. In older implementations the entries of a bin were stored in a linked list. In Java 8, if the number of entries in a bin passes a threshold (TREEIFY_THRESHOLD), they are stored in a tree structure instead of the original linked list. This is an optimization.

From the implementation:

/*
 * Implementation notes.
 *
 * This map usually acts as a binned (bucketed) hash table, but
 * when bins get too large, they are transformed into bins of
 * TreeNodes, each structured similarly to those in
 * java.util.TreeMap. Most methods try to use normal bins, but
 * relay to TreeNode methods when applicable (simply by checking
 * instanceof a node).  Bins of TreeNodes may be traversed and
 * used like any others, but additionally support faster lookup
 * when overpopulated. However, since the vast majority of bins in
 * normal use are not overpopulated, checking for existence of
 * tree bins may be delayed in the course of table methods.
 */
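
The "checking instanceof a node" part looks like this in the lookup path (abridged and simplified from java.util.HashMap.getNode, not the literal source):

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab = table;
    if (tab == null || tab.length == 0)
        return null;
    Node<K,V> first = tab[(tab.length - 1) & hash];
    if (first instanceof TreeNode)                             // treeified bin:
        return ((TreeNode<K,V>) first).getTreeNode(hash, key); // O(log n) tree search
    for (Node<K,V> e = first; e != null; e = e.next)           // normal bin: list walk
        if (e.hash == hash && (key == e.key || (key != null && key.equals(e.key))))
            return e;
    return null;
}
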
Eran
  • not *exactly* true. If they pass `TREEIFY_THRESHOLD` *AND* the total number of bins is at least `MIN_TREEIFY_CAPACITY`. I've tried to cover that in my answer... – Eugene May 11 '17 at 20:21

You would need to visualize it: say there is a class Key with only the hashCode() method overridden to always return the same value:

public class Key implements Comparable<Key>{

  private String name;

  public Key (String name){
    this.name = name;
  }

  @Override
  public int hashCode(){
    return 1;
  }

  public String keyName(){
    return this.name;
  }

  public int compareTo(Key key){
    // illustrative implementation: order keys by name
    // (returns a +ve, -ve, or zero integer)
    return this.name.compareTo(key.name);
  }

}

and then somewhere else, I insert 9 entries into a HashMap, with all keys being instances of this class, e.g.

Map<Key, String> map = new HashMap<>();

Key key1 = new Key("key1");
map.put(key1, "one");

Key key2 = new Key("key2");
map.put(key2, "two");
Key key3 = new Key("key3");
map.put(key3, "three");
Key key4 = new Key("key4");
map.put(key4, "four");
Key key5 = new Key("key5");
map.put(key5, "five");
Key key6 = new Key("key6");
map.put(key6, "six");
Key key7 = new Key("key7");
map.put(key7, "seven");
Key key8 = new Key("key8");
map.put(key8, "eight");
// Since hashCode is the same, all entries will land in the same bucket, let's call it bucket 1. Up to here, all entries in bucket 1 are arranged in a LinkedList structure, e.g. key1 -> key2 -> key3 -> ... and so on. But when I insert one more entry:

Key key9 = new Key("key9");
map.put(key9, "nine");

the threshold value of 8 will be crossed, and the bucket 1 entries will be rearranged into a (red-black) tree structure, replacing the old linked list, e.g.

                  key1
                 /    \
               key2   key3
              /   \   /  \

Tree search {O(log n)} is faster than LinkedList traversal {O(n)}, and as n grows, the difference becomes more significant.
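
For completeness, lookups are unchanged from the caller's point of view; only the bucket's internal layout changed:

System.out.println(map.get(key9)); // "nine" - found via a tree search, not a list walk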

rentedrainbow
  • It can't possibly build an efficient tree because it has no way to compare keys other than their hashcodes, which are all the same, and their equals method, which doesn't help with ordering. – user253751 May 11 '17 at 11:55
  • @immibis Their hashcodes are not necessarily the same. They're quite likely different. If the classes implement it, it will additionally use `compareTo` from `Comparable`. `identityHashCode` is another mechanism it uses. – Michael May 11 '17 at 12:02
  • @Michael In this example all the hashcodes are necessarily the same and the class does not implement Comparable. identityHashCode will be worthless in finding the correct node. – user253751 May 11 '17 at 12:07
  • @immibis Ah yes, I only skimmed it but you're right. So, as `Key` does not implement `Comparable`, `identityHashCode` will be used :) – Michael May 11 '17 at 12:09
  • @EmonMishra unfortunately, *simply* visualising it will not be enough; I've tried to cover that in my answer. – Eugene May 11 '17 at 20:22
  • `static final int MIN_TREEIFY_CAPACITY = 64;` - where does this variable come into the picture? As per the above answer, buckets will not be converted to a tree unless the number of buckets has reached `MIN_TREEIFY_CAPACITY = 64`. – Nisarg Patil Apr 29 '18 at 09:30

The change in the HashMap implementation was added with JEP-180. The purpose was to:

Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. Implement the same improvement in the LinkedHashMap class

However, pure performance is not the only gain. It also prevents the HashDoS attack, in case a hash map is used to store user input, because the red-black tree used to store data in the bucket has a worst-case insertion complexity of O(log n). The tree is used after certain criteria are met - see Eugene's answer.

Anton Krosnev

To understand the internal implementation of HashMap, you need to understand hashing. Hashing, in its simplest form, is a way of assigning a unique code to any variable/object after applying a formula/algorithm to its properties.

A true hash function must follow this rule:

“Hash function should return the same hash code each and every time when the function is applied on same or equal objects. In other words, two equal objects must produce the same hash code consistently.”
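
As an illustration of that rule (Point is a made-up example class), the hash code below is derived only from the fields that equals compares, so two equal points always produce the same hash code:

public final class Point {
    private final int x, y;

    public Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
    }

    @Override public int hashCode() {
        return 31 * x + y; // uses exactly the fields compared by equals
    }
}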