
I was trying to understand whether a HashMap is rehashed when the number of occupied buckets crosses a threshold, or when the total number of entries across all buckets does. That is, with the default load factor and initial capacity, we know that if 12 of the 16 buckets are full (one entry in each bucket), the HashMap will be rehashed on the next entry. But what about the case where only 3 buckets are occupied, with 4 entries each (12 entries in total, but only 3 of the 16 buckets in use)?

So I tried to replicate this by writing the worst possible hash function, one that puts all entries in a single bucket.

Here is my code.

class X {

    public Integer value;

    public X(Integer value) {
        this.value = value;
    }

    // Worst possible hash function: every instance lands in the same bucket.
    @Override
    public int hashCode() {
        return 1;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof X)) {
            return false;
        }
        return this.value.equals(((X) obj).value);
    }

}

Then I started putting values into the HashMap:

HashMap<X, Integer> map = new HashMap<>();
map.put(new X(1), 1);
map.put(new X(2), 2);
map.put(new X(3), 3);
map.put(new X(4), 4);
map.put(new X(5), 5);
map.put(new X(6), 6);
map.put(new X(7), 7);
map.put(new X(8), 8);
map.put(new X(9), 9);
map.put(new X(10), 10);
map.put(new X(11), 11);
map.put(new X(12), 12);
map.put(new X(13), 13);
System.out.println(map.size());

All the nodes went into a single bucket as expected, but I noticed that on the 9th entry the HashMap rehashed and doubled its capacity, and on the 10th entry it doubled its capacity again.

(Screenshots: the map in the debugger with 8 entries and with 9 entries.)

Can anyone explain how this is happening?

Thanks in advance.

  • The javadoc says: *When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed*. So, the number of occupied buckets is irrelevant. What matters is the number of entries. – JB Nizet Dec 30 '17 at 08:13
  • Okay, thank you for clearing that up, but given that, the above HashMap should have resized itself on the 13th entry, yet it did so on the 9th and again on the 10th. Why did it do that? – Sanchit Saxena Dec 30 '17 at 08:17
  • Could you read the source of the map implementation? – starmole Dec 30 '17 at 08:17
  • Here is a class reproducing your experiment, first with Integers as keys (behaving as documented), then with X as keys (behaving "strangely"): https://gist.github.com/jnizet/34ca08ba0314c8e857ea9a161c175f13. HashMap changes to a tree if buckets are too crowded, and the code says: "The smallest table capacity for which bins may be treeified. (Otherwise the table is resized if too many nodes in a bin.)". So, in addition to what is documented, the table also resizes itself when buckets are crowded, and when the capacity reaches 64 it changes to tree nodes. – JB Nizet Dec 30 '17 at 08:48
  • @JBNizet - Yes, that sample is a better replication of my case. It does seem that a crowded bucket triggers a doubling of the HashMap's capacity. Do you have some source or reference where this rehashing on bucket size is explained further? – Sanchit Saxena Dec 30 '17 at 13:08
  • DON'T POST PICTURES OF TEXT HERE – user207421 Jan 10 '18 at 08:38

4 Answers


In a HashMap, entries end up in the same bucket if their hash codes map to the same bucket index. If distinct Integer objects are used as keys, their hash codes differ, because Integer.hashCode() simply returns the int value.

But in your case all the objects have the same hash code, which means, as you observed, all entries end up in a single bucket. Each bucket stores its entries in a specific data structure (a linked list or a tree), and the capacity changes are driven by that data structure as well as by the HashMap itself.

I ran JB Nizet's code (https://gist.github.com/jnizet/34ca08ba0314c8e857ea9a161c175f13/revisions) mentioned in the comments, with loop limits of 64 and 128 (adding 64 and 128 elements):

  1. While adding 64 elements: the capacity was doubled (to 128) while adding the 49th element to the HashMap (because the load factor is 0.75).
  2. While adding 128 elements: the capacity was doubled (to 256) while adding the 97th element to the HashMap (also because the load factor is 0.75).

After the capacity has grown to 64, the HashMap behaves normally.

In summary, a bucket uses a linked list up to a certain length (8 elements). After that it uses a tree data structure (which is where the capacity fluctuation comes from). The reason is that accessing a tree (O(log n)) is faster than traversing a linked list (O(n)).
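If you want to observe the resizes yourself, here is a minimal sketch, assuming the X class from the question is on the classpath. The CapacityProbe class name and the use of reflection are my own, not part of any answer here; on Java 9+ the reflective access may require --add-opens java.base/java.util=ALL-UNNAMED.

import java.lang.reflect.Field;
import java.util.HashMap;

public class CapacityProbe {

    // Reads the length of HashMap's internal "table" array via reflection.
    // A null table means the map has not yet allocated its array.
    static int capacity(HashMap<?, ?> map) throws Exception {
        Field tableField = HashMap.class.getDeclaredField("table");
        tableField.setAccessible(true);
        Object[] table = (Object[]) tableField.get(map);
        return table == null ? 0 : table.length;
    }

    public static void main(String[] args) throws Exception {
        HashMap<X, Integer> map = new HashMap<>();
        for (int i = 1; i <= 13; i++) {
            map.put(new X(i), i);
            System.out.println("entries=" + i + ", capacity=" + capacity(map));
        }
    }
}

With the all-in-one-bucket keys from the question, the printed capacity jumps at the 9th and 10th puts; with ordinary, well-distributed keys it only jumps after the 12th.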

(Image: the internal array of a Java 8 HashMap.)

This picture shows the inner array of a Java 8 HashMap with both a tree (at bucket 0) and linked lists (at buckets 1, 2 and 3). Bucket 0 is a tree because it has more than 8 nodes (read more).

The documentation on HashMap and the discussion on buckets in a HashMap would be helpful in this regard.

Nithin
  • Yes, I meant capacity only; I edited the question to correct it. The confusion is why the HashMap doubled its capacity at the 9th and 10th entries, and not at the 13th (which is what the doc suggests). @JBNizet has replicated it and provided an explanation in the comments. I am reading more online to understand better. – Sanchit Saxena Dec 30 '17 at 09:16
  • I have modified my answer. Please let me know if there are any mistakes. – Nithin Dec 30 '17 at 12:45
  • The buckets use a LinkedList or a Tree to store map entries. However, the doc only states: "When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets". So it is not clear on which criterion it doubles its capacity. – Sanchit Saxena Dec 30 '17 at 13:15
  • I have updated my answer according to JB Nizet's code and the details I learned recently. Please check the answer. – Nithin Dec 30 '17 at 15:16
  • Yes, that is right. I also debugged and found the same: the method "treeifyBin" is called on the 9th and 10th entries, which results in the doubling of the HashMap's capacity. After that, it works normally. – Sanchit Saxena Dec 30 '17 at 19:07

Responding to the comments more than to the question itself, since your comments are more relevant to what you actually want to know.

The best and most relevant answer to *where this rehashing on bucket size is explained further* is the source code itself. What you observe on the 9th entry is expected and happens in this portion of the code:

for (int binCount = 0; ; ++binCount) {
    // some irrelevant lines skipped
    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
        treeifyBin(tab, hash);
    break;
}

where TREEIFY_THRESHOLD = 8 and binCount is the number of nodes already present in the bucket.

The treeifyBin method name is a bit misleading, since it might resize the table rather than treeify a bin. Here is the relevant part of that method:

if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
    resize();

Notice that it will actually resize (read: double its size) and not build a tree until MIN_TREEIFY_CAPACITY (64) is reached.
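A minimal sketch of that decision follows, assuming the two JDK constants keep their documented values; the class and the helper actionWhenBucketIsCrowded are mine for illustration, not JDK code.

class TreeifyDecision {

    static final int TREEIFY_THRESHOLD = 8;      // bucket length that triggers treeifyBin
    static final int MIN_TREEIFY_CAPACITY = 64;  // table size below which HashMap resizes instead

    // Conceptually what treeifyBin decides once a bucket holds TREEIFY_THRESHOLD nodes.
    static String actionWhenBucketIsCrowded(int tableLength) {
        if (tableLength < MIN_TREEIFY_CAPACITY) {
            return "resize";   // double the table instead of building a tree
        }
        return "treeify";      // convert the bucket's linked list into a red-black tree
    }

    public static void main(String[] args) {
        System.out.println(actionWhenBucketIsCrowded(16)); // "resize"  - what the OP saw on the 9th put
        System.out.println(actionWhenBucketIsCrowded(64)); // "treeify" - only once the table has 64 buckets
    }
}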

Eugene

Read the source code of HashMap,

/**
 * The smallest table capacity for which bins may be treeified.
 * (Otherwise the table is resized if too many nodes in a bin.)
 * Should be at least 4 * TREEIFY_THRESHOLD to avoid conflicts
 * between resizing and treeification thresholds.
 */
static final int MIN_TREEIFY_CAPACITY = 64;

and you will see

  1. If the capacity has not yet reached MIN_TREEIFY_CAPACITY (64) and the nodes in a single bucket reach TREEIFY_THRESHOLD, the map will resize.
  2. If the capacity has reached MIN_TREEIFY_CAPACITY (64) and the nodes in a single bucket reach TREEIFY_THRESHOLD, the map will treeify the nodes in that bucket (a.k.a. bins in the source code).

Resize and treeify are the two operations that can reorganize the map, and choosing between them based on the current capacity is a tradeoff.

Kaka Diego

The simple mathematical formula for the current load of a HashMap, given the capacity of the bucket array (b) and the number of entries (n), is n/b; the table is resized once this exceeds the configured load factor.
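As an illustrative sketch of that formula only (the class and the helper willResizeOnNextPut are hypothetical, not JDK code, and they model just the documented threshold, not the crowded-bucket resize discussed in the other answers):

class LoadCheck {

    // True if putting one more new key would push the map past its threshold
    // of capacity * loadFactor, triggering a resize.
    static boolean willResizeOnNextPut(int entries, int capacity, float loadFactor) {
        return (entries + 1) > capacity * loadFactor;
    }

    public static void main(String[] args) {
        System.out.println(willResizeOnNextPut(12, 16, 0.75f)); // true  -> the 13th put resizes
        System.out.println(willResizeOnNextPut(11, 16, 0.75f)); // false -> the 12th put does not
    }
}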

Procrastinator