
I am trying to understand how HashMap is implemented in Java. I decided that I would try to understand every line (of code and comments) of that class, and obviously I hit resistance very soon. The following snippet is from the HashMap class and discusses the Poisson distribution:

 Ideally, under random hashCodes, the frequency of
 nodes in bins follows a Poisson distribution
 (http://en.wikipedia.org/wiki/Poisson_distribution) with a
 parameter of about 0.5 on average for the default resizing
 threshold of 0.75, although with a large variance because of
 resizing granularity. Ignoring variance, the expected
 occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
 factorial(k)). The first values are:
 0:    0.60653066
 1:    0.30326533
 2:    0.07581633
 3:    0.01263606
 4:    0.00157952
 5:    0.00015795
 6:    0.00001316
 7:    0.00000094
 8:    0.00000006
 more: less than 1 in ten million

I am only average at math and first had to understand what a Poisson distribution is. Thanks to the simple video that explained it to me.

Now, even after understanding how to calculate a probability using the Poisson distribution, I still can't understand what is described above.

Can someone please explain this in simpler language and with an example if possible? It will make my task much more interesting.

aioobe
Asif

2 Answers


A HashMap is organized as an array of "buckets" based on the hashCode of the elements being inserted. Each bucket is (by default) a linked list of elements. Each bucket would have very few elements (ideally, at most one) so that finding a particular element requires very little searching down a linked list.

To take a simple example, let's say we have a HashMap with a capacity of 4 and a load factor of 0.75 (the default), which means that it can hold up to 3 elements before being resized. An ideal distribution of elements into buckets would look something like this:

bucket | elements
-------+---------
     0 | Z
     1 | X
     2 |
     3 | Y

so any element can be found immediately without any searching within a bucket. On the other hand, a very poor distribution of elements would look like this:

bucket | elements
-------+---------
     0 | 
     1 | Z -> X -> Y
     2 |
     3 |

This will occur if all of the elements happen to hash into the same bucket, so searching for element Y will require traversing down the linked list.

This might not seem like a big deal, but if you have a HashMap with a capacity of 10,000 elements and there are 7,500 elements in a single bucket on a linked list, searching for a particular element will degrade to linear search time -- which is what using a HashMap is trying to avoid.

One issue is that the hashCode for distributing elements into buckets is determined by the objects themselves, and objects' hashCode implementations aren't always very good. If the hashCode isn't very good, then elements can bunch up in certain buckets, and the HashMap will begin to perform poorly.
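As a rough illustration (this toy class is made up for the example, not taken from the JDK), a key type whose hashCode is a constant forces every entry into the same bucket, which is exactly the bad case described above:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical key class with a deliberately terrible hashCode.
    class BadKey {
        private final String name;

        BadKey(String name) {
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }

        @Override
        public int hashCode() {
            return 42; // every key reports the same hash, so every entry lands in one bucket
        }
    }

    public class BadHashDemo {
        public static void main(String[] args) {
            Map<BadKey, Integer> map = new HashMap<>();
            for (int i = 0; i < 10_000; i++) {
                map.put(new BadKey("key" + i), i);
            }
            // Lookups still work, but each get() has to search the one
            // overloaded bucket instead of jumping straight to its element.
            System.out.println(map.get(new BadKey("key9999")));
        }
    }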

The comment from the code is talking about the likelihood of different lengths of linked lists appearing in each bucket. First, it assumes the hashCodes are randomly distributed -- which isn't always the case! -- and I think it also assumes that the number of elements in the HashMap is 50% of the number of buckets. Under these assumptions, according to that Poisson distribution, 60.6% of the buckets will be empty, 30.3% will have one element, 7.5% will have two elements, 1.2% will have three elements, and so forth.

In other words, given those (ideal) assumptions, the linked lists within each bucket will usually be very short.

In JDK 8 there is an optimization to turn a linked list into a tree above a certain threshold size, so that at least performance degrades to O(log n) instead of O(n) in the worst case. The question is, what value should be chosen as the threshold? That's what this discussion is all about. The current threshold value TREEIFY_THRESHOLD is 8. Again, under these ideal assumptions, a bucket with a linked list of length 8 will occur only 0.000006% of the time. So if we get a linked list that long, something is clearly not ideal!! It may mean, for instance, that the objects being stored have exceptionally bad hashCodes, so the HashMap has to switch from a linked list to a tree in order to avoid excessive performance degradation.
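As a concrete check (this little program is only an illustration, not part of the JDK), the figures in the quoted comment can be reproduced directly from exp(-0.5) * pow(0.5, k) / factorial(k):

    public class PoissonBinSizes {
        public static void main(String[] args) {
            double lambda = 0.5;           // assumed average number of entries per bucket
            double p = Math.exp(-lambda);  // P(bucket holds 0 entries)
            for (int k = 0; k <= 8; k++) {
                System.out.printf("%d: %.8f%n", k, p);
                p = p * lambda / (k + 1);  // next term: multiply by lambda, divide by (k+1)
            }
        }
    }

Running it prints the same table as in the quoted comment, ending with roughly 0.00000006 for a chain of length 8, which is why hitting TREEIFY_THRESHOLD is treated as a sign that the hash codes are not behaving randomly.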

The link to the source file with the comment in question is here:

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/jdk8-b119/src/share/classes/java/util/HashMap.java

Stuart Marks
  • Is having one element per bucket a good idea, as you say in the first paragraph? Or should the elements be evenly distributed among buckets? – Atul Dec 10 '13 at 05:10
  • 1
    Mostly the same thing. With a load factor of 75% (the default) if elements are evenly distributed among buckets, every nonempty buckets will have one element. I guess I should clarify that statement to say "at most one" element. – Stuart Marks Dec 10 '13 at 05:17
  • I think it would be better to have 10 buckets with 5 elements each than 50 buckets with one element each. There are two parts to finding an element in a HashMap: locating the right bucket and then finding the right element. So the total time complexity is driven by a combination of the two activities. – Atul Dec 10 '13 at 06:00
  • 3
    Finding the right bucket is an O(1) operation: it's a hash computation followed by a masking operation. Within a bucket, finding the right element is a linear search, an O(n) operation. If you want faster searches, then you want fewer elements in each bucket. – Stuart Marks Dec 10 '13 at 06:43
  • But isn't Java's `HashMap` based on open addressing rather than separate chaining? How does the JDK 8 optimization fit into open addressing? – Siyuan Ren Dec 10 '13 at 07:45
  • 1
    It's always been separate chaining as far as I know. Historically the separate chains have simply been linked lists. JDK 8 adds conversion to a tree if the list gets too long. – Stuart Marks Dec 10 '13 at 08:24
  • Stuart, man, I love your explanations. Can't thank you enough. I found an interesting comment above - that finding the right bucket is an O(1) operation. I always thought it would be O(N). I think I have some reading to do on separate chaining etc. Are there any sites that explain Java classes like, say, HashMap the way you just did? Thanks again! – Asif Dec 11 '13 at 05:33
  • 1
    Again, glad to help! Here's a "How HashMap Works" article, though it doesn't cover the conversion from lists to trees: http://javarevisited.blogspot.com/2011/02/how-hashmap-works-in-java.html . To find the bucket, following the get() code is fairly easy. It calls hash(), basically calling the key's hashCode() plus some special cases. Then at line 569 it masks off the low order bits of hash to get the index into the table. Note, `n` is always a power of 2 so `n-1` is a bitmask that gives the right range for the index. – Stuart Marks Dec 11 '13 at 19:44
  • Thanks Stuart. I read the article and then went back to the java 8 implementation. Based on what is explained about the creation of the tree I understand that the tree is created by using the compareTo method of the keys in one bucket and if the key doesn't implement Comparable then the tree is created in such a manner that it wastes twice as much time and space as we would have spent if there were no trees at all. Is the understanding correct? – Asif Dec 12 '13 at 06:44
  • 1
    The bit about TreeNode wasting 2x time and space is an edge case, where large numbers of keys have the same hashcode and the key doesn't implement Comparable. The point is that such cases are rare. In the typical use of TreeNode it consumes more space but is faster at finding keys than the linked list. In fact even "typical" use of TreeNode should already be quite rare, per the discussion above. So the "wasting 2x time and space" is an edge case of an edge case. – Stuart Marks Dec 12 '13 at 19:14
  • Thanks. So my understanding about compareTo being used to create the tree in the bucket is right, because that's all one can use when the hash codes are the same inside the bucket. Correct? I know that to some this must be straightforward to understand from the comments, but that's exactly my issue - I can't seem to understand the comments very clearly. – Asif Dec 16 '13 at 17:50
  • 2
    Regarding compareTo, not quite. It's used only if hash codes are equal. Keys with different hash codes might end up in the same bucket, in which case the tree is ordered by hash code. Then, if the hash codes are tied, *and if the key is Comparable*, compareTo is used. If the hash codes are tied and the key is not Comparable, I'm not sure what it does. It might have to check all of the keys that have equal hash codes for equality using equals(). BTW the comments aren't straightforward. You have to know a lot already before you can understand everything they're saying. – Stuart Marks Dec 18 '13 at 02:29

The accepted answer is great, but I just wanted to fill in why it is reasonable to use a Poisson distribution in particular, since I had exactly the same question when reading that piece of code.

In the case that we have a fixed number of items k being inserted into a fixed number of buckets n, the number of items in a fixed bucket should follow a Binomial Distribution with k trials and probability of success 1 / n. This is pretty easy to see: if the hash is random, then each item is put into our bucket with probability 1 / n, and there are k items.

When k is large and the mean of the Binomial Distribution is small, a good approximation is a Poisson Distribution with the same mean. In this case the mean is k / n, the load factor of the hash table. Taking 0.5 for the mean is reasonable because the table tolerates a load factor of at most 0.75 before resizing, so the table will be used a great deal with a load factor of around 0.5.
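To see how close that approximation is, here is a small sketch (the 1024 buckets and 512 items below are just assumed numbers that give a load factor of 0.5, not anything taken from the JDK) comparing the exact binomial probabilities with the Poisson values used in the HashMap comment:

    public class BinomialVsPoisson {
        public static void main(String[] args) {
            int buckets = 1024;          // assumed table size
            int items = 512;             // assumed item count, giving a load of 0.5
            double p = 1.0 / buckets;
            double mean = items * p;     // 0.5

            for (int k = 0; k <= 8; k++) {
                System.out.printf("%d: binomial=%.8f  poisson=%.8f%n",
                        k, binomialPmf(items, k, p), poissonPmf(mean, k));
            }
        }

        // P(X = k) for X ~ Binomial(trials, p), computed in log space for stability.
        static double binomialPmf(int trials, int k, double p) {
            double logCoeff = 0;
            for (int i = 0; i < k; i++) {
                logCoeff += Math.log(trials - i) - Math.log(i + 1);
            }
            return Math.exp(logCoeff + k * Math.log(p) + (trials - k) * Math.log(1 - p));
        }

        // P(X = k) for X ~ Poisson(mean).
        static double poissonPmf(double mean, int k) {
            double prob = Math.exp(-mean);
            for (int i = 0; i < k; i++) {
                prob = prob * mean / (i + 1);
            }
            return prob;
        }
    }

In this example the two columns agree closely, which is why the HashMap comment can work with the simpler Poisson formula.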

mrmcgreg
  • I think that taking 0.5 is based on this assumption: 0.25% of items are in the same bucket due to hash collision. What do you think? – Jason Mar 12 '19 at 08:55