25

As the title suggests this is a question about an implementation detail from HashMap#resize - that's when the inner array is doubled in size. It's a bit wordy, but I've really tried to prove that I did my best understanding this...

This happens at a point when entries in this particular bucket/bin are stored in a Linked fashion - thus having an exact order and in the context of the question this is important.

Generally the resize could be called from other places as well, but let's look at this case only.

Suppose you put these strings as keys in a HashMap (on the right there's the hashcode after HashMap#hash - that's the internal re-hashing.) Yes, these are carefully generated, not random.

 DFHXR - 11111
 YSXFJ - 01111 
 TUDDY - 11111 
 AXVUH - 01111 
 RUTWZ - 11111
 DEDUC - 01111
 WFCVW - 11111
 ZETCU - 01111
 GCVUR - 11111 

There's a simple pattern to notice here - the last 4 bits are the same for all of them - which means that when we insert 8 of these keys (there are 9 total), they will end-up in the same bucket; and on the 9-th HashMap#put, the resize will be called.

So if currently there are 8 entries (having one of the keys above) in the HashMap - it means there are 16 buckets in this map and the last 4 bits of they key decided where the entries end-up.

We put the nine-th key. At this point TREEIFY_THRESHOLD is hit and resize is called. The bins are doubled to 32 and one more bit from the keys decides where that entry will go (so, 5 bits now).

Ultimately this piece of code is reached (when resize happens):

 Node<K,V> loHead = null, loTail = null;
 Node<K,V> hiHead = null, hiTail = null;
 Node<K,V> next;
 do {
     next = e.next;
     if ((e.hash & oldCap) == 0) {
          if (loTail == null)
               loHead = e;
          else
               loTail.next = e;
          loTail = e;
     }
     else {
        if (hiTail == null)
            hiHead = e;
        else
            hiTail.next = e;
        hiTail = e;
     }
 } while ((e = next) != null);



 if (loTail != null) {
     loTail.next = null;
     newTab[j] = loHead;
 }
 if (hiTail != null) {
     hiTail.next = null;
     newTab[j + oldCap] = hiHead;
 }

It's actually not that complicated... what it does it splits the current bin into entries that will move to other bins and to entries that will not move to other bins - but will remain into this one for sure.

And it's actually pretty smart how it does that - it's via this piece of code:

 if ((e.hash & oldCap) == 0) 

What this does is check if the next bit (the 5-th in our case) is actually zero - if it is, it means that this entry will stay where it is; if it's not it will move with a power of two offset in the new bin.

And now finally the question: that piece of code in the resize is carefully made so that it preserves the order of the entries there was in that bin.

So after you put those 9 keys in the HashMap the order is going to be :

DFHXR -> TUDDY -> RUTWZ -> WFCVW -> GCVUR (one bin)

YSXFJ -> AXVUH -> DEDUC -> ZETCU (another bin)

Why would you want to preserve order of some entries in the HashMap. Order in a Map is really bad as detailed here or here.

Eugene
  • 117,005
  • 15
  • 201
  • 306
  • 2
    hey, the comment already said that: "*a power of two offset in the new table*". – holi-java Jul 30 '17 at 20:55
  • @holi-java but the order is preserved. Its not about what bucket does it move to, they preserve the order on purpose with that piece of code. – Eugene Jul 30 '17 at 21:01
  • @Eugene: Great question, I'm curiously awaiting the answers. :) I don't quite agree with your tags, though. Unless this is _specific_ to Java 8 or 9 I would remove those tags. Also, why not use the `java` tag? – Nicolai Parlog Jul 31 '17 at 07:45
  • 2
    @Nicolai to be honest the java-8 and java-9 tags are here just because I *really* hope to get the attention of some particular users that might know the answer... – Eugene Jul 31 '17 at 07:46
  • 2
    @Nicolai and also this *is* specific to java-8 and onwards for `HashMap` – Eugene Jul 31 '17 at 07:47
  • Yeah, I thought it might be 8-specific. Not 9, though, right? In that case that should be removed. Still, ingenious tactic. ;) If you're on Twitter, you can reach out to those particular users that way. – Nicolai Parlog Jul 31 '17 at 07:51
  • @Nicolai still specific to 9 - meaning it has not changed in 9 - so I think it still stands. twitter is one great advice - thank you – Eugene Jul 31 '17 at 07:52
  • There should be a possibility to call Andy Turner somehow to the question :p – xenteros Jul 31 '17 at 08:09
  • 1
    @Eugene By that logic every question that discusses something that was not removed in Java 9 could be tagged with `java-9`. This would make the tag useless as a marker for features introduced in Java 9. If you still disagree, we might want to discuss this [in the chat](https://chat.stackoverflow.com/rooms/139/java). – Nicolai Parlog Jul 31 '17 at 09:45
  • 1
    @Nicolai yeah, that makes sense probably. Ill edit – Eugene Jul 31 '17 at 09:53
  • Um... if I understood the question correctly, then the answer **might** be dead simple: `LinkedHashMap extends HashMap`. In order to keep the order, the order has to be kept. I didn't check the implementation in detail, so this is not an answer, but am pretty sure that the `LinkedHashMap` relies on this behavior of its base class. – Marco13 Jul 31 '17 at 11:43
  • @Marco13 that's an interesting idea.. but in case of `TreeNodes` the order would still somehow needed to be preserved than. I'll have a look at that – Eugene Jul 31 '17 at 11:46
  • @Marco13 `LinkedHashMap` preserves the order by implementing a doubly linked list by additional references in its `Entry`-class: `Entry before, after;` – Hulk Jul 31 '17 at 11:49
  • @Hulk That's basically true. Again, I didn't review the code (it's a bit complicated, and IIRC there have been some significant changes between Java7 and Java8). But I could imagine that at some point after a rehash, one of the hook methods may be called in `LinkedHashMap` carrying a comment like `/* We know they are in the right order*/` - but again, until now, that's **only a guess** – Marco13 Jul 31 '17 at 11:58
  • @Marco13 Agreed - that's quite possible. It might also be a leftover from some intermediate version hinted at by some comments in LinkedHashMap: "A previous version of this class was internally structured a little differently. Because superclass HashMap now uses trees for some of its nodes, class LinkedHashMap.Entry is now treated as intermediary node class that can also be converted to tree form." – Hulk Jul 31 '17 at 13:14
  • 1
    Actually, I don’t see much effort in preserving the order. The code happens to preserve it, but it’s also straight-forward. Letting explicit shuffling aside, I don’t see much room to do it differently. The only irritating thing is that *comment* suggesting that order preservation is intended. Since the resulting order is entirely different to Java 7, this can’t be any compatibility issue. Also, there is no connection to `LinkedHashMap`, which always used its own links, independent from the `next` reference. – Holger Jul 31 '17 at 15:41
  • 1
    Interestingly, in [this method](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/HashMap.java#2120) there is also a comment regarding “preserving order”, but *which order*? The order imposed by the `next` references has no actual meaning after tree node insertions… – Holger Jul 31 '17 at 16:58
  • @Holger They could have done something along the lines... `Entry entry = newTab[e.hash & (n-1)]` and then `if(entry == null) entry = e else entry.next = e` and so on; but that would *still* preserver order, so I guess you are right – Eugene Jul 31 '17 at 20:29
  • @Holger I wonder if this has been measured to be the best way for mapping (both space and time); instead of doing the `newTab[e.hash & (n-1)]` all the time... – Eugene Jul 31 '17 at 20:36
  • 1
    @Eugene: it should be no question that accessing a local variable is cheaper than accessing an array element, especially in cases where the optimizer is not capable of determining that `newTab[e.hash & (n-1)]` is always within the array bounds. – Holger Aug 01 '17 at 09:19
  • @Holger good point, so this basically means that it's just the way ti should be done right? At least this is my impression so far... also to me that counts as an answer - taking into consideration that `LinkedHashMap` is not involved here – Eugene Aug 01 '17 at 10:37
  • 1
    @Eugene: that applies to the code. But we still don’t know why the comment is referring to the order retaining behavior… – Holger Aug 01 '17 at 12:40
  • @Eugene ... is the resulting structure the same that you would get if you created a new HashMap initialized to the final size? Growing a collection is a fairly common and somewhat expensive operation ... it's not surprising that somebody would have found this as a way to optimize the process. It may not be so much that the code is intentionally trying to preserve the order as it is a shortcut to get to what they know would be the end state? – Steve Aug 01 '17 at 15:32
  • @steve well yes - we have already established this in the comments... the only annoying thing is that comment `preserver order` there – Eugene Aug 02 '17 at 11:12
  • @eugene, @Nicolai. I would leave the Java-8, 9? tags in. The implementation of `HashMap` has evolved in Java and while some features have remained static I think to get clear answers you need to pick a single implementation. – Persixty Aug 02 '17 at 12:53
  • Maybe it's some micro-optimization for the case where sequentially allocated, colliding entries also get linked sequentially? i.e. the degenerate case where everything goes into a single bucket. – the8472 Jan 31 '18 at 18:55
  • @the8472 I wish I could tell that I understood this comment... care to explain a bit more? thank you – Eugene Jan 31 '18 at 20:16
  • If you allocate say 10 objects at once in the young generation they're all laid out in memory sequentially and will get moved around like that by the garbage collectors too. Now assume that they have a bad hashcode implementation (they just return a constant). Which means they all go into the same linked list or tree bins in the hash map. Now iterating through that list/tree will iterate them in memory order, which is something CPUs optimize for (cache line prefetch). It preserves locality of reference. So it makes the bad case slightly less bad. Just speculating of course. – the8472 Jan 31 '18 at 20:29
  • @the8472 I like the locality argument! It would make sense if the locality of data would take place, but locality of references... I can't tell how that would help, still might be the first argument that makes sense here – Eugene Feb 04 '18 at 21:00
  • I think the possible reason to do this is to solve the infinite loop problem as mentioned at the following link: https://javabypatel.blogspot.com/2016/01/infinite-loop-in-hashmap.html. As mentioned here, the infinite loop problem occurs due to to the reversal of the order of nodes, so by preserving the order, this problem can be avoided. – abhijeet pathak Dec 06 '19 at 13:13
  • 1
    It’s really baffling, the answer was there, all the time, just at a different place in the source code. I already used it in [an answer to a related question](https://stackoverflow.com/a/53659823/2711488) and had to be pointed at it later-on, to realize it. And yes, locality is among the considerations. – Holger Apr 20 '20 at 16:50
  • @Holger you _need_ to make this an answer here, please! – Eugene Apr 20 '20 at 17:42

3 Answers3

3

The design consideration has been documented within the same source file, in a code comment in line 211

* When bin lists are treeified, split, or untreeified, we keep 
* them in the same relative access/traversal order (i.e., field 
* Node.next) to better preserve locality, and to slightly 
* simplify handling of splits and traversals that invoke 
* iterator.remove. When using comparators on insertion, to keep a 
* total ordering (or as close as is required here) across 
* rebalancings, we compare classes and identityHashCodes as 
* tie-breakers. 

Since removing mappings via an iterator can’t trigger a resize, the reasons to retain the order specifically in resize are “to better preserve locality, and to slightly simplify handling of splits”, as well as being consistent regarding the policy.

Holger
  • 285,553
  • 42
  • 434
  • 765
1

Order in a Map is really bad [...]

It's not bad, it's (in academic terminology) whatever. What Stuart Marks wrote at the first link you posted:

[...] preserve flexibility for future implementation changes [...]

Which means (as I understand it) that now the implementation happens to keep the order, but in the future if a better implementation is found, it will be used either it keeps the order or not.

MaanooAk
  • 2,418
  • 17
  • 28
  • 1
    please read the comments as this does not answer the question IMO – Eugene Aug 02 '17 at 11:13
  • Do you mean that there is a better impl, therefore my answer is wrong? (or are you referring to other comments, there are a lot of comments...) – MaanooAk Aug 02 '17 at 11:48
  • it's *not wrong* - anyone is free to up vote it for example; but I don't find it answers my question – Eugene Aug 02 '17 at 11:49
  • Ok no problem, but you think that does not answer the question because there is better impl, correct? (so I understand you) (EDIT: I may sound aggressive, but I'm not, just to be clear) – MaanooAk Aug 02 '17 at 11:51
  • Edited based on @Eugene observation – MaanooAk Aug 02 '17 at 14:21
1

There are two common reasons for maintaining order in bins implemented as a linked list:

One is that you maintain order by increasing (or decreasing) hash-value. That means when searching a bin you can stop as soon as the current item is greater (or less, as applicable) than the hash being searched for.

Another approach involves moving entries to the front (or nearer the front) of the bucket when accessed or just adding them to the front. That suits situations where the probability of an element being accessed is high if it has just been accessed.

I've looked at the source for JDK-8 and it appears to be (at least for the most part) doing the later passive version of the later (add to front):

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java

While it's true that you should never rely on iteration order from containers that don't guarantee it, that doesn't mean that it can't be exploited for performance if it's structural. Also notice that the implementation of a class is in a privilege position to exploit details of its implementation in a formal way that a user of that class should not.

If you look at the source and understand how its implemented and exploit it, you're taking a risk. If the implementer does it, that's a different matter!

Note: I have an implementation of an algorithm that relies heavily on a hash-table called Hashlife. That uses this model, have a hash-table that's a power of two because (a) you can get the entry by bit-masking (& mask) rather than a division and (b) rehashing is simplified because you only every 'unzip' hash-bins.

Benchmarking shows that algorithm gaining around 20% by actively moving patterns to the front of their bin when accessed.

The algorithm pretty much exploits repeating structures in cellular automata, which are common so if you've seen a pattern the chances of seeing it again are high.

Persixty
  • 8,165
  • 2
  • 13
  • 35
  • I don't get this `That means when searching a bucket you can stop as soon as the current item is greater`; internally what is being done is actually `(n - 1) & hash` or a *negative safe modulo* for power-of-two-bins. If I'm missing something plz provide an example of that actually means. – Eugene Aug 02 '17 at 12:31
  • Also this `That suits situations where the probability of an element being accessed is high if it has just been accessed` do you mean the if you just added `EntryA` and read it afterwards - it will make the reading faster? How come? it will the be *last* in the linked bucket – Eugene Aug 02 '17 at 12:33
  • of course this also : `rehashing is simplified because you only every 'unzip' hash-bins` makes very little sense to me; what re-hashing? The Maps rehashing is actually XOR-ing the first 16 bits and last 16 bits; or is it about other re-hashing? And the last: we have already established that keeping the order via that code or doing `newTab[(n - 1) & hash]` would do the same thing - still keep the order. But *why* explicitly mention that... – Eugene Aug 02 '17 at 12:38
  • Ans 1: Sorry, I've mixed terminology. I normally call the collection of entries with (entry.hash%table.size) equal a bucket. So if you're look for an entry e. You go to the bucket with index e.hash%table.size and there you find a (possibly empty) bin (or bucket). If that is sorted in hash order you then search it linearly until the current item c has a hash>e.hash in which case you know that e isn't present. It makes failing to find e quick. As a bonus you've also stopped searching exactly where you should insert e! – Persixty Aug 02 '17 at 12:44
  • Ans 2: You add new items to the front (head) of the list. So if you come to that bin you will find it first. Lots of applications have the statistical feature (and/or can be tuned to have the feature) that entries are more likely than other entries to be re-accessed soon. – Persixty Aug 02 '17 at 12:46
  • @eugene: Ans 3: Re-hashing is terminology again. I mean re-sizing the hash table. By 'unzipping' I mean the process you mentioned in the question. Taking one list and making it two but preserving order. It's a kind of merge unsort if you will. Imagine the two sides of a zipper as the two lists... – Persixty Aug 02 '17 at 12:48
  • @1) I get the point about `hash sorting` - thx; but this is not the way it is done in `HashMap` right now - it's only insertion order 2) new items are added to the end of the list (always the `next` in `HashMap`), so again kind of irrelevant 3) yes - that is how it happens currently - but I fail to see how your answer would actually explain the question taking into consideration that your arguments kind of don't match what actually happens right now. Don't get me wrong, but it simply does not cut it... – Eugene Aug 02 '17 at 12:59
  • @eugene. The sorting point is an aside of why (half) preserving order in hash-maps is a good idea. The source I linked generally has `tab[i] = newNode(hash, key, v, first);` which puts the new entry at the head of the bin. `first==tab[i]` at this point. One of the few locations where it doesn't is in de-serialization which (in fact) recovers the order written to the stream so it itself order preserving. See line 1201. Though I actually think the implementation is bugged and in some places trying to preserve order and others fails. – Persixty Aug 02 '17 at 13:05
  • @Perixty `tab` is the actual `table` - the array if you want in that code, i is actually `i = (n - 1) & hash]` so it is actually the index in the array. And this `tab[i] = newNode(hash, key, v, first);` happens ONLY when the bin is empty - all the next elements are put LAST: `p.next = newNode(hash, key, value, null);` – Eugene Aug 02 '17 at 13:13
  • @Eugene. No it doesn't. `first = (first = tab[i = (n - 1) & hash])` on 1172 and if it gets 1201 (look at `newNode()`) will add the new node at the head of the chain. The implementation of `HashMap` is (to be fair) a mess. I think because the threshold shift to using a bin tree seems to have been shoe-horned in. But yes in other places it does add them at the end. I can understand de-serializing but the others look muddled. it doesn't matter so much with the 'tree scaling' but I think that's what is meant to do... – Persixty Aug 02 '17 at 13:20
  • let's put that another way... (besides the fact that you actually linked code from *other* methods). Say I have these keys `YSXFJ` and `AXVUH`. They will end up in the same bucket (you can test this). I first insert `YSXFJ` and then insert `AXVUH`. What would be the order when I print this `Map`? – Eugene Aug 02 '17 at 13:32
  • @Eugene. I know it doesn't always do that. That's why I said most places. It's not clear to me whether the order preservation is still an intended feature or an artifact of the era before it threshold scaled into turning 'big' bins into trees and now considered immaterial. – Persixty Aug 02 '17 at 14:21
  • So the solution is to submit the code where the entry is added in the head of the list (faster, follows spec) in order to change the HashMap at some point in the future (Java 13) and your question will disappear forever ;) – MaanooAk Aug 02 '17 at 14:26
  • @MaanooAk Maybe. It depends whether anyone comes back to the implementation of `HashMap` as I say whether the historical features are considered worth recovering. The point I wanted to make about hash-maps in general is to show how users shouldn't exploit there structure but implementers can (should?) and that's the power of 'information hiding'. That (to me) is the interesting point. On day someone might decide the best implementation is a hash-hierarchy with sub-tables of higher order hash-bits. And the specific question is moot! – Persixty Aug 02 '17 at 14:33