
I have a hard time understanding an implementation detail of Java 9's `ImmutableCollections.SetN`; specifically, why the internal array needs to be twice the number of elements.

Suppose you do this:

Set.of(1,2,3,4) // 4 elements, but internal array is 8

To be precise: I perfectly understand why this doubling is done in the case of a `HashMap`, where you (almost) never want the load factor to be one. A load factor below one improves search time, because entries are better dispersed across buckets.

But in the case of an immutable Set, I can't really tell why, especially given the way an index into the internal array is chosen.

Let me provide some details. First, how the index is computed:

 int idx = Math.floorMod(pe.hashCode() ^ SALT, elements.length);

`pe` is the actual value we put in the set. `SALT` is 32 random bits generated once per JVM at start-up (this is where the randomization comes from). `elements.length` for our example is 8 (4 elements, but the array is double the size).

This expression is a negative-safe modulo operation. Notice that `HashMap` does the logically equivalent thing (`(n - 1) & hash`) when choosing a bucket.

So if `elements.length` is 8 in our case, this expression will return any non-negative value less than 8 (0, 1, 2, 3, 4, 5, 6, 7).
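As a quick check of that negative-safe behavior, here is a small standalone snippet (my own illustration, not JDK code) showing why `Math.floorMod` is used instead of the plain `%` operator: `hashCode() ^ SALT` can be negative, and `%` would then produce a negative, unusable index.

```java
// Illustration: Math.floorMod vs % for a negative hash value.
public class FloorModDemo {
    public static void main(String[] args) {
        int hash = -7;          // a hypothetical negative (hashCode() ^ SALT) value
        int length = 8;         // table size

        System.out.println(hash % length);                // -7: unusable as an index
        System.out.println(Math.floorMod(hash, length));  // 1: always in [0, length)

        // For power-of-two lengths this matches HashMap's bit trick:
        System.out.println(hash & (length - 1));          // 1
    }
}
```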

Now the rest of the method:

    while (true) {
        E ee = elements[idx];
        if (ee == null) {
            return -idx - 1;
        } else if (pe.equals(ee)) {
            return idx;
        } else if (++idx == elements.length) {
            idx = 0;
        }
    }

Let's break it down:

if (ee == null) {
    return -idx - 1;

This is good: it means that the current slot in the array is empty, so we can put our value there.

} else if (pe.equals(ee)) {
    return idx;

This is bad: the slot is occupied and the entry already in place is equal to the one we want to put. Sets can't contain duplicate elements, so an exception is thrown later.

} else if (++idx == elements.length) {
    idx = 0;
}

This means that the slot is occupied (a hash collision), but the elements are not equal. In a `HashMap` this entry would be put into the same bucket as a linked-list node or tree node, but that is not the case here.

So the index is incremented and the next position is tried (with the small caveat that it wraps around to the beginning when it reaches the last position).
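To see the whole probe in action, here is a minimal, self-contained sketch (my own simplification over a plain `String[]`, not the JDK source): negative return means a free slot was found (encoded as `-idx - 1`), non-negative return means the element is already present.

```java
// Simplified, hypothetical re-implementation of the probe logic above.
public class ProbeDemo {
    static int probe(String[] elements, String pe) {
        int idx = Math.floorMod(pe.hashCode(), elements.length); // no SALT in this sketch
        while (true) {
            String ee = elements[idx];
            if (ee == null) {
                return -idx - 1;          // free slot: element can go here
            } else if (pe.equals(ee)) {
                return idx;               // element already present
            } else if (++idx == elements.length) {
                idx = 0;                  // wrap around (circular probing)
            }
        }
    }

    public static void main(String[] args) {
        String[] table = new String[8];   // 4 elements, table of size 8
        for (String s : new String[] {"a", "b", "c", "d"}) {
            int i = probe(table, s);
            table[-i - 1] = s;            // i is negative here: slot was free
        }
        System.out.println(probe(table, "a") >= 0);  // true: "a" is present
        System.out.println(probe(table, "z") < 0);   // true: "z" is absent
    }
}
```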

And here is the question: if nothing too fancy (unless I'm missing something) is done when searching for the index, why does the array need to be twice as big? Or, why was the function not written like this:

int idx = Math.floorMod(pe.hashCode() ^ SALT, input.length);

// notice the diff elements.length (8) and not input.length (4)
Eugene
  • @GhostCat https://github.com/netroby/jdk9-dev/blob/master/jdk/src/java.base/share/classes/java/util/ImmutableCollections.java#L446-L547 (The `probe` method) – Vince Jul 27 '17 at 15:11

1 Answer


The current implementation of SetN is a fairly simple closed hashing scheme, as opposed to the separate chaining approach used by HashMap. ("Closed hashing" is also confusingly known as "open addressing".) In a closed hashing scheme, elements are stored in the table itself, instead of being stored in a list or tree of elements that are linked from each table slot, which is separate chaining.

This implies that if two different elements hash to the same table slot, this collision needs to be resolved by finding another slot for one of the elements. The current SetN implementation resolves this using linear probing, where the table slots are checked sequentially (wrapping around at the end) until an open slot is found.

If you want to store N elements, they'll certainly fit into a table of size N. You can always find any element that's in the set, though you might have to probe several (or many) successive table slots to find it, because there will be lots of collisions. But if the set is probed for an object that's not a member, linear probing will have to check every table slot before it can determine that object isn't a member. With a full table, most probe operations will degrade to O(N) time, whereas the goal of most hash-based approaches is for operations to be O(1) time.
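To make the degenerate case concrete, here is a tiny sketch (not from the JDK) of a completely full linear-probing table: a lookup for an absent key must examine every slot. Note that without the `probes < n` guard the loop would never terminate, since there is no empty slot to stop at.

```java
// Illustration: in a completely full table, a miss costs N probes.
public class FullTableDemo {
    public static void main(String[] args) {
        int n = 1000;
        int[] table = new int[n];
        for (int i = 0; i < n; i++) table[i] = i + 1;  // every slot occupied (0 = empty)

        int absent = -42;                  // a key not in the table
        int idx = Math.floorMod(Integer.hashCode(absent), n);
        int probes = 0;
        // The probes < n guard stops the scan; a real probe loop relies on
        // hitting an empty slot, which a full table never provides.
        while (probes < n && table[idx] != 0 && table[idx] != absent) {
            probes++;
            if (++idx == n) idx = 0;
        }
        System.out.println(probes);        // 1000: the whole table was scanned
    }
}
```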

Thus we have a classic space-time tradeoff. If we make the table larger, there will be empty slots sprinkled throughout the table. When storing items, there should be fewer collisions, and linear probing will find empty slots more quickly. The clusters of full slots next to each other will be smaller. Probes for non-members will proceed more quickly, since they're more likely to encounter an empty slot sooner while probing linearly -- possibly after not having to reprobe at all.

In bringing up the implementation, we ran a bunch of benchmarks using different expansion factors. (I used the term EXPAND_FACTOR in the code whereas most literature uses load factor. The reason is that the expansion factor is the reciprocal of the load factor, as used in HashMap, and using "load factor" for both meanings would be confusing.) When the expansion factor was near 1.0, the probe performance was quite slow, as expected. It improved considerably as the expansion factor was increased. The improvement was really flattening out by the time it got up to 3.0 or 4.0. We chose 2.0 since it got most of the performance improvement (close to O(1) time) while providing good space savings compared to HashSet. (Sorry, we haven't published these benchmark numbers anywhere.)
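The effect of the expansion factor can be illustrated with a small experiment (my own sketch, not the benchmark mentioned above): fill a linear-probing table with random keys at a given expansion factor, then measure how many slots an unsuccessful lookup examines on average before hitting an empty slot.

```java
import java.util.Random;

// Hypothetical experiment: average unsuccessful-lookup cost vs expansion factor.
public class ExpansionFactorDemo {
    static double avgMissProbes(double expand, int n, long seed) {
        Random rnd = new Random(seed);
        int len = (int) (n * expand);            // table length = n * expansion factor
        long[] table = new long[len];            // 0 marks an empty slot
        for (int i = 0; i < n; i++) {
            long key = rnd.nextLong() | 1L;      // avoid the 0 sentinel
            int idx = Math.floorMod(Long.hashCode(key), len);
            while (table[idx] != 0) {            // linear probing with wrap-around
                if (++idx == len) idx = 0;
            }
            table[idx] = key;
        }
        // Count slots examined until the first empty one, i.e. the cost
        // of an unsuccessful lookup starting from a random hash position.
        long probes = 0;
        int misses = 10_000;
        for (int i = 0; i < misses; i++) {
            long key = rnd.nextLong() | 1L;      // almost surely absent
            int idx = Math.floorMod(Long.hashCode(key), len);
            while (table[idx] != 0) {
                probes++;
                if (++idx == len) idx = 0;
            }
            probes++;                            // the empty slot that ends the probe
        }
        return (double) probes / misses;
    }

    public static void main(String[] args) {
        for (double expand : new double[] {1.25, 2.0, 4.0}) {
            System.out.printf("expand=%.2f  avg probes per miss=%.2f%n",
                              expand, avgMissProbes(expand, 10_000, 42));
        }
    }
}
```

For linear probing, the classical analysis predicts an unsuccessful-search cost of roughly (1 + 1/(1-α)²)/2 at load factor α, so going from an expansion factor of 1.25 (α = 0.8) to 2.0 (α = 0.5) cuts the average miss cost from roughly 13 probes to roughly 2.5, with diminishing returns beyond that.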

Of course, all of these are implementation specifics and may change from one release to the next, as we find better ways to optimize the system. I'm certain there are ways to improve the current implementation. (And fortunately we don't have to worry about preserving iteration order when we do this.)

A good discussion of open addressing and performance tradeoffs with load factors can be found in section 3.4 of

Sedgewick, Robert and Kevin Wayne. Algorithms, Fourth Edition. Addison-Wesley, 2011.

The online book site is here but note that the print edition has much more detail.

Stuart Marks
  • would it make sense to add a simple comment in the code about the algorithm used? That single line could have saved me some time while I was digging into the *why*s. BTW `HashMap` has a bunch of these comments under implementation details... – Eugene Jul 28 '17 at 09:16
  • when you said search I immediately made a grumpy face. I obviously rushed with the question and did not think about other aspects. Thank you – Eugene Jul 28 '17 at 09:16
  • While we are at it… is it intentional that these immutable collections have no optimized `forEach(…)` or `spliterator()` method? – Holger Jul 28 '17 at 16:19
  • @Stuart Are you happy with the linear probing performance and the memory vs. time trade-off so far? I guess that for immutable sets with a few elements linear probing doesn't hurt at all, but what do your tests show for immutable sets of, e.g., 1k, 10k and 100k elements? Are you expecting these immutable collections to be used with such large numbers of elements? If yes, then maybe a dynamic probe scheme might be of help... – fps Jul 31 '17 at 20:51
  • @FedericoPeraltaSchaffner I think the current performance and space tradeoffs are satisfactory, but they certainly can be improved. The design center is a small number of elements, but performance scales as expected even with large numbers of elements, still O(1), though slower than `HashMap`. Better probing would be nice, but I'm more concerned about bad performance in the presence of bad `hashCode` implementations. – Stuart Marks Aug 01 '17 at 04:17
    @FedericoPeraltaSchaffner I think the current performance and space tradeoffs are satisfactory, but they certainly can be improved. The design center is a small number of elements but performance scales expectedly even with large numbers of elements, still O(1), though slower than `HashMap`. Better probing would be nice, but I'm more concerned about bad performance in the presence of bad `hashCode` implementations. – Stuart Marks Aug 01 '17 at 04:17