10

Code speaks more than words, so:

final int size = 100;
Map<Integer, String> m = new HashMap<>(size);
for (int i = 0; i < size; i++) m.put(i, String.valueOf(i));

Why is the HashMap internally calling resize() 2 times! (Credit to Andreas for identifying that the JVM uses HashMaps internally: of the 21 resize() calls I originally counted, 19 came from other HashMaps created during JVM startup.)

Two resize() calls are still not acceptable for my application. I need to optimize this.

If I am a new Java developer, my first intuitive guess at what "capacity" means in the HashMap constructor is that it is the capacity for the number of elements that I (the consumer of HashMap) am going to put into the Map. But this is not true.

If I want to optimize my usage of HashMap so that it does not need to resize itself at all, then I need to know the internals of HashMap intimately enough to know exactly how sparse the bucket array needs to be. This is strange in my opinion. HashMap should do this for you implicitly; that is the whole point of encapsulation in OOP.

Note: I have confirmed that resize() is the bottleneck for my application's use case, which is why my goal is to reduce the number of calls to resize().

The question:

If I know the exact quantity of entries I am going to put into the map beforehand, what capacity do I choose to prevent any extra resize() calls? Something like size * 10? I would also like some background on why HashMap is designed this way.

Edit: I am getting asked a lot why this optimization is necessary. My application is spending a non-trivial amount of CPU time in HashMap.resize(). The hashmaps my application uses are initialized with a capacity equal to the number of elements that we put into them. Therefore, if we can reduce the resize() calls (by choosing a better initial capacity), then my application performance is improved.

James Wierzba
  • 16,176
  • 14
  • 79
  • 120
  • 1
    That `HashMap` should only have to resize once, from an initial capacity of 128, to a final capacity of 256. How have you measured the number of times it has resized during your code execution? – Bobulous Oct 05 '18 at 18:34
  • @CarlosHeuberger I compiled the source code, for java 1.8 (revision 181) – James Wierzba Oct 05 '18 at 18:40
  • maybe debugging shows what is happening... kind of strange since resizing should double the size (java 11) – user85421 Oct 05 '18 at 18:41
  • 3
    It's still not clear how you've counted twenty-one calls to `resize()` during the execution of this specific chunk of code. Just compiling the code will surely not show you how many times the loop causes the actual `HashMap` object to be resized. What means have you used to count the actual number of times a resize occurs? – Bobulous Oct 05 '18 at 18:47
  • @Bobulous You are right, ***my*** hashmap is getting resized only twice -- see my updated answer. (2 is still unacceptable for my case) – James Wierzba Oct 05 '18 at 18:55
  • 1
    What is the problem with it being called 2 times? first = first creation; second = because load factor 0.75 - if you know how many elements, change load factor to 1.0 (or divide the number of elements by the load factor to get the initial capacity) – user85421 Oct 05 '18 at 18:57
  • 2
    If you don't want it to resize twice (technically once), then either specify a larger initial size (~134), or create it with a higher load factor, which, BTW, is something you should not tweak without carefully considering the trade-offs described in its documentation. – Mark Rotteveel Oct 05 '18 at 19:00
  • @CarlosHeuberger because my applications cpu profiling shows a large percentage of time being spent in hashmap.resize(), and these hashmaps are initialized with an initial capacity that is equal to the number of elements inside of it (we know this beforehand). Therefore, to optimize this, we need to reduce the calls to resize(). Do you agree? – James Wierzba Oct 05 '18 at 19:00
  • 2
    And you don't need to know the internals intimately, you just need to read the API documentation. This behavior is part of its contract, it is not just an implementation detail. – Mark Rotteveel Oct 05 '18 at 19:01
  • 1
  • I don't have enough information to agree or not... well, kind of hard to believe that 2 times will do that much harm... – user85421 Oct 05 '18 at 19:02
  • and the first call is just part of the (postponed) creation of the map, no way to avoid that one – user85421 Oct 05 '18 at 19:09
  • Hmmm... it's not clear that setting the initial capacity to `(expected_size*4.0/3.0)+1` will make your app run faster, or spend a lower amount of CPU time. You should consider other aspects, i.e. do you add and remove elements many times? This might cause many resize calls when size is about 75 (for 100 entries)... – fps Oct 05 '18 at 19:11
  • @JamesWierzba If CPU shows large percentage of time being spent in hashmap.resize(), it's likely because the `hashCode()` methods of the keys are slow. Any chance of optimizing those (too)? Sure, `initialCapacity = expectedSize * 4/3 + 1` will eliminate the 75% full resizing, but if `hashCode()` is slow, you might be able to improve initial `put()` performance too. – Andreas Oct 05 '18 at 19:15
  • @Andreas and while not that ridiculous (we use it internally), when keys are *known* and hashCode is rather expensive - we *cache* it, on an absolutely serious note. – Eugene Oct 05 '18 at 19:18
  • @Eugene Exactly like `String` is doing, specifically because calculating hash of a long string is expensive. Of course, it requires the object to be immutable, *or* for mutations to clear the hash cache. But then again, the hash of an object must be immutable while the object is a key in a `HashMap`, so that's not really a tough restraint to live with. – Andreas Oct 05 '18 at 19:19
  • If you really care about lookup cost more than space cost then consider using a lower load factor to decrease the probability of hash collisions even more, but certainly test and measure performance impacts. – xtratic Oct 05 '18 at 19:32
  • @Andreas How have you come to the conclusion that having a slow `hashCode()` method for the keys will make the app spend a large amount of time in `resize()`? I don't see that `hashCode()` is being called from `resize()`... – fps Oct 05 '18 at 19:33
  • @FedericoPeraltaSchaffner well, that clears it, `resize` is potentially expensive when there's a `Tree` involved and entries move drastically around, and keys have hash collisions and keys don't implement `Comparable`, and... I think you get the picture – Eugene Oct 05 '18 at 19:42

6 Answers

16

The default load factor is 0.75, i.e. 3/4, which means that the internal hash table will be resized when the map is 75% full. (A requested capacity of 100 is rounded up to 128 buckets, so the resize happens when the 97th value is added.)

FYI: resize() is only called twice. Once when the first value is added, and once when it gets to 75% full.

To prevent resizing, you need to ensure that the 100th value will not cause resizing, i.e. size <= capacity * 0.75 aka size <= capacity * 3/4 aka size * 4/3 <= capacity, so to be sure:

capacity = size * 4/3 + 1

With size = 100, that means capacity = 134.
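As a quick sanity check of that arithmetic (a minimal sketch, not part of the original answer):

```java
public class CapacityCheck {
    public static void main(String[] args) {
        int size = 100;
        int capacity = size * 4 / 3 + 1; // integer math, no cast needed
        System.out.println(capacity);    // 134
        // the 100th entry must stay within the 75% threshold
        System.out.println(size <= capacity * 3 / 4); // true: 134 * 3 / 4 = 100
    }
}
```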

Andreas
  • 154,647
  • 11
  • 152
  • 247
  • Why is `resize()` called when the first item is added? Is the unmodifiable `emptyMap()` instance assumed up until that point? – Bobulous Oct 05 '18 at 18:37
  • 1
    Andreas, the resize() function _is_ being called 21 times. Now, perhaps the function is exiting before actually performing the resize, I will dig into that a little more. – James Wierzba Oct 05 '18 at 18:43
  • 3
    @Bobulous Correct. It's a memory saving optimization, since tests have shown that collection objects are often created without values ever being added, so collections are now created with minimal memory footprint, and the initial capacity isn't applied until first value is added. – Andreas Oct 05 '18 at 18:43
  • 2
    @JamesWierzba The `resize()` is only called once or twice, if you add 100 elements to a map initialized with a capacity of 100. If you create a small test program with nothing but the code in the question, beware that JVM startup creates *other* `HashMap` objects, so don't count `resize()` calls on all those other objects. Only count calls on the `m` instance. --- Tested on Java 10. What version are you running? – Andreas Oct 05 '18 at 18:45
  • @Andreas That is very interesting, I did not consider that the JVM might be using HashMap's as well. I will re-test and verify the callstack traces back up to my application. Also, my java version is 1.8 r181 – James Wierzba Oct 05 '18 at 18:47
  • 2
    @Andreas -- You are right! I confirmed that resize() was called 19 times by whatever bootstrapping the JVM is doing, and 2 times by my code :) – James Wierzba Oct 05 '18 at 18:49
  • 1
    @JamesWierzba Yeah, I ran that test, and saw lots of `resize()`, but noticed the call stack wasn't correct, so I put a *trigger* breakpoint on the first line of my code, which reduced the call count to 2. – Andreas Oct 05 '18 at 18:49
  • @Andreas two things, using that formula you get a lot more than u asked for (because power of two number of buckets), that *capacity* might mean [different things for different maps](https://stackoverflow.com/a/52671958/1059372) – Eugene Oct 05 '18 at 19:21
  • @Eugene The question and this answer is specifically about `HashMap`, not other map implementations. Also, since the question is about *performance*, an oversized hash table is preferable to an undersized one, because high load factor would increase hash bucket collisions, reducing performance. Besides, power of two is not the issue, re-hashing is. – Andreas Oct 05 '18 at 19:24
  • @Andreas high load factor meaning `> 0.75`? And btw when you say *size = 100, that means capacity = 134* - a little bit wrong, it's the next power of two from 100. Oh wow, who would downvote this? – Eugene Oct 05 '18 at 19:26
  • @Andreas "re-hashing" is not that complicated - one more bit is taken into consideration and thus some entries either stay where they are, or move to *a power of two offset in the new table* – Eugene Oct 05 '18 at 19:28
  • When I say *"With size = 100, that means capacity = 134"*, I'm talking about the parameter value given to the constructor, to ensure no re-hashing. It doesn't matter that `HashMap` internally round the value up to a power of two. Specifying `capacity = size * 4/3 + 1` *guarantees* no rehashing (assuming `size` is accurate). – Andreas Oct 05 '18 at 19:28
  • @Eugene "re-hashing" is not complicated, but can be *very expensive*, if calling `hashCode()` is expensive, since `hashCode()` is called on every key already in the map. – Andreas Oct 05 '18 at 19:29
  • 1
    @Andreas IIRC [it does not](https://stackoverflow.com/questions/45404580/hashmap-resize-method-implementation-detail), since it is stored already, so no need to re-compute it again (it's the first field in the `Node` class) – Eugene Oct 05 '18 at 19:33
  • @Eugene That was my exact question (in a [comment below the question](https://stackoverflow.com/questions/52671362/why-does-hashmap-resize-again-when-specifying-a-precise-capacity#comment92274222_52671362)). Btw I haven't downvoted either... Downvoter: care to explain? – fps Oct 05 '18 at 19:37
  • @Andreas even the sentence about *guarantees* no resize is a bit off. let's say I want 11 entries without a resize: `int initCapacity = (int) (1.0 + (long) 11 / 0.75); Map map = new HashMap<>(initCapacity);` and now let's test this for input `List list = List.of("DFHXR", "YSXFJ", "TUDDY", "AXVUH", "RUTWZ", "DEDUC", "WFCVW", "ZETCU", "GCVUR");`.... – Eugene Oct 05 '18 at 19:59
  • @Andreas nvm you said *rehashing* not *resize*; still a good example nevertheless – Eugene Oct 05 '18 at 20:05
8

When in doubt, read the documentation. The docs for HashMap explain the trade-offs of initial capacity and load factor quite well.

According to the documentation, if initCapacity = (maxEntries / loadFactor) + 1 then no rehash operations will occur upon adding entries. In this case, maxEntries is 100 as you specified, and loadFactor is the default load factor of .75.

But aside from just setting an initial size to avoid a rehash (resize()) you should carefully read the documentation of HashMap in order to tune it properly, taking both initial capacity and load factor into account.

If you care about lookup cost more than space, then try a lower load factor like .5 or lower if you like. In that case you would create your hash map with both parameters like this:

final float loadFactor = 0.5f;
final int maxEntries   = 100;
final int initCapacity = (int) (maxEntries / loadFactor) + 1;
Map<Integer, String> m = new HashMap<>(initCapacity, loadFactor);

(emphasis mine)

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal datastructures are rebuilt) so that the hash table has approximately twice the number of buckets.
...
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

xtratic
  • 4,600
  • 2
  • 14
  • 32
  • 1
    `0.75` is also `3/4`, so `(maxEntries / 0.75) + 1` is better written as `maxEntries * 4 / 3 + 1` *(no parens needed)*, to do it using `int` math *(no cast needed)*. – Andreas Oct 05 '18 at 18:48
  • 3
    @Andreas On the other hand, Java itself uses a float for the calculation. – Mark Rotteveel Oct 05 '18 at 18:57
  • 1
    @MarkRotteveel Sure, but `new HashMap<>(size * 4/3 + 1)` is shorter/simpler than `new HashMap<>((int) (size / 0.75 + 1))`, since you don't have to cast. – Andreas Oct 05 '18 at 19:09
  • @xtratic not entirely sure, but it seems you are confusing re-hash with re-size, these are different things – Eugene Oct 05 '18 at 20:03
  • @Eugene `resize()` acts as the `rehash` for Java's implementation of `HashMap` as far as I can tell. It may not calculate new hash codes but it does re-distribute the elements and in documentation and code it is treated as the rehash operation. – xtratic Oct 09 '18 at 14:28
1

This is easy to prove:

import java.lang.reflect.AccessibleObject;
import java.lang.reflect.Field;

private static <K, V> void debugResize(Map<K, V> map, K key, V value) throws Throwable {

    Field table = map.getClass().getDeclaredField("table");
    AccessibleObject.setAccessible(new Field[] { table }, true);
    Object[] nodes = (Object[]) table.get(map);

    // first put: the table is allocated lazily, so there is nothing to compare yet
    if (nodes == null) {
        map.put(key, value);
        return;
    }

    map.put(key, value);

    // re-read the (possibly replaced) table and compare lengths
    int newLength = ((Object[]) table.get(map)).length;
    if (nodes.length != newLength) {
        ++currentResizeCalls;
    }
}

And some usage:

static int currentResizeCalls = 0;

public static void main(String[] args) throws Throwable {

    int size = 100;
    Map<Integer, String> m = new HashMap<>(size);
    for (int i = 0; i < size; i++) {
        DeleteMe.debugResize(m, i, String.valueOf(i));
    }

    System.out.println(DeleteMe.currentResizeCalls);
}     

I am only counting the calls where resize() actually grows the table, because the first call just initializes it; as the documentation stipulates:

Initializes or doubles table size


The second of your points is far more interesting. A HashMap defines capacity, but what exactly is capacity? And this is not that obvious:

For HashMap, capacity is the number of buckets before a resize happens, for ConcurrentHashMap it's the number of entries before resize is performed.

Thus, to avoid internal resize calls, in the case of HashMap use the formula:

(int)(1.0 + (long)initialCapacity / LOAD_FACTOR)

But this is far from ideal; say you want 1024 entries without a resize. By using that formula you get 1367 buckets, which is internally rounded up to a power of two, thus 2048 - well, a lot more than you asked for.
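The power-of-two rounding can be illustrated with a small sketch; the method below is a copy of JDK 8's package-private HashMap.tableSizeFor (reproduced here for illustration only, it is not a public API):

```java
public class TableSize {
    // mirrors JDK 8's HashMap.tableSizeFor: smallest power of two >= cap
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(1367)); // 2048: requesting 1367 allocates 2048 buckets
        System.out.println(tableSizeFor(134));  // 256
    }
}
```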

For CHM, specify the size directly. This is easy to prove with a single modification to the previous code:

 // use CHM instead of HashMap
 Map<Integer, String> m = new ConcurrentHashMap<>(size);

This will result in zero resizes that actually double the array. But sometimes even CHM's internal code is confusing and requires a little patching.

Eugene
  • 117,005
  • 15
  • 201
  • 306
1

There are a lot of wonderful answers here. I greatly appreciate the contributions.

I have decided not to reinvent this wheel, because it appears Google has already solved this problem.

I am going to use the utility method Maps.newHashMapWithExpectedSize(int) from Google's Guava library.

James Wierzba
  • 16,176
  • 14
  • 79
  • 120
  • have you looked under the implementation of it? it's no magic `return (int) ((float) expectedSize / 0.75F + 1.0F);` – Eugene Oct 07 '18 at 20:25
  • 1
    @Eugene Depends on your definition of magic, I suppose. If the load factor or any other implementation details change, I would expect the Google library to take care of that. I won't have to. So that is another benefit – James Wierzba Oct 07 '18 at 20:30
  • 1
    of course, if you upgrade guava all the time and they keep up with the internal details all the time; but of course your choice, absolutely – Eugene Oct 07 '18 at 20:34
  • @JamesWierzba Wrong. First of all, if a developer uses someone else's code (even if it's Guava), he will be responsible for the code he has written (because he could have written it himself instead of using someone else's code). Second, even though Java has backward compatibility, the internals of `HashMap` are very different in Oracle JDK 7 and Oracle JDK 8, and I don't see that Guava provides a different version of `Maps#newHashMapWithExpectedSize` for JDK 8+. – blackr1234 Sep 26 '21 at 18:39
  • @JamesWierzba Third, the Javadoc of the method states that "This behavior cannot be broadly guaranteed, but it is observed to be true for OpenJDK 1.7. **It also can't be guaranteed that the method isn't inadvertently oversizing the returned map.**" Moreover, in Oracle JDK 8, that `+ 1.0F` is actually not necessary. If `expectedSize` is `12`, the method gives a `HashMap` with capacity of `32` but not `16`. `16 * 0.75 = 12` and only when the `13`th entry is added, `HashMap` resizes itself to capacity of `32`. – blackr1234 Sep 26 '21 at 18:39
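The oversizing described in the comment above can be checked with a quick sketch (the formula is the one quoted from Guava's implementation; the power-of-two rounding is computed inline, mirroring what HashMap does internally):

```java
public class GuavaSizing {
    public static void main(String[] args) {
        int expectedSize = 12;
        // Guava's newHashMapWithExpectedSize formula (for sizes in this range)
        int requested = (int) ((float) expectedSize / 0.75F + 1.0F); // 17
        // HashMap rounds the requested capacity up to the next power of two
        int buckets = 1 << (32 - Integer.numberOfLeadingZeros(requested - 1)); // 32
        System.out.println(requested + " -> " + buckets + " buckets");
        // note: 16 buckets (threshold 16 * 0.75 = 12) would already hold 12 entries
    }
}
```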
0
  • Resizing is an essential part of a hashmap's working to keep the load factor low.

  • Load factor needs to be low because the hashing function of the hashmap will inevitably start having collisions when the hashmap's buckets are maxing out. The collisions can start from the second entry itself if your entries hash to an occupied bucket every time.


However, in your particular case, collision is not a problem, only hashmap's resizing is.

Hashmap is generally resized at a load factor of 0.75 (= 3/4 as a fraction). Using this information, you can set up a hashmap with a capacity of 4/3 times the count of the entries you need to store.
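As a sketch, that rule could be wrapped in a small helper (hypothetical name, assuming the default 0.75 load factor):

```java
import java.util.HashMap;
import java.util.Map;

public class Presized {
    // request a capacity such that expectedEntries stay under the 0.75 threshold
    static <K, V> Map<K, V> presizedHashMap(int expectedEntries) {
        return new HashMap<>(expectedEntries * 4 / 3 + 1);
    }

    public static void main(String[] args) {
        // requests capacity 134; HashMap rounds that up to 256 buckets
        Map<Integer, String> m = presizedHashMap(100);
        for (int i = 0; i < 100; i++) m.put(i, String.valueOf(i));
        System.out.println(m.size()); // 100
    }
}
```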


Regarding your dissent with breaking encapsulation:

I agree with you. It is debatable.

You can say that it would be better if capacity represented the count of entries until which the resize does not happen, rather than the count of maximum possible entries that could be stored in the hashmap - and I tend to agree with you.

But someone else could also argue the other side as to why the hashmap is occupying more space than it was instructed to reserve.

The solution to this problem lies in Java's domain. Java can provide two constructors which are explicit enough about what they will do and then the developers can have their pick during their hashmap's initialization.

displayName
  • 13,888
  • 8
  • 60
  • 75
0

Java has many distributions and versions, so you have to test it yourself.

  • In Oracle JDK 7, HashMap gives unpredictable resizing results because it has 2 conditions for resizing to take place ((size >= threshold) && (null != table[bucketIndex])). First, size must be >= threshold (cap * load factor). Second, the current bucket must already have an entry, which implies a collision.
    • Run the code below and see for yourself.
      • When capacity is 4, resizing takes place when the 5th entry is put.
      • When capacity is 8, resizing takes place when the 9th entry is put.
      • When capacity is 16, resizing takes place when the 16th entry is put (not 17th).
      • When capacity is 32, resizing takes place when the 32nd entry is put (not 33rd).
      • When capacity is 64, resizing takes place when the 64th entry is put (not 65th).
  • In Oracle JDK 8, HashMap resizes when the size is > threshold (capacity * load factor).
    • With capacity of 16 and default load factor of 0.75, resizing (to capacity of 32) takes place when the 13th entry is put.
    • Run the code below and see for yourself.
      • When capacity is 4, resizing takes place when the 4th entry is put.
      • When capacity is 8, resizing takes place when the 7th entry is put.
      • When capacity is 16, resizing takes place when the 13th entry is put.
      • When capacity is 32, resizing takes place when the 25th entry is put.
      • When capacity is 64, resizing takes place when the 49th entry is put.
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

public class HashMapTest {

    public static void main(String[] args) {
        int cap = 4;
        int size = 64;
        Map<Integer, String> map = new HashMap<>(cap);

        for (int i=1; i<=size; i++) {
            map.put(i, i+"");
            print(map);
        }
    }

    public static void print(Map map) {
        try {
            Class<?> mapType = map.getClass();
            Method capacity = mapType.getDeclaredMethod("capacity");
            capacity.setAccessible(true);
            System.out.println("capacity : " + capacity.invoke(map) + "    size : " + map.size());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
blackr1234
  • 1,420
  • 12
  • 23