1

I have about 5-10M entries in a HashMap, and I can't change the code structure. I'm running Java with -Xms512m -Xmx1024m. What are the optimal capacity and load-factor values to pass to the HashMap constructor to avoid java.lang.OutOfMemoryError: GC overhead limit exceeded?

private final Map<String, ReportResultView> aggregatedMap = new HashMap<>(????, ????);
VitalyT
  • 2
    There isn't much you can do by tuning these parameters. You can possibly try avoiding map resizing, but that means you have to pre-size for the worst case (the largest number of entries you may need). Changing the load factor is not worthwhile. – Marko Topolnik Sep 23 '16 at 09:29
  • 1
    If you have 10M entries, the best capacity is 10M, but don’t worry. It *will* have (at least) that capacity once you have added all 10M entries; that’s unavoidable. The value you pass to the constructor is the *initial capacity*, which mainly affects initial insertion performance when it is too low, but nothing more. Once the values have been added, it has no impact anymore. – Holger Sep 23 '16 at 09:34
  • 1
    @holger It affects peak memory usage which happens during resizing. Also note the crappy API where "capacity" isn't the number of entries you can insert before resizing, but the raw size of the bucket array. You have to divide by load factor to get the size that makes sense to the user. – Marko Topolnik Sep 23 '16 at 09:44
  • 1
    @Marko Topolnik: it’s more complicated than that. If you specify 10M, you’ll actually get >16M capacity due to the rounding to the next power of two, so with any load factor >0.6 (the default is 0.75), you can put 10M entries without resize. Note that this is not the “crappy API” but the “crappy implementation” we’re talking about. – Holger Sep 23 '16 at 10:20
  • So the capacity should be 10M for my purpose? – VitalyT Sep 23 '16 at 10:24
  • 1
    It depends on your goal. If you specify 10M as initial capacity, but put only 5M entries, you’re obviously wasting space. If you specify 5M as initial capacity, but put 10M entries, you will have exactly one rehash operation in-between. So if maximum performance is your goal, you might want to avoid that rehash operation, but if having 5M-6M entries is more likely and you want to reduce average memory footprint, you might accept the one rehash operation that might occur in some scenarios. – Holger Sep 23 '16 at 10:31
  • 1
    @holger It's crappy HashMap API: with or without the additional complication, the fact remains that if you want to ensure at least N entries before resizing, you have to supply N/loadFactor for initial capacity. You may get a larger map due to implementation details (which I would _not_ call "crappy"). – Marko Topolnik Sep 23 '16 at 10:48
  • Thanks, I'll check it)) – VitalyT Sep 23 '16 at 10:49
  • 1
    @Marko Topolnik: it would be no problem to change the *implementation* to respect the initial capacity as long as the size is smaller and only use the load factor for *subsequent* resize thresholds. In fact, that would be one line of code in the current (Java 8) implementation. Since the formula “specified size / specified factor” doesn’t work today (as said, 10M already means you can insert 10M without rehash), that wouldn’t change the *API*, just an implementation detail. – Holger Sep 23 '16 at 10:59
  • @holger Well, the subsequent resize threshold is what this is all about. If you supply initial capacity 32, load factor 0.5, you'll get a resize at 16 elements. And it's not an implementation detail: "capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created" – Marko Topolnik Sep 23 '16 at 11:00
  • 1
    @Marko Topolnik: and it would be no problem to change it to “If you supply initial capacity 32, load factor whatever, you'll get the first resize at 32 elements”. Though using the *known* ratio of collisions rather than the size/table ratio would be much smarter. And, as said, the specified *initial capacity* doesn’t match the actual capacity, even today. So it wouldn’t be a problem if it doesn’t match (in a different way) tomorrow. This *is* an implementation detail. – Holger Sep 23 '16 at 11:10
  • 1
    @Holger That would break the definition of load factor: "load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased". You'd have to change the Javadoc specification, but what you suggest would be very broken behavior. Working with the actual collision rate is something to consider, although it could get messy. – Marko Topolnik Sep 23 '16 at 11:14
  • @holger The specification would have to spell out that initial capacity of C guarantees C entries without resizing. If you don't have that specified, the actual behavior that _happens_ to guarantee it doesn't matter much. – Marko Topolnik Sep 23 '16 at 11:19
  • 1
    @Marko Topolnik: see how the `concurrencyLevel` parameter of `ConcurrentHashMap` became meaningless or how the current `HashMap` depends on a correct `Comparable` implementation, if the keys do implement that interface. The API is still fulfilled in that the implementation does what it is supposed to do, but parameters which are merely performance tuning hints may become ignored or re-interpreted. Again, as said multiple times, the exact wording of the doc has already been violated in the very first implementation. – Holger Sep 23 '16 at 11:19
  • 1
    Yes, that new guarantee would be useful for new programs, but the discussion was about whether the old (“crappy”) API prevents such a change. – Holger Sep 23 '16 at 11:25
  • @holger My complaint was the useful behavior that the API _doesn't guarantee_, not behavior the API _prevents_. – Marko Topolnik Sep 23 '16 at 11:30
  • 1
    I see. I get your point, though I wouldn’t go so far as to call an API “crappy” when it doesn’t guarantee a behavior which is merely a performance detail. Especially when we don’t have a guarantee that avoiding rehashing (but accepting more collisions on the other hand) is actually the better behavior. – Holger Sep 23 '16 at 11:35
  • @Holger In the library I work on, rehashing is a huge deal because maps are huge. It's basically a stop-the-world event on the scale of tens of seconds and harbors a major threat of OOME because during resizing the bucket array temporarily occupies 150% of what it will have after resizing. The only point of allowing pre-sizing is avoiding this risk and HashMap doesn't give you the needed guarantee here. – Marko Topolnik Sep 23 '16 at 14:47
  • you could use `NonBlockingHashMap` from https://github.com/boundary/high-scale-lib to avoid the long-running resize operations. – the8472 Sep 23 '16 at 14:53
  • 1
    @Marko Topolnik: if it’s mission-critical, you might consider using a different hash map implementation, i.e. there don’t have to be rehash operations at all. – Holger Sep 23 '16 at 15:05
  • 1
    @MarkoTopolnik The HashMap spec says: "If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur." I agree it's crappy that the programmer has to do division to provide the right value to the API. But is there a case where even after providing the right value, adding that number of entries still causes a rehash? – Stuart Marks Sep 23 '16 at 15:14
  • Most of the discussion regarding "optimal capacity" above is off-base. The idea seems to be that the optimal underlying array size (aka capacity) for a `HashMap` of `N` elements is `N`, and the only real issue is cutting through the load factor cruft to get there. `HashMap` is _not_ `ArrayList`! For arrays, extra capacity is only useful for avoiding frequent resizes, and doesn't improve performance for other operations. So if you know your `List` has `N` elements and won't grow, you want an `N`-element backing array. – BeeOnRope Sep 23 '16 at 15:57
  • `HashMap`s, however, are different - the empty elements are useful for reducing the collision rate and hence the average chain size. So you generally want a larger capacity than `N`, _even ignoring the resizing issue_. So @Holger's initial comment that the ideal capacity for a 10M-element map is 10M is off base, I think. If that were the case, the default load factor would be something like 1.34 (so the load factor between resizes would vary from 0.67 to 1.34, with an average of ~1). In fact it is 0.75, since _chaining reduces access performance_. So there is a space/time tradeoff here. – BeeOnRope Sep 23 '16 at 15:59
  • There also seems to be an embedded assumption that `1` is the maximum load factor. If your only goal is to reduce space, you can use a load factor like 5. This will save 80% of the array space, and have an average chain length of ~5. You'd only have a 2M array backing 10M elements. However, as my answer below indicates, the array size is likely _not_ the problem here, and is just a red herring. In a `HashMap` of 10M elements, the backing array is too small (and GC-friendly) to cause issues with a heap of 1 GB. – BeeOnRope Sep 23 '16 at 16:03
  • 1
    @BeeOnRope: there is no such simple relationship between load factor and collision rate. You can have a load factor of `0.1` with 100% collisions and you can have a load factor of `1` and no collisions. This depends on how the hash codes of the current keys distribute over the particular table size and whether they will distribute better with a doubled table size, which is not necessarily the case. – Holger Sep 23 '16 at 16:35
  • Of course there are pathological cases and nothing is simple, but the general (statistically speaking) behavior _is_ simple, unless your hash function is terrible. If your hash function _is_ terrible, stop now and fix it! My claim is that the ideal capacity for a hash table is certainly not `N` in general (outside, for e.g., of pre-calculated perfect hash functions). For any type of open hashing, it is certainly much less than 1. For closed hashing (used by `HashMap`) it can be higher, and depends also on your desired space-time tradeoff. – BeeOnRope Sep 23 '16 at 16:58
  • 1
    @StuartMarks You have precisely described the extent of my "crappiness" complaint---that you have to divide by load factor to get the useful guarantee. The point did get smeared around in the long-winded discussion. – Marko Topolnik Sep 23 '16 at 17:00
  • 1
    FWIW, most `HashMap` implementations that use power-of-two backing array sizes can't have _more_ collisions after doubling. The worst case is simply the same number of collisions, but this would require a pathologically bad hash function to achieve (pretty much a malicious one given the smearing the JDK does). – BeeOnRope Sep 23 '16 at 17:00
  • @BeeOnRope You seem to have reversed the usage of "open hashing" and "closed hashing". `HashMap` is an example of _open hashing_, whereas open-addressed hashtables use _closed hashing_. – Marko Topolnik Sep 23 '16 at 17:05
  • 1
    @BeeOnRope: it seems you are overlooking that the key space is not identical to the table size. If you are putting `Integer` keys, which have a perfect hash function, into the `HashMap`, there are 2³² possible keys, so when the current table size is `16`, there are `2²⁸ == 268435456` keys which would end up in the same bucket and the “smearing” only changes *which* keys, not the amount. So when I put `8` keys, it is perfectly possible to have all `8` from the `268435456` keys that end up in the same bucket. Having a load factor of `≤0.5` implies reducing to `2²⁷ == 134217728` keys per bucket… – Holger Sep 23 '16 at 17:40
  • 1
    @BeeOnRope: now think about the `>65536³¹` possible string keys. There is no “pathologically bad hash function” required to have collisions. By the way, the “smearing function” has been changed several times in the `HashMap`. Now, it has been simplified to `h ^ (h>>>16)`, simply because more complicated “smearing” doesn’t pay off. – Holger Sep 23 '16 at 17:44
  • 1
    Of course I'm not claiming there aren't collisions! I'm claiming that your scenarios of 10M keys either all colliding or having zero collisions are pathological. More formally, my claim is simply that the average chain length (equivalently, the collision rate) is a _function_ of the load factor (you can look up the closed-form formulas in any decent resource). So, unlike `ArrayList`, the load factor is an important tunable for get/put performance, and a claim that you can simply set the load factor to 1 if the map size is known is false. – BeeOnRope Sep 23 '16 at 19:25
  • @the8472 Cliffc's work is relevant for highly concurrent applications; I'm talking about single-threaded performance. Also, Click's implementation fares quite badly in an insertion-only benchmark without pre-sizing. – Marko Topolnik Sep 24 '16 at 07:30
  • @BeeOnRope: That’s the problem with your reasoning. Somehow, you got into thinking that anybody here claimed that setting the load factor to `1` was a good idea. Scroll up and look again. I never said that. I just said that the *capacity* should be just `10M` for `10M`, not that the OP should play around with the load factor in any way. The default is `0.75`, not `1`. Not that it matters for this practical case, as specifying `10M` as initial capacity yields a capacity of `16M` practically, for all load factors between `0.6` and `1`. – Holger Sep 26 '16 at 12:02
  • @BeeOnRope: in other words, if you specify `n` for `n`, you already get excess space for two different reasons, the load factor and/or rounding to power of two. But this excess space is not the reason for the OP’s memory issues, in that regard, we agree anyway. That’s why I said, the OP should specify `n` for `n` and stop worrying about the capacity. And even if the OP specified something else, it would be merely a performance issue, not a memory issue. That’s what my first comment was about. – Holger Sep 26 '16 at 12:06
  • Right, I was referring to your "10M objects should ideally have a capacity of 10M" statement, which is _equivalent_ to a load factor of 1 - since size/capacity is how load factor is _defined_. Even aside from the vagaries of Java's particular resizing implementation, I'm saying that's an unsupported statement. There is nothing special about having a backing array of exactly that size (very much unlike the List case). – BeeOnRope Sep 26 '16 at 15:59
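
To put the sizing arithmetic from the comment thread above in one place: to guarantee room for N entries without a rehash you pass ceil(N / loadFactor) as the initial capacity, and HashMap then rounds that value up to the next power of two anyway. A minimal sketch, assuming the default 0.75 load factor (the helper class and method names are made up for illustration):

import java.util.HashMap;
import java.util.Map;

final class Maps {
    // Returns a HashMap sized so that `expected` entries never trigger a rehash.
    // 0.75f is HashMap's default load factor; the division matches the
    // "capacity = entries / load factor" rule discussed above.
    static <K, V> Map<K, V> presized(int expected) {
        int initialCapacity = (int) Math.ceil(expected / 0.75);
        return new HashMap<>(initialCapacity, 0.75f);
    }
}

// Usage for the question's worst case:
// Map<String, ReportResultView> aggregatedMap = Maps.presized(10_000_000);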

2 Answers

6

Summary: In this scenario the load factor might seem interesting, but it can't be the underlying cause of your OOMEs, since the load factor only controls the wasted backing array space, and by default (load factor of 0.75) that only consumes ~2.5% of your heap (and doesn't create GC pressure from a high object count). More likely, the space used by your stored objects and their associated HashMap.Entry objects has consumed the heap.

Details: The load factor for a HashMap controls the size of the underlying array of references used by the map. A smaller load factor means more empty array elements for a given number of entries. So in general, increasing the load factor results in less memory use, since there are fewer empty array slots[3].

That established, it is unlikely you can solve your OOMEs by adjusting the load factor. An empty array element only "wastes" 4 bytes[1], so for a map of 5M-10M entries, a load factor of 0.75 (the default) will waste something like 25 MB of memory[2].

That's only a small fraction of the 1,024 MB of heap memory you are allocating, so you aren't going to be able to solve your OOMEs by adjusting your load factor (unless you were using something very silly, like an extremely low load factor of 0.05). The default load factor will be fine.
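
As a rough illustration of that claim (a sketch, not a measurement; it assumes 4-byte compressed references and the 16M-slot table that 10M entries end up with at the default load factor):

// Space "wasted" by empty slots in the backing array, under the assumptions above.
long entries = 10_000_000L;
long tableSlots = 16L * 1024 * 1024;    // next power of two above entries / 0.75
long emptySlots = tableSlots - entries; // ~6.8M unused 4-byte reference slots
System.out.println("Empty-slot overhead: ~" + (emptySlots * 4) / (1024 * 1024) + " MB"); // ~25 MB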

Most likely it is the actual size of the objects you are storing, plus their associated Entry objects, that is causing the problem. Each mapping has a HashMap.Entry object that holds the key/value pair and a couple of other fields (e.g., the hash code and a pointer to the next item when chained). This Entry object itself consumes about 32 bytes; when added to the 4 bytes for the underlying array slot, that's roughly 36 bytes * 10M entries ≈ 360 MB of heap for the overhead of the entries alone. Then the actual objects you are storing take space too: if your object has even a handful of fields, they will be at least as large as the Entry objects, and your heap is pretty much exhausted.
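
And the same back-of-the-envelope style for the per-mapping overhead (again a sketch; the ~32-byte Entry and 4-byte slot figures are the assumptions from the paragraph above):

// Per-mapping bookkeeping: ~32-byte Entry object plus ~4-byte slot in the backing array.
long mappings = 10_000_000L;
long perEntryBytes = 32 + 4;
System.out.println("Entry overhead: ~" + (mappings * perEntryBytes) / 1_000_000 + " MB"); // ~360 MB, before keys and values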

The fact that you are getting a GC overhead limit exceeded error rather than a plain heap allocation failure generally means you are approaching the heap limit slowly while churning a lot of objects: in that scenario the GC tends to fail this way before the heap is actually exhausted.

So most likely you simply need to allocate more heap to your application, find a way of storing fewer elements, or reduce the per-element size (e.g., with a different data structure or object representation).


[1] Usually 4 bytes on HotSpot anyway, even when running the 64-bit JDK - although it may be 8 bytes on some 64-bit platforms if compressed oops is disabled for some reason.

[2] For example, with 10M entries and the default 0.75 load factor, the backing array ends up with 16M (2²⁴) slots, so roughly 6.8M of them sit empty; at 4 bytes per slot that is ~25 MB. In the worst case, right after a resize the table is only half the load factor full (0.375), so up to ~62.5% of the slots can be empty, which still keeps the waste in the tens of megabytes. During a rehash you could temporarily add another factor of ~1.5, since both the old and new backing arrays will be on the heap simultaneously; once the map size stabilizes, this no longer applies.

[3] This is true even with chaining, since in general the use of chaining doesn't increase memory use (i.e., the Entry elements already have the "next" pointer embedded regardless of whether the element is in a chain or not). Java 8 complicates things a bit, since the HashMap implementation was improved so that large chains may be converted into trees, which can increase the footprint.

BeeOnRope
0

to avoid java.lang.OutOfMemoryError: GC overhead limit exceeded?

When a HashMap resizes, it needs to reallocate its internal table, so you either need to give your VM enough memory to juggle that temporary copy, or pre-size the HashMap to prevent the resizing from ever occurring.
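
For example, a sketch of the pre-sizing option, using the question's 10M worst case and HashMap's default 0.75 load factor (adjust the figure to your real upper bound):

// Pre-size for the worst case so the table never has to be reallocated while the map is filled.
private final Map<String, ReportResultView> aggregatedMap =
        new HashMap<>((int) Math.ceil(10_000_000 / 0.75), 0.75f);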

You could also take a look at the hashmap implementation from https://github.com/boundary/high-scale-lib which should provide less disruptive resizing behavior.

the8472
  • 2
    Unlike with `ArrayList`, resizing a `HashMap` doesn't have a big transient impact on memory use, because only the backing array is resized; the `Entry` objects are simply moved over. The `Entry` objects (not even including the underlying key or value objects stored) are an order of magnitude larger than the backing arrays, so the resize spike isn't high (unlike `ArrayList`, where the backing array is 100% of the overhead). – BeeOnRope Sep 23 '16 at 21:19
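
Rough numbers for that comment, as a sketch under the same assumptions as the answer above (4-byte compressed references, ~32-byte Entry objects, and a final 16M-slot table for 10M entries):

// During a resize only the reference array is duplicated; Entry objects are relinked, not copied.
long newTableBytes    = 4L * (16L * 1024 * 1024); // ~64 MB new backing array
long oldTableBytes    = 4L * ( 8L * 1024 * 1024); // ~32 MB old array, dropped right after the resize
long entryObjectBytes = 32L * 10_000_000L;        // ~305 MB of Entry objects, untouched by the resize
System.out.println("Resize spike: ~" + oldTableBytes / (1024 * 1024) + " MB extra, vs ~"
        + entryObjectBytes / (1024 * 1024) + " MB of Entry objects");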