2

I have data that I want to lookup by key.

My particular use case is that the data (key/value and number of elements) does not change once the map is initialised. All key/value values are known at once.

I have generally use a HashMap for this with default constructor (default initial capacity and load factor).

What is the best way build this Map? If I was to use HashMap, what should the default initial capacity and load factor be set to? Is Map.copyOf() a better solution? Does the size of the map matter (20 elements vs 140,000)?

This article https://docs.oracle.com/en/java/javase/15/core/creating-immutable-lists-sets-and-maps.html#GUID-6A9BAE41-A1AD-4AA1-AF1A-A8FC99A14199 seems to imply that non mutable Map returned by Map.copyOf() is more space efficient.

  • I would use an immutable map so a future programmer (including yourself) does not forget that the map should never change. – SephB Jul 19 '22 at 17:09
  • I think you need to provide a little more context, as the answer can change depending on, among other things, how this map is expected to be created. Will all of the key/value pairs be known at one time? Or will the map be built up over the course of several insertions before needing to become immutable? This previous question may help: https://stackoverflow.com/questions/22636575/unmodifiablemap-java-collections-vs-immutablemap-google – anqit Jul 19 '22 at 17:40

1 Answers1

2

HashMap is fairly close to optimal in most cases already. The array of buckets doubles in capacity each time, so it's most wasteful when you have (2^N) + 1 items, since the capacity will necessarily be 2^(N+1) (i.e. 2049 items require capacity of 4096, but 2048 items fit perfectly).

In your case, specifying an initial size will only prevent a few reallocations when the map is created, which if it only happens once probably isn't relevant. Load factor is not relevant because the map's capacity will never change. In any case, if you did want to pre-size, this would be correct:

new HashMap<>(numItems, 1);

Does the size of the map matter (20 elements vs 140,000)?

It will have an impact, but not a massive one. Items are grouped into buckets, and buckets are structured as lists or trees. So the performance is mostly dependent on how many items are in a given bucket, rather than the total number of items across all buckets.

What's important is how evenly distributed across your buckets the items are. A bad hash code implementation will result in clustering. Clustering will start to move O(1) operations towards O(log n), I believe.

// The worst possible hashCode impl.
@Override
public int hashCode() { return 0; } // or any other constant 

If you have the same items in the map across multiple invocations of your application (not clear from the question if that's the case), and if the class of the key is under your control, then you have the luxury of being able to tweak the hashCode implementation to positively affect the distribution, e.g. by using different prime numbers as a modulus. This would be trial and error, though, and is really only a micro-optimization.

As for the comments/answers addressing how to confer immutability, I'd argue that that's a separate concern. First work out what map is actually optimal, then worry about how to confer immutability upon it, if it isn't already. You can always wrap a mutable map in Collections.unmodifiableMap. Supposedly Guava's ImmutableMap is slower than HashMap, and I suspect other immutable variants will struggle to exceed the performance of HashMap too.

Michael
  • 41,989
  • 11
  • 82
  • 128
  • Thank you for this great info. The Immutable part of the question is not realy about the Unmodifiable aspect of the map. It is more about Java 10's Map.copyOf() which creates a new type of Map (ImmutableCollections.MapN). This Map is documented as being more space efficient. There are also possible optimisation knowing the immutable nature of this collection (I don't know). – Charles Quinlan Jul 19 '22 at 19:07
  • 2
    @CharlesQuinlan It might be more space efficient but the performance is worse, at least for a big map. I ran a microbenchmark, and HashMap outperformed copyOf by 50%, just doing `get`s on a map with 100k entries. As is often the case, optimising for memory negatively affects performance, and vice versa. Here is the benchmark code in case you want to run it yourself: https://ideone.com/4lE2LR You'll need to import the JMH dependencies via Maven or Gradle though – Michael Jul 19 '22 at 19:49
  • I doubled checked your benchmark and got similar results. This is a bit counterintuitive given that the Map.copyOf() method exists since Java 10 and has no reason to be slower than the corresponding HashMap implementation. – Charles Quinlan Jul 19 '22 at 20:58
  • @CharlesQuinlan I'm not sure "no reason" is accurate. MapN is structured as an array of key, then value, alternating (i.e. `[K,V,K,V,K,V]`). To find a key it simply iterates until `equals` is true, then takes the following element to get the associated value. Now here is my speculation: the key difference is that allows HashMap to work faster is right there in the name: hash. I think conceptually, `MapN` is supposed to provide constant performance regardless of any properties of the keys (i.e. uniqueness of hash). It simply isn't allowed to care about hash at all. (1/2) – Michael Jul 19 '22 at 22:22
  • @CharlesQuinlan So that benefits HashMap in 2 respects. HashMap has *fewer comparisons* to do, because it can take the hash code of the lookup key and jump straight to the right bucket. But also the comparisons it does have to do (i.e. more than 1 item in a bucket) are *also* faster, because comparing hashcodes is comparatively faster than doing `equals` on every key. (2/2) – Michael Jul 19 '22 at 22:25
  • Thank you Micheal for taking the time. It is worst than this. In the MapN implementation, If key0 maps to array index 0 and key1 maps to array index 1 and key2 also maps to array index 0. Looking for key2 will take three comparisons. – Charles Quinlan Jul 19 '22 at 23:27
  • Actually, MapN does [use the key's hashcode](https://github.com/openjdk/jdk18/blob/master/src/java.base/share/classes/java/util/ImmutableCollections.java#L1187). It stores the map in an array as you said, but the array is in fact indexed by hashcode. Otherwise, `Map.copyOf` would have returned a map that had _O(N)_ time complexity and users would be horrified. – Klitos Kyriacou Jul 20 '22 at 10:16
  • 1
    @KlitosKyriacou Yes it uses the key's hashcode to compute the array index but there are no buckets when two or more keys map to the same index. The values are just added to the next available spot (non null) in the array. In my example, key2 maps to array index 0 but the value is actually stored at position 2 requiring three comparisons. The situation gets even worst. If I was looking for a missing key that also maps to array index 0, I would need the same 3 comparisons and continue comparing until I find a null value in the array comparing mapping of duplicate index 0, 1, 2 and possible more. – Charles Quinlan Jul 20 '22 at 12:12
  • 1
    @CharlesQuinlan Also has a knock-on effect. If you insert two keys at index 2, then the second is shifted to index 4. Then you try to insert a genuine key at index 4, it will have to go to index 6; 4 is taken. HashMap doesn't have that knock-on issue. – Michael Jul 20 '22 at 12:19