
Let's say that I need to make a mapping from String to an integer. The integers are unique and form a continuous range starting from 0. That is:

Hello -> 0
World -> 1
Foo   -> 2
Bar   -> 3
Spam  -> 4
Eggs  -> 5
etc.

There are at least two straightforward ways to do it. With a hashmap:

HashMap<String, Integer> map = ...
int integer = map.get(string); // Plus maybe null check to avoid NPE in unboxing.

Or with a list:

List<String> list = ...
int integer = list.indexOf(string); // Plus maybe check for -1.

Which approach should I use, and why? Arguably the relative performance depends on the size of the list/map, since List#indexOf() is a linear search using String#equals() -> O(n) efficiency, while HashMap#get() uses the hash to narrow down the search -> certainly more efficient when the map is big, but maybe inferior when there are just a few elements (there must be some overhead in calculating the hash, right?).

Since benchmarking Java code properly is notoriously hard, I would like to get some educated guesses. Is my reasoning above correct (list is better for small, map is better for large)? What is the threshold size approximately? What difference do various List and HashMap implementations make?

Joonas Pulakka
  • You may want to change to `HashMap` in the sample code snippet. – aioobe Oct 21 '10 at 09:49
  • 2
    Third option: Use an enum (http://stackoverflow.com/questions/604424/java-enum-converting-string-to-enum) and defer to the implementation's guess as to which is quicker (or its internal hyper-optimised implementation, which it may or may not have). – T.J. Crowder Oct 21 '10 at 09:50
  • @aioobe: Thanks, clarified that. @T.J: enum is a good idea, but works only when the mapping is known already at compile-time. – Joonas Pulakka Oct 21 '10 at 09:53
  • 1
    If you are using the same String objects throughout the application, String.intern() and IdentityHashMap will provide good performance. However, you *must* intern your strings and this technique only makes sense if your application lets you hold onto those String references so you only need to intern them once each. – ide Oct 21 '10 at 10:07
  • @ide: Interning (or using string literals - they're automatically interned) is an important point. In fact, now that I look at HashMap's implementation, the equals() check is only used if s1.hashCode() == s2.hashCode() && s1 != s2, which is a relatively rare case (a hash collision) if both s1 and s2 are interned. Additionally, String instances calculate their hash code only once and cache it thereafter. Thus a plain HashMap should provide very good performance for interned Strings. – Joonas Pulakka Oct 22 '10 at 10:54
  • And to add to the previous: at least ArrayList#indexOf() always uses equals(), and although String#equals() starts with an identity check that is quick for the same instance (interned objects), anything that is not the same instance is subjected to an expensive character-by-character comparison. Referring to my original question, I think we can conclude that a HashMap is almost always a better choice than an (Array)List. – Joonas Pulakka Oct 22 '10 at 11:20
  • Aside: The culprit is not just equals() -- you would like to avoid the hashCode() calls and perhaps the memory overhead of all those cached hash codes. In practice, I have measured a 15% improvement in speed by switching from HashMap to IdentityHashMap, but there are few applications where it is applicable. – ide Oct 22 '10 at 19:20

6 Answers

6

A third option and possibly my favorite would be to use a trie:

(image omitted: a trie of the example strings, where the text in each node is the string spelled out on the path from the root to that node)

I bet it beats the HashMap in performance (no collisions + the fact that computing the hash code is O(length of string) anyway), and possibly also the List approach in some cases (such as if your strings have long common prefixes, as the indexOf would waste a lot of time in the equals methods).
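For what it's worth, here is a minimal sketch of the trie idea with one character per node and the index stored at the terminal node (the class and method names are made up for illustration; the actual picture above coalesced nodes, which this sketch does not do):

```java
import java.util.HashMap;
import java.util.Map;

class StringIndexTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        int index = -1; // -1 means "no string ends at this node"
    }

    private final Node root = new Node();
    private int nextIndex = 0;

    // Assigns the next free index to s (if it is not already present).
    void add(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        if (node.index == -1) {
            node.index = nextIndex++;
        }
    }

    // Walks the trie character by character: O(length of s), no hashing of s needed.
    int indexOf(String s) {
        Node node = root;
        for (int i = 0; i < s.length(); i++) {
            node = node.children.get(s.charAt(i));
            if (node == null) return -1;
        }
        return node.index;
    }
}
```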

When choosing between List and Map I would go for a Map (such as HashMap). Here is my reasoning:

  • Readability

    The Map interface simply provides a more intuitive interface for this use case.

  • Optimization in the right place

    I'd say if you're using a List you would be optimizing for the small cases anyway. That's probably not where the bottleneck is.

A fourth option would be to use a LinkedHashMap, iterate through it if the size is small, and get the associated number if the size is large.

A fifth option is to encapsulate the decision in a separate class altogether. In this case you could even implement it to change strategy at runtime as the list grows.
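A rough sketch of that fifth option (the class name is invented, and the threshold of 16 is an arbitrary guess rather than a measured crossover; you would tune it by benchmarking):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StringIndexer {
    private static final int THRESHOLD = 16; // arbitrary; tune by benchmarking

    private final List<String> list = new ArrayList<>();
    private Map<String, Integer> map; // null until we switch strategies

    void add(String s) {
        list.add(s);
        if (map != null) {
            // Already switched: keep the map in sync.
            map.put(s, list.size() - 1);
        } else if (list.size() > THRESHOLD) {
            // Grew past the threshold: build the map once from the list.
            map = new HashMap<>();
            for (int i = 0; i < list.size(); i++) {
                map.put(list.get(i), i);
            }
        }
    }

    int indexOf(String s) {
        if (map != null) {
            Integer i = map.get(s);
            return i == null ? -1 : i; // mirror List#indexOf's -1 convention
        }
        return list.indexOf(s);
    }
}
```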

aioobe
  • Your trie has coalesced some of the letters into a single node (a trie with one letter per node would have an `O(length-of-string)` lookup time) - that's going to be a nightmare to write (each insert would have to coalesce / expand nodes as it looked for the right place)! – Nicholas White Oct 21 '10 at 12:24
  • Hum, no, perhaps the picture looks like it, but the text in the node merely describes the string from the root to that node. – aioobe Oct 21 '10 at 12:26
  • @aioobe I have the same tendency to use tries for these scenarios but, they are not thread-safe. How can you use tries and still get the benefits of using a highly concurrent data structure such as ConcurrentHashMap ? – Ajay Jul 03 '12 at 23:37
4

You're right: a List would be O(n), a HashMap would be O(1), so a HashMap would be faster for n large enough that the cost of calculating the hash doesn't outweigh the cost of the List's linear search.

I don't know the threshold size; that's a matter for experimentation or better analytics than I can muster right now.

duffymo
  • HashMaps are not O(1), as you may have multiple values having the same hash value. – Thorbjørn Ravn Andersen Oct 21 '10 at 10:56
  • @Thorbjørn Ravn Andersen: yes, but if you choose the hash function carefully, you can keep it down to a minimum. – Chii Oct 21 '10 at 11:06
  • @Thorbjørn Ravn Andersen - they are typically described as `O(1)` on average, but the worst case is `O(N)`. Unless you've got a bad hash function, the probability of the worst case becomes vanishingly small as `N` increases. – Stephen C Oct 21 '10 at 11:13
  • @Stephen The O-function does not describe the average, but the worst case. – Thorbjørn Ravn Andersen Oct 21 '10 at 11:18
  • @Chii, And doing so requires knowing the dataset up front, yes? – Thorbjørn Ravn Andersen Oct 21 '10 at 11:18
  • 2
    @Thorbjørn Ravn Andersen - not true. It describes what it describes. For example: http://en.wikipedia.org/wiki/Quicksort#Average_complexity – Stephen C Oct 21 '10 at 11:23
  • @Stephen, please link to the correct information. http://en.wikipedia.org/wiki/Big_O_notation. " big O notation (also known as Big Oh notation, Landau notation, Bachmann–Landau notation, and asymptotic notation) describes the limiting behavior of a function when the argument tends towards a particular value or infinity". Anything depending on the value of n (which the hash bucket size does) is more than O(1) – Thorbjørn Ravn Andersen Oct 21 '10 at 12:10
  • 1
    @Thorbjørn Ravn Andersen hashmap has internal enhanced hash function that prevents collision and falling in to same bucket for different keys. – Dead Programmer Oct 21 '10 at 12:22
  • 1
    @Thorbjørn Ravn Andersen - java.util.HashMap resizes the map using a load factor. This reduces the *average* number of collisions per bucket to a value that is *independent of the number of map entries `N`*. Hence the `O(1)` time per lookup using `HashMap`. – Stephen C Oct 21 '10 at 15:22
4

Your reasoning is correct on both points:

  • HashMaps are better (they use a hash)
  • Benchmarking Java code is hard

But at the end of the day, you're just going to have to benchmark your particular application. I don't see why HashMaps would be slower for small cases, but benchmarking will tell you whether they are or not.

One more option: a TreeMap is another map data structure, which uses a tree rather than a hash to access the entries. If you are doing benchmarking, you might as well benchmark that too.

Regarding benchmarking, one of the main problems is the garbage collector. However, if you run a test which doesn't allocate any objects, that shouldn't be a problem. Fill up your map/list, then write a loop that gets N random elements and time it; that should be reasonably reproducible and therefore informative.
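For example, a crude benchmark along those lines (the sizes, key shapes and iteration counts are arbitrary, and the checksum exists only to stop the JIT from eliminating the lookups as dead code; a real measurement would use a harness with proper warmup):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class LookupBench {
    public static void main(String[] args) {
        int n = 1000;           // number of mapped strings
        int lookups = 100_000;  // number of random lookups to time

        List<String> list = new ArrayList<>();
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            String s = "key" + i;
            list.add(s);
            map.put(s, i);
        }

        // Pre-generate the probe keys so allocation happens outside the timed loops.
        Random rnd = new Random(42);
        String[] probes = new String[lookups];
        for (int i = 0; i < lookups; i++) {
            probes[i] = list.get(rnd.nextInt(n));
        }

        long sum = 0;
        long t0 = System.nanoTime();
        for (String p : probes) sum += map.get(p);
        long mapNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        for (String p : probes) sum += list.indexOf(p);
        long listNanos = System.nanoTime() - t0;

        // Print the checksum so the loops cannot be optimized away.
        System.out.println("map:  " + mapNanos / 1_000_000 + " ms");
        System.out.println("list: " + listNanos / 1_000_000 + " ms");
        System.out.println("(checksum " + sum + ")");
    }
}
```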

Adrian Smith
2

Unfortunately, you are going to have to benchmark this yourself, because the relative performance will depend critically on the actual String values, and also on the relative probability that you will test a string that is not in your mapping. And of course, it depends on how String.equals() and String.hashCode() are implemented, as well as the details of the HashMap and List classes used.

In the case of a HashMap, a lookup will typically involve calculating the hash of the key String, and then comparing the key String with one or more entry key Strings. The hashcode calculation looks at all characters of the String, and is therefore dependent on the key String's length. The equals operation will typically examine all of the characters when it returns true and considerably fewer when it returns false. The actual number of times that equals is called for a given key string depends on how the hashed key strings are distributed. Normally, you'd expect an average of 1 or 2 calls to equals for a "hit" and maybe up to 3 for a "miss".

In the case of a List, a lookup will call equals for an average of half the entry key Strings in the case of a "hit" and all of them in the case of a "miss". If you know the relative distribution of the keys that you are looking up, you can improve the performance in the "hit" case by ordering the list. But the "miss" case cannot be optimized.

In addition to the trie alternative suggested by @aioobe, you could also implement a specialized String to integer hashmap using a so-called perfect hash function. This maps each of the actual key strings to a unique hash within a small range. The hash can then be used to index an array of key/value pairs. This reduces a lookup to exactly one call to hash function and one call to String.equals. (And if you can assume that supplied key will always be one of the mapped strings, you can dispense with the call to equals.)

The difficulty of the perfect hash approach is in finding a function that works for the set of keys in the mapping and is not too expensive to compute. AFAIK, this has to be done by trial and error.
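To illustrate the trial-and-error idea, here is a toy brute-force search (not a production perfect-hashing algorithm such as gperf or CHD; class name and parameters are invented): it tries multipliers for a simple polynomial string hash until every key lands in a distinct slot of a table somewhat larger than the key set.

```java
class PerfectHashFinder {
    // Returns a multiplier m such that hash(key, m) % tableSize is unique
    // for every key, or -1 if none is found within the attempt budget.
    static int findMultiplier(String[] keys, int tableSize, int maxAttempts) {
        for (int m = 1; m <= maxAttempts; m += 2) { // odd multipliers only
            boolean[] used = new boolean[tableSize];
            boolean ok = true;
            for (String key : keys) {
                int slot = hash(key, m) % tableSize;
                if (used[slot]) { ok = false; break; } // collision: try next m
                used[slot] = true;
            }
            if (ok) return m;
        }
        return -1;
    }

    // Polynomial string hash with a variable multiplier.
    static int hash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * multiplier + s.charAt(i);
        }
        return h & 0x7fffffff; // keep it non-negative
    }
}
```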

But the reality is that simply using a HashMap is a safe option, because it gives O(1) performance with a relatively small constant of proportionality (unless the entry keys are pathological).

(FWIW, my guess is that the break-even point where HashMap.get() becomes better than List.contains() is less than 10 entries, assuming that the strings have an average length of 5 to 10.)

Stephen C
  • You could add that the number of "hits" and "misses" can be tweaked by setting the load-factor. – aioobe Oct 21 '10 at 10:25
  • @aioobe - that tweaks the number of hashtable collisions. My "hits" and "misses" are about whether the input string can be mapped (a hit) or not (a miss). The hit/miss ratio is actually rather significant if a `List` is used. – Stephen C Oct 21 '10 at 10:48
1

From what I can remember, the list method will be O(n), but it would be quick to add items, as no computation occurs. You could get this down to O(log n) if you implemented a binary search (which requires keeping the list sorted) or some other searching algorithm. The hash is O(1), but it's slower to insert, since the hash needs to be computed every time you add an element.

I know that in .NET there's a special collection called a HybridDictionary that does exactly this: it uses a list up to a point, then a hash. I think the crossover is around 10 elements, so this may be a good line in the sand.

I would say you're correct in your above statement, though I'm not 100% sure if a list would be faster for small sets, and where the crossover point is.

jasper
1

I think a HashMap will always be better. If you have n strings each of length at most l, then String#hashCode and String#equals are both O(l) (in Java's default implementation, anyway).

When you do List#indexOf it iterates through the list (O(n)) and performs a comparison on each element (O(l)), to give O(nl) performance.

Java's HashMap has (let's say) r buckets, and each bucket contains a linked list. Each of these lists is of length O(n/r) (assuming the String's hashCode method distributes the Strings uniformly between the buckets). To look up a String, you need to calculate the hashCode (O(l)), look up the bucket (O(1) - one, not l), and iterate through that bucket's linked list (O(n/r) elements) doing an O(l) comparison on each one. This gives a total lookup time of O(l + (nl)/r).

As the List implementation is O(nl) and the HashMap implementation is O(nl/r) (I'm dropping the first l as it's relatively insignificant), lookup performance should be equivalent when r=1 and the HashMap will be faster for all greater values of r.

Note that you can set r when you construct the HashMap using the two-argument constructor HashMap(int initialCapacity, float loadFactor): set the initialCapacity to r and the loadFactor to n/r for your given n and chosen r.
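Concretely, with illustrative values n = 1000 and r = 2048 (the class and method names here are made up):

```java
import java.util.HashMap;
import java.util.Map;

class Sizing {
    static Map<String, Integer> sizedMap(int n, int r) {
        // initialCapacity r gives r buckets (HashMap rounds up to a power of two);
        // loadFactor n/r means the table won't resize until it holds n entries.
        return new HashMap<>(r, (float) n / r);
    }

    public static void main(String[] args) {
        Map<String, Integer> map = sizedMap(1000, 2048);
        map.put("Hello", 0);
        System.out.println(map.get("Hello"));
    }
}
```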

Nicholas White