24

I try to pay attention to good performance and clean code all the time.

I'm having difficulties trying to grasp whether it's sane to have a HashMap with keys of 150 characters.

  • Is there an unwritten law to the length of the HashMap key?
  • Is it considered bad practice to have String keys of let's say 150 characters?
  • Does it affect performance? At which length?
Caroline
  • 920
  • 2
  • 8
  • 25
  • Each successful get operation is inevitably linear in the key length. I think that is the only concern that needs taking into account. That still doesn't mean that 15-char keys will be 10 times faster to look up. – Marko Topolnik May 12 '13 at 11:17

4 Answers4

19

Not really, 150 chars String is relatively trivial to calculate an hashCode for.

That being said, in circumstances like this I would advise you to test it!

Create a routine that populates an HashMap with, say, insert a size here that is representative of your use scenario random values with 5 character strings as keys. Measure how long it takes. Then do the same for 15 character keys, and see how it scales.

Also, Strings in Java are immutable, which means that hashCode can be cached for each String that is stored in the String Constant Pool, and doesn't need to be recalculated when you call hashCode on the same String object.

This means that although you're calculating larger hash codes when creating your map, on access many of those will already be pre-calculated and cached, making the size of the original String even less relevant.

pcalcao
  • 15,789
  • 1
  • 44
  • 64
  • I would pick a size which is representative of the sort of size you are likely to have. – Peter Lawrey May 12 '13 at 10:59
  • I did a quick test on my machine. It took about 24 ms to compute the hash code of 1000 strings, each with the length of 10000 characters. – hongliang Feb 20 '14 at 19:49
  • 1. `String.hashCode` gets cached upon the first computation, unrelated to the pool. The computation itself takes [4 cycles per char](http://stackoverflow.com/q/21745619/581205), which is 4 times more than the optimum. 2. Because of this caching, the equals method will probably dominate the time. – maaartinus Sep 13 '14 at 07:11
9

Is there an unwritten law to the length of the HashMap key?

If there is, it is also unspoken. I would measure your use case in a profiler and only worry about the things you can measure as a problem, not the things you can imagine might be a problem.

Is it considered bad practice to have String keys of let's say 150 characters?

I doubt it.

Does it affect performance? At which length?

Everything affects performance, usually to small to matter or sometimes even measure. The question should be; do you need 150 character keys. If you do, then use them.


There is an exotic case where adding strings with hashCode() of zero is a bad idea. This is because in Java 1.0 to 6 doesn't optimise the use case of a hashCode of zero and it can be predicted for denial of service attacks. Java 7 fixes this by having a secondary, less predictable hashcode.

Why doesn't String's hashCode() cache 0?

Community
  • 1
  • 1
Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • 1
    nit-picking here, but java uses substantially more than 1 byte per character. – radai May 12 '13 at 10:57
  • @radai True, even with `-XX:+UseCompressedStrings` which can use one byte per character. – Peter Lawrey May 12 '13 at 10:58
  • Another nitpick: Java 7's `HashMap` fixes it, not `String`. `String#hashCode` is basically unchanged. – Marko Topolnik May 12 '13 at 11:12
  • @MarkoTopolnik Java 7's String fixes it by supporting hash32() http://java-performance.info/changes-to-string-java-1-7-0_06/ Changing String is key to how this is cached. – Peter Lawrey May 12 '13 at 11:22
  • True... however `int32` is just a private helper method and the whole mechanism is not a part of the public API, or even publicly accessible. – Marko Topolnik May 12 '13 at 11:28
  • Not constructive. Standard tautological "benchmark it yourself" reply. – Aleksandr Dubinsky May 12 '13 at 11:35
  • @MarkoTopolnik In that case, the same applied to HashMap. – Peter Lawrey May 12 '13 at 13:03
  • 2
    My point: the whole mechanism is inaccessible outside of `HashMap` and other Java Collection Framework classes. Whatever logic you have of your own that relies on `Object#hashCode`, you can't profit from the enhancement. This is something to be aware of. – Marko Topolnik May 13 '13 at 08:11
  • 2
    And Java 8 unfixes it by removing `hash32`. Which is OK due to using tree bins in case of many conflicts. – maaartinus Sep 13 '14 at 07:16
6

Long answer: A quick look at the source code of String::hashCode() reveals that the hash is cached after the first call. Meanwhile, String::equals() is O(n) if the strings are equal yet not identical (ie, equals() is true but == is false because they're allocated at different addresses).

So the impacts on performance you will see are with:

  • Passing never-before-hashed strings in calls to HashMap functions. However, generating lots of new strings will impact performance in itself.

  • Calls to HashMap::get() and HashMap::put()using a string key that is equal to a key already in the HashMap (because if the key isn't in the collection, then most likely only hashCode() will be called. But if it is, equals() will compare all characters until it determines the strings are equal). But only if the strings passed to these functions are not the same objects that are already in the HashMap, because in that case equals() is very fast.

  • Moreover, string literals, string constants, and manually intern()'d strings join the String Constant Pool, in which all "equal" strings are the same object with the same address. So if working exclusively with such strings, hashCode and equals are very fast.

Of course, the performance impact won't be at all noticeable unless you're doing the aforementioned operations in a tight loop (because 150 chars isn't long and hashCode() and equals() are both efficient).

Short answer: Benchmark.

Aleksandr Dubinsky
  • 22,436
  • 15
  • 82
  • 99
4

First, there is no "unwritten rule". If long strings as keys make sense from an algorithmic perspective, use them. If profiling indicates that there is a problem, then you optimize.

So how can long strings affect hash table performance?

  • Long strings take more memory than short ones, and that could result in measurably longer garbage collection times, and other secondary performance effects related to hardware memory caches, TLBs and (potentially) physical memory page contention.

  • The hashcode algorithm for String uses all characters of the string and therefore its cost is proportional to the string length. This is mitigated by the fact that String hashcodes are cached. (The 2nd and subsequent time you call hashcode on a String, you get the cached value.) However, that only helps (here) if you do multiple hash table operations with the same String object as a key.

  • When you get a hash collision, the hash table falls back to using String.equals() to compare keys while searching the selected hash chain. In the worst case (e.g. when the strings are equal but not ==), String.equals() involves comparing all characters of the 2 strings.

As you can see, these effects are going to be specific to the actual application, and hence they are hard to predict. Hence an "unwritten rule" is unlikely to be helpful.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216