Best practices on what should be key in a hashtable

Question

The best look-up structure is a HashTable. It provides constant access on average (linear in worst case).
This depends on the hash function. Ok.
My question is the following. Assuming a good implementation of a HashTable e.g. HashMap is there a best practice concerning the keys passed in the map?I mean it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.
Example the size of the key? For example in a good hashmap (in the way described above) if we used String as keys, won't the "bottleneck" be in the string comparison for equals (trying to find the key)? So should the keys be kept small? Or are there objects that should not be used as keys? E.g. a URL? In such cases how can you choose what to use as a key?

I would say that the major constraint is that it has to be the thing that you want to base the lookup on ;) — Oliver Charlesworth, Jan 16 '13 at 20:21
The general rule is to use as the key whatever you need to look up with. You're not going to find a more efficient data structure in most cases without a lot of work, so you shouldn't bother fussing over it too much. — Louis Wasserman, Jan 16 '13 at 20:22
The _size_ of the key does not matter. Important is, how efficient it's `hashCode()` calculation is. — jlordo, Jan 16 '13 at 20:22
@LouisWasserman:I started to wonder this,since I saw a report from some of the commercial code-assesment tools pointing out that a hashmap with `URL` as a key is a potential performance issue. — Cratylus, Jan 16 '13 at 20:25
java.net.URL has a hashCode() method that is both synchronized and really complicated, so it could be a reasonable recommendation to key such a map on the actual string value of the URL if it's practical. I don't know how to scale that information into an answer to your much more broad question though :) — Affe, Jan 16 '13 at 20:31
The String hashCode calculation is linear from the length of the String. For the URL it is the same. However, when searching within a bucket you actually call the equals method log2(bucket_len) times, so if you have a very large Hash with very long Strings as keys - you might see it would work slightly longer. But the pernalty would not be too severy, so you should use whatever is more usefull — Petr, Jan 16 '13 at 20:32
@Petr:Why is it `log(bucket_len)`.It is a linked list.It should be `O(N)`, right? — Cratylus, Jan 16 '13 at 20:34
@Affe however the hashCode for the URL is cashed after it is calculated for the first time and for the String it is not cashed. So it is not so obvious what would work faster. — Petr, Jan 16 '13 at 20:35
URL's equals method is horribly broken and under no circumstances should you use it as a map key. — artbristol, Jan 16 '13 at 20:51
Should go without saying, but nothing that uses the default `hashCode` and `equals` implementations should be used as a key. — Alex, Jan 16 '13 at 22:26
After deciding what the key logically should be, the next step is to pick a data structure. If, for example, the keys are the integers in the range 0 through 99, a simple hundred element array would be much better than a hash table. — Patricia Shanahan, Jan 17 '13 at 04:31

Luigi R. Viggiano · Answer 1 · 2013-01-17T03:10:01.167

The best performing key for an HashMap is probably an Integer, where hashCode() and equals() are implemented as:

public int hashCode() {
    return value;
}

public boolean equals(Object obj) {
    if (obj instanceof Integer) {
        return value == ((Integer)obj).intValue();
    }
    return false;
}

Said that, the purpose of an HashMap is to map some object (value) to some others (key). The fact that a hash function is used to address the (value) objects is to provide fast, constant-time access.

it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.

The recommendation is to Map objects to what you need: don't think what is faster; but think what is the best for your business logic to address the objects to retrieve.

The important requirement is that the key object must be immutable, because if you change the key object after storing it in the Map it may be not possible to retrieve the associated value later.

The key word in HashMap is Map. Your object should just map. If you sacrifice the mapping task optimizing the key, you are defeating the purpose of the Map - without probably achieving any performance boost.

I 100% agree with the first two comments in your question:

the major constraint is that it has to be the thing that you want to base the lookup on ;)
– Oli Charlesworth

The general rule is to use as the key whatever you need to look up with.
– Louis Wasserman

Remember the two rules for optimization:

Don't.
(for experts only) don't yet.

The third rule is: profile before to optimize.

Surprising that Integer uses the value as the hash code, as this will have a terrible distribution for most applications. I think HashMap accounts for poorly distributed hashes, but in general one shouldn't rely on that behavior. — Alex, Jan 16 '13 at 22:17

score 1 · Answer 2 · answered Jan 16 '13 at 20:32

You should use whatever key you want to use to lookup things in the data structure, it's typically a domain-specific constraint. With that said, keep in mind that both hashCode() and equals() will be used in finding a key in the table.

hashCode() is used to find the position of the key, while equals() is used to determine if the key you are searching for is actually the key that we just found using hashCode().

For example, consider two keys a and b that have the same hash code in a table using separate chaining. Then a search for a would require testing if a.equals(key) for potentially both a and b in the table once we find the index of the list containing a and b from hashCode().

gaborsch · Answer 3 · 2013-01-17T09:31:55.830

I digged up the implementation. I had an assumption that the effectiveness of the hashCode() method will be the key factor.

When I looked into the HashMap() and the Hashtable() implementation, I found that the implementation is quite similar (with one exception). Both are using and storing an internal hash code for all entries, so that's a good point that hashCode() is not so heavily influencing the performance.

Both are having a number of buckets, where the values are stored. It is important balance between the number of buckets (say n), and the average number of keys within a bucket (say k). The bucket is found in O(1) time, the content of the bucket is iterated in O(k) size, but the more bucket we have, the more memory will be allocated. Also, if many buckets are empty, it means that the hashCode() method for the key class does not the hashcode wide enough.

The algorithm works like this:

Take the `hashCode()` of the Key (and make a slight bijective transformation on it)
Find the appropriate bucket
Loop through the content of the bucket (which is some kind of LinkedList)
Make the comparison of the keys as follows:
1. Compare the hashcodes 
    (it is calculated in the first step, and stored for the entry)
2. Examine if key `==` the stored key (still no call) 
    (this step is missing from Hashtable)
3. Compare the keys by `key.equals(storedKey)`

To summarize:

hashCode() is called once per call (this is a must, you cannot do without it)
equals() is called if the hashCode is not so well spread, and two keys happen to have the same hashcode

The same algorithm is for get() and put() (because in put() case you can set the value for an existing key). So, the most important thing is how the hashCode() method was implemented. That is the most frequently called method.

Two strategies are: make it fast and make it effective (well-spread). The JDK developers made efforts to make it both, but it's not always possible to have them both.

Numeric types are good
Object (and non-overriden classes) are good (hashCode() is native), except that you cannot specify an own equals()
String is not good, iterates through the characters, but caches after that (see my comment below)
Any class with synchronized hashCode() is not good
Any class that has an iteration is not good
Classes that have hashcode cache are a bit better (depends on the usage)

Comment on the String: To make it fast, in the first versions of JDK the String hash code calculation was made for the first 32 characters only. But the hashcode it produced was not well spread, so they decided to take all the characters into the hashcode.

score 0 · Answer 4 · answered Jan 17 '13 at 00:46

it is recommended that the key must be an immutable object but I was wondering if there are other recommendations.

The key of the value should be final.

Most times a field of the object is used as key. If that field changes then the map cannot find it:

void foo(Employee e) {
  map.put(e.getId(), e);
  String newId = e.getId() + "new";
  e.setId(newId);
  Employee e2 = e.get(newId);
  // e != e2 !
}

So Employee should not have a setId() method at all, but that is difficult because when you are writing Employee you don't know what it will be keyed by.

Best practices on what should be key in a hashtable

4 Answers4

Linked