72

I understand that the String class' hashCode() method is not guarantied to generate unique hash codes for distinct String-s. I see a lot of usage of putting String keys into HashMap-s (using the default String hashCode() method). A lot of this usage could result in significant application issues if a map put displaced a HashMap entry that was previously put onto the map with a truely distinct String key.

What are the odds that you will run into the scenario where String.hashCode() returns the same value for distinct String-s? How do developers work around this issue when the key is a String?

Marcus Leon
  • 55,199
  • 118
  • 297
  • 429
  • 2
    some "measurements": http://www.drmaciver.com/2008/07/javalangstringhashcode/ – Zed Oct 04 '09 at 14:46
  • @Zed, good link. I didn't realize the likelyhood of collisions was that high. – Marcus Leon Oct 04 '09 at 15:02
  • 27
    You seem to have missed how HashMaps (or any hash table, in any language) works. It is not like hashCode is a unique index into an array. It is used to calculate which bucket to place an object in and it is usually decided by taking the value returned by hashCode() and do a modulus with the number of buckets. Even when there are no collisions at all, the map will have to call the equals() method to figure out if it is indeed the key you are searching for or just some other key that happens to have the same hash code. The hashcode is just used to find the right bucket to go search in. – Fredrik Oct 04 '09 at 15:09
  • @Fredrik, yes, I've got it now. – Marcus Leon Oct 04 '09 at 15:54
  • @Zed: Note that the list of collisions in that post is for all lower-case. Changing the case would result in a different set of collisions. – erickson Oct 04 '09 at 18:03

5 Answers5

118

Developers do not have to work around the issue of hash collisions in HashMap in order to achieve program correctness.

There are a couple of key things to understand here:

  1. Collisions are an inherent feature of hashing, and they have to be. The number of possible values (Strings in your case, but it applies to other types as well) is vastly bigger than the range of integers.

  2. Every usage of hashing has a way to handle collisions, and the Java Collections (including HashMap) is no exception.

  3. Hashing is not involved in equality testing. It is true that equal objects must have equal hashcodes, but the reverse is not true: many values will have the same hashcode. So don't try using a hashcode comparison as a substitute for equality. Collections don't. They use hashing to select a sub-collection (called a bucket in the Java Collections world), but they use .equals() to actually check for equality.

  4. Not only do you not have to worry about collisions causing incorrect results in a collection, but for most applications, you also *usually* don't have to worry about performance - Java hashed Collections do a pretty good job of managing hashcodes.

  5. Better yet, for the case you asked about (Strings as keys), you don't even have to worry about the hashcodes themselves, because Java's String class generates a pretty good hashcode. So do most of the supplied Java classes.

Some more detail, if you want it:

The way hashing works (in particular, in the case of hashed collections like Java's HashMap, which is what you asked about) is this:

  • The HashMap stores the values you give it in a collection of sub-collections, called buckets. These are actually implemented as linked lists. There are a limited number of these: iirc, 16 to start by default, and the number increases as you put more items into the map. There should always be more buckets than values. To provide one example, using the defaults, if you add 100 entries to a HashMap, there will be 256 buckets.

  • Every value which can be used as a key in a map must be able to generate an integer value, called the hashcode.

  • The HashMap uses this hashcode to select a bucket. Ultimately, this means taking the integer value modulo the number of buckets, but before that, Java's HashMap has an internal method (called hash()), which tweaks the hashcode to reduce some known sources of clumping.

  • When looking up a value, the HashMap selects the bucket, and then searches for the individual element by a linear search of the linked list, using .equals().

So: you don't have to work around collisions for correctness, and you usually don't have to worry about them for performance, and if you're using native Java classes (like String), you don't have to worry about generating the hashcode values either.

In the case where you do have to write your own hashcode method (which means you've written a class with a compound value, like a first name/last name pair), things get slightly more complicated. It's quite possible to get it wrong here, but it's not rocket science. First, know this: the only thing you must do in order to assure correctness is to assure that equal objects yield equal hashcodes. So if you write a hashcode() method for your class, you must also write an equals() method, and you must examine the same values in each.

It is possible to write a hashcode() method which is bad but correct, by which I mean that it would satisfy the "equal objects must yield equal hashcodes" constraint, but still perform very poorly, by having a lot of collisions.

The canonical degenerate worst case of this would be to write a method which simply returns a constant value (e.g., 3) for all cases. This would mean that every value would be hashed into the same bucket.

It would still work, but performance would degrade to that of a linked list.

Obviously, you won't write such a terrible hashcode() method. If you're using a decent IDE, it's capable of generating one for you. Since StackOverflow loves code, here's the code for the firstname/lastname class above.


public class SimpleName {
    private String firstName;
    private String lastName;
    public SimpleName(String firstName, String lastName) {
        super();
        this.firstName = firstName;
        this.lastName = lastName;
    }
    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result
                + ((firstName == null) ? 0 : firstName.hashCode());
        result = prime * result
                + ((lastName == null) ? 0 : lastName.hashCode());
        return result;
    }
    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        SimpleName other = (SimpleName) obj;
        if (firstName == null) {
            if (other.firstName != null)
                return false;
        } else if (!firstName.equals(other.firstName))
            return false;
        if (lastName == null) {
            if (other.lastName != null)
                return false;
        } else if (!lastName.equals(other.lastName))
            return false;
        return true;
    }
}

CPerkins
  • 8,968
  • 3
  • 34
  • 47
  • Does your example assume that each object in the map correctly implements the equals() method which lets the HashMap differentiate between the objects? – Marcus Leon Oct 04 '09 at 14:59
  • Be careful - this could be construed as "write your own version of String.hashCode()" which is probably not a good idea. I know that is not what you mean but I would recommend removing that part of your answer to avoid confusion. – finnw Oct 04 '09 at 15:00
  • @Marcus - correctly? Yes, there must be a "correct" equals method - either the one inherited from Object, which is an identity comparison, or build one yourself. Presence of an *incorrect* .equals method is a Bad Thing. – CPerkins Oct 04 '09 at 15:03
  • 1
    @finnw - thanks, but remember, I'm calling it a degenerate case. In fact, you can't write a version of String.hashCode, since String is final. Still, I don't like to appear to give bad advice. Do you have any suggestions for clarification? – CPerkins Oct 04 '09 at 15:04
  • @Marcus, yes, equals() must be implemented correctly for HashMap to work correctly. That is why tools like FindBugs will generate warnings if you override hashCode without overriding equals (or vice-versa.) – finnw Oct 04 '09 at 15:05
  • Sorry to get into this thread late, finnw could you respond to CPerkins and using string as a key? – Berlin Brown Nov 04 '09 at 23:26
  • 1
    It's worth noting that if one knows the hashcodes of two objects which one wishes to check for equality, it's often useful to check the hash codes for equality and only test objects themselves for equality if they match. The Java hashed collections do this, to avoid time on equality tests for items which have different hashes but ended up in the same bucket. – supercat Feb 07 '14 at 23:41
6

I direct you to the answer here. While it is not a bad idea to use strings( @CPerkins explained why, perfectly), storing the values in a hashmap with integer keys is better, since it is generally quicker (although unnoticeably) and has lower chance (actually, no chance) of collisions.

See this chart of collisions using 216553 keys in each case, (stolen from this post, reformatted for our discussion)

Hash           Lowercase      Random UUID  Numbers 
=============  =============  ===========  ==============
Murmur            145 ns      259 ns          92 ns
                    6 collis    5 collis       0 collis
FNV-1a            152 ns      504 ns          86 ns
                    4 collis    4 collis       0 collis
FNV-1             184 ns      730 ns          92 ns
                    1 collis    5 collis       0 collis*
DBJ2a             158 ns      443 ns          91 ns
                    5 collis    6 collis       0 collis***
DJB2              156 ns      437 ns          93 ns
                    7 collis    6 collis       0 collis***
SDBM              148 ns      484 ns          90 ns
                    4 collis    6 collis       0 collis**
CRC32             250 ns      946 ns         130 ns
                    2 collis    0 collis       0 collis

Avg Time per key    0.8ps       2.5ps         0.44ps
Collisions (%)      0.002%      0.002%         0%

Of course, the number of integers is limited to 2^32, where as there is no limit to the number of strings (and there is no theoretical limit to the amount of keys that can be stored in a HashMap). If you use a long (or even a float), collisions will be inevitable, and therefore no "better" than a string. However, even despite hash collisions, put() and get() will always put/get the correct key-value pair (See edit below).

In the end, it really doesn't matter, so use whatever is more convenient. But if convenience makes no difference, and you do not intend to have more than 2^32 entries, I suggest you use ints as keys.


EDIT

While the above is definitely true, NEVER use "StringKey".hashCode() to generate a key in place of the original String key for performance reasons- 2 different strings can have the same hashCode, causing overwriting on your put() method. Java's implementation of HashMap is smart enough to handle strings (any type of key, actually) with the same hashcode automatically, so it is wise to let Java handle these things for you.

Community
  • 1
  • 1
dberm22
  • 3,187
  • 1
  • 31
  • 46
  • Note that "collision" inevitability has little to do with the key space size of the type. The "hasCode" is actually processed again down to a very small integer. This is because hashmap is an "array list" of "linked lists" internally, and the hash-code is processed to be used as the array index. See the nTableMask section of this article for a good example of another (PHP) implementation: https://nikic.github.io/2012/03/28/Understanding-PHPs-internal-array-implementation.html The performance benefit of integer is instead due to "hashCode" and "equals" function cost differences. – user1122069 Aug 23 '16 at 21:20
5

I strongly suspect that the HashMap.put method does not determine whether the key is the same by just looking at String.hashCode.

There is definitely going to be a chance of a hash collision, so one would expect that the String.equals method will also be called to be sure that the Strings are truly equal, if there is indeed a case where the two Strings have the same value returned from hashCode.

Therefore, the new key String would only be judged to be the same key String as one that is already in the HashMap if and only if the value returned by hashCode is equal, and the equals method returns true.

Also to add, this thought would also be true for classes other than String, as the Object class itself already has the hashCode and equals methods.

Edit

So, to answer the question, no, it would not be a bad idea to use a String for a key to a HashMap.

coobird
  • 159,216
  • 35
  • 211
  • 226
  • Of course equals is also used, but as soon as you have multiple objects within the same bucket, your O(1) is more like O(n)... – Zed Oct 04 '09 at 14:49
  • @Zed: That's why it's important to have an appropriate size for collections which rely on a hash to determine where they physically store their values. I believe the Hash* implementations in Java go for a load factor of 0.75 for their storage. – coobird Oct 04 '09 at 14:52
  • 1
    @Marcus: javadocs does specify that equals is used: http://java.sun.com/javase/6/docs/api/java/util/HashMap.html#get%28java.lang.Object%29 – Zed Oct 04 '09 at 14:52
4

This is not an issue, it's just how hashtables work. It's provably impossible to have distinct hashcodes for all distinct strings, because there are far more distinct strings than integers.

As others have written, hash collisions are resolved via the equals() method. The only problem this can cause is degeneration of the hashtable, leading to bad performance. That's why Java's HashMap has a load factor, a ratio between buckets and inserted elements which, when exceeded, will cause rehashing of the table with twice the number of buckets.

This generally works very well, but only if the hash function is good, i.e. does not result in more than the statistically expected number of collisions for your particular input set. String.hashCode() is good in this regard, but this was not always so. Allegedly, prior to Java 1.2 it only inlcuded every n'th character. This was faster, but caused predictable collisions for all String sharing each n'th character - very bad if you're unluck enough to have such regular input, or if someone want to do a DOS attack on your app.

Michael Borgwardt
  • 342,105
  • 78
  • 482
  • 720
2

You are talking about hash collisions. Hash collisions are an issue regardless of the type being hashCode'd. All classes that use hashCode (e.g. HashMap) handle hash collisions just fine. For example, HashMap can store multiple objects per bucket.

Don't worry about it unless you are calling hashCode yourself. Hash collisions, though rare, don't break anything.

Keith Randall
  • 22,985
  • 2
  • 35
  • 54
  • However the Javadocs for the put method say "If the map previously contained a mapping for this key, the old value is replaced." This seems to contradict the suggestion that HashMap "can store multiple objects per bucket". No? – Marcus Leon Oct 04 '09 at 14:54
  • 2
    @Marcus, No, a hash collision does not always mean a key collision. The value will be replaced only if the strings are exactly equivalent. If the strings have the same hashcode then it will still work. A bucket is just a linked list and can store many distinct strings with the same hash. – finnw Oct 04 '09 at 14:59