5

I have short Strings (less than 10 characters). I will convert it into int and use it as primary key. (I can't use String primary key because of small problems.) I know the hashcodes of Strings of infinite length can collide, but can short Strings collide too?

Ypnypn
  • 933
  • 1
  • 8
  • 21
Cinakyn
  • 648
  • 1
  • 7
  • 20
  • http://stackoverflow.com/questions/10102337/can-javas-hashcode-produce-same-value-for-different-strings – user2864740 Sep 12 '14 at 01:17
  • See http://www.drmaciver.com/2008/07/javalangstringhashcode/ - *lots* of collisions from Scrabble words. Some statistics: "There are 3844 alphanumeric strings of size 2. Of these 3570 collide with at least one other string. That is, [only] 274 of these strings (or about 7% of them) *don’t* collide with something else. Oh well. *It’s a good thing no one would be stupid enough to rely on hashCode to distinguish the contents of two objects.*" – user2864740 Sep 12 '14 at 01:20

2 Answers2

13

Absolutely yes. For example, Ea and FB are colliding strings, each only two characters in length! Example:

public static final void main(String[] args) {
    System.out.println("Ea".hashCode() + " " + "FB".hashCode());
}

Prints 2236 2236.


The Java String#hashCode function isn't really even close to random. It's really easy to generate collisions for short strings, and it doesn't fare much better on long strings.

In general, even if you stuck to just 6 bits per character (ASCII letters and numbers, and a couple of symbols), you'd exceed the possible values of the 32-bit hash code with only a 6-character string -- that is, you would absolutely guarantee collisions among the 2^36 6-character 6-bit strings.

nneonneo
  • 171,345
  • 36
  • 312
  • 383
  • +1 for a direct [counter-]example. It would be interesting to see a string/hash collision graph, which probably does exist already.. – user2864740 Sep 12 '14 at 01:12
  • And in fact, this can be used to create a denial-of-service attack that turns HashMap's O(1) behavior into O(N): http://stackoverflow.com/questions/8669946 – yshavit Sep 12 '14 at 02:55
  • @yshavit problem solved with Java 8, released a few months before your comment ;^) – Holger Dec 13 '19 at 10:20
  • @Holger What made you pop on this question all of a sudden? :-P – yshavit Dec 13 '19 at 17:28
  • @yshavit searched for some other, string hashing related stuff and this was within the search results – Holger Dec 14 '19 at 15:19
5

A hash code is 32 bits in size.

A char in Java is 16 bits in size.

So in theory, all 2-character strings could have different hash codes, although some of those hash codes would have to collide with the hash code of the empty string and single-character strings. So even taking "all strings of two characters or shorter" there will be collisions. By the time you've got 10 characters, there are way more possible strings than there are hash codes available.

Collisions are still likely to be rare, but you should always assume that they can happen.

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • 1
    +1 for pointing out that its not (necessarily) simply java having a stupid hash function, but a fundamental issue. – Michael Anderson Sep 12 '14 at 01:20
  • 1
    Also probably worth noting that "rare" doesn't mean that rare. Using the birthday paradox, a collision wouldn't be unusual with a collection of 16k items, even with a well distributed hash. (See http://en.wikipedia.org/wiki/Birthday_attack for accurate numbers.) – Michael Anderson Sep 12 '14 at 01:24
  • @MichaelAnderson: I guess it depends on what you mean by rare. The chances of there being *at least one collision* may be pretty high, but the chances of any specific pair colliding are fairly small. Regardless, I'm saying that you've got to account for it anyway, so it doesn't matter too much ;) – Jon Skeet Sep 12 '14 at 01:39
  • @Holger I think you're missing the point I was trying to make. The other answers show that the distribution of Javas hash values for strings are bad for simple strings. This answer points out that even with a perfect hash distribution, we should expect collisions. – Michael Anderson Dec 15 '19 at 23:54