I want to know the probability of getting duplicate values when calling the `GetHashCode()` method on `string` instances. For instance, according to this blog post, `blair` and `brainlessness` have the same hash code (1758039503) on an x86 machine.
-
Are you wanting to avoid collisions? It's fairly uncommon, but not impossible of course. – James Michael Hare Nov 01 '11 at 15:29
-
Yes, I wanted to know the probability of getting collisions. – Diego Nov 01 '11 at 15:34
-
The question is under-specified. First off, there are an infinite number of possible strings. You need to say what *probability distribution* you are using over the space of all possible strings. Second, you don't say what size of set you are interested in. Suppose you have n strings. Are you interested in the probability of hash collision of those n strings with *a particular string*, or are you interested in whether *any* string in the set collides with *any other*? Those are *very* different analyses. – Eric Lippert Nov 01 '11 at 15:55
-
You are right, I'm interested in the possibility of collisions between any pair of strings. But considering the answers, now I know that I can't rely on GetHashCode() to generate a unique value for a given string. – Diego Nov 01 '11 at 16:00
-
Oh, absolutely you cannot rely on *uniqueness*. There are only four billion possible hash values but obviously there are way more than four billion possible strings; the strings "0", "1", ... "5000000000" alone are five billion strings. You can't put an infinite number of pigeons into four billion pigeonholes without one of those pigeonholes having more than one pigeon! – Eric Lippert Nov 01 '11 at 16:10
-
That's what I wanted to know. We were using GetHashCode() as a way to get unique hashes for a set of strings. Thanks! – Diego Nov 01 '11 at 19:00
-
@Diego: Pro-Tip: Do not use `GetHashCode` of any class for anything except hash tables. – Brian Nov 02 '11 at 14:01
-
@Brian: Thanks! Now I know that :) – Diego Nov 02 '11 at 16:39
-
Lots of answers about hypothetical probabilities, but no one is mentioning how GetHashCode() is implemented which I find odd. e.g. see http://stackoverflow.com/questions/15174477/how-is-gethashcode-of-c-sharp-string-implemented. – Martin Capodici May 04 '16 at 05:46
6 Answers
Large.
(Sorry Jon!)
The probability of getting a hash collision among short strings is extremely large. Given a set of only ten thousand distinct short strings drawn from common words, the probability of there being at least one collision in the set is approximately 1%. If you have eighty thousand strings, the probability of there being at least one collision is over 50%.
For a graph showing the relationship between set size and probability of collision, see my article on the subject:
https://learn.microsoft.com/en-us/archive/blogs/ericlippert/socks-birthdays-and-hash-collisions
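Eric's figures can be sanity-checked with the standard birthday-problem approximation, P(at least one collision) ≈ 1 − exp(−m(m−1)/2n), for m distinct strings hashed into n = 2^32 buckets. A quick sketch (Python here, since the arithmetic is language-independent and nothing about it depends on .NET's actual hash function):

```python
import math

def collision_probability(m, n=2**32):
    """Birthday-problem approximation: probability that at least one
    pair collides when m distinct items hash uniformly into n buckets."""
    return 1 - math.exp(-m * (m - 1) / (2 * n))

# Roughly 1% for ten thousand strings...
print(collision_probability(10_000))   # ~0.0116
# ...and over 50% for eighty thousand.
print(collision_probability(80_000))   # ~0.525
```

The approximation assumes the hash output is uniformly distributed, which a real string hash only approximates.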

-
+1 to both :) - you and Jon have answers to 2 different probabilities - Jon: "narrowing down the possible matches for a particular string" - low prob. of collision, Eric: "at least one collision in the set" - high. – Alexei Levenkov Nov 01 '11 at 16:57
-
I'm with Alexei: it depends on how you interpret the question. I absolutely agree with everything Eric's said, but stand by my (on the surface entirely contrary) answer too. Will edit to clarify, of course. – Jon Skeet Nov 01 '11 at 17:11
-
I'll choose this as the accepted answer since it best suits the scenario I have. Thanks @JonSkeet and Jeremy McGee for your answers. They helped too. – Diego Nov 01 '11 at 19:06
Small - if you're talking about the chance of any two arbitrary unequal strings having a collision. (It will depend on just how "arbitrary" the strings are, of course - different contexts will be using different strings.)
Large - if you're talking about the chance of there being at least one collision in a large pool of arbitrary strings. The small individual probabilities are no match for the birthday problem.
That's about all you need to know. There are definitely cases where there will be collisions - and there have to be, given that there are only 2^32 possible hash codes and far more strings than that, so the pigeonhole principle proves that at least one hash code must be shared by more than one string. However, you should trust that the hash has been designed to be pretty reasonable.
You can rely on it as a pretty good way of narrowing down the possible matches for a particular string. It would be an unusual set of naturally-occurring strings which generated a lot of collisions - and even when there are some collisions, obviously if you can narrow a candidate search set down from 50K to fewer than 10 strings, that's a pretty big win. But you must not rely on it as a unique value for any string.
Note that the algorithm used in .NET 4 differs between x86 and x64, so that example probably isn't valid on both platforms.
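The "narrowing down" pattern described above can be sketched as follows. Python and `zlib.crc32` are used purely as a stand-in 32-bit string hash - this is *not* .NET's `string.GetHashCode()`, just the same bucketing idea:

```python
import zlib
from collections import defaultdict

# zlib.crc32 stands in for a 32-bit string hash; it is NOT the
# .NET algorithm, only an illustration of the pattern.
def build_index(strings):
    """Bucket strings by their 32-bit hash value."""
    index = defaultdict(list)
    for s in strings:
        index[zlib.crc32(s.encode())].append(s)
    return index

def find(index, candidate):
    """The hash narrows the search to one small bucket; a real string
    comparison then confirms (or rejects) each survivor."""
    return [s for s in index[zlib.crc32(candidate.encode())] if s == candidate]

corpus = ["blair", "brainlessness", "hello", "world"]
index = build_index(corpus)
print(find(index, "hello"))   # ['hello']
print(find(index, "absent"))  # []
```

Even when a bucket holds several colliding strings, the final equality check keeps the result correct - the hash only shrinks the candidate set.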

-
I tried the example I provided in an x64 machine and the hashes are different, I will edit the question, thanks for pointing that. – Diego Nov 01 '11 at 15:54
-
That's better. Fence-sit in the face of imprecise specifications ;) – ShuggyCoUk Nov 01 '11 at 17:40
-
@ShuggyCoUk: I'm trying to *straddle* the fence rather than sitting on top of it :) – Jon Skeet Nov 01 '11 at 17:41
-
@Jon, watch out. If you get that wrong you can have a splittting...headache – ShuggyCoUk Nov 01 '11 at 18:11
I think all that's possible to say is "small, but finite and definitely not zero" -- in other words you must not rely on `GetHashCode()` ever returning unique values for two different instances.
To my mind, hashcodes are best used when you want to tell quickly if two instances are different -- not if they're the same.
In other words, if two objects have different hash codes, you know they are different and need not do a (possibly expensive) deeper comparison.
However, if the hash codes for two objects are the same, you must go on to compare the objects themselves to see if they're actually the same.
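That cheap-check-first pattern looks roughly like this (a Python sketch using the built-in `hash()`; any language's hash code works the same way):

```python
def equal_with_hash_shortcut(a, b):
    """Hash codes rule things out fast; only equal hashes force
    the (possibly expensive) full comparison."""
    if hash(a) != hash(b):
        return False   # different hashes => definitely different objects
    return a == b      # same hash => must compare for real

print(equal_with_hash_shortcut("blair", "blair"))          # True
print(equal_with_hash_shortcut("blair", "brainlessness"))  # False
```

This is exactly what hash tables do internally: the hash picks the bucket, and equality decides within it.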

I ran a test on a database of 466k English words and got 48 collisions with `string.GetHashCode()`. MurmurHash gives slightly better results. More results are here: https://github.com/jitbit/MurmurHash.net
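A collision count like that can be reproduced for any hash function: it is simply the number of inputs minus the number of distinct hash outputs. A small sketch (Python, with `zlib.crc32` as a stand-in 32-bit string hash; the length-based hash in the example is deliberately weak, just to force a collision):

```python
import zlib

def count_collisions(strings, hash_fn=lambda s: zlib.crc32(s.encode())):
    """Number of strings whose hash value is shared with an earlier
    string: total inputs minus distinct hash outputs."""
    strings = list(strings)
    return len(strings) - len({hash_fn(s) for s in strings})

# A deliberately tiny hash (string length mod 8) guarantees collisions:
print(count_collisions(["cat", "dog", "bird", "horse"],
                       hash_fn=lambda s: len(s) % 8))  # 1 ("cat"/"dog")
```

For 466k words in a 2^32-value space, the birthday approximation predicts a few dozen collisions, so 48 is in the expected ballpark.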

Just in case your question is meant to be about the probability of a collision in a group of strings:
For n available slots and m occupying items:
Prob. of no collision on first insertion is 1.
Prob. of no collision on 2nd insertion is ( n - 1 ) / n
Prob. of no collision on 3rd insertion is ( n - 2 ) / n
Prob. of no collision on mth insertion is ( n - ( m - 1 ) ) / n
The probability of no collision after m insertions is the product of the above values: (n - 1)!/((n - m)! * n^(m - 1)).
which can also be written as ( n permute m ) / ( n^m ), i.e. n! / ( ( n - m )! * n^m ).
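The product above is straightforward to evaluate directly, and for n = 2^32 it agrees with the roughly-1%-at-ten-thousand figure quoted elsewhere on this page (a Python sketch of the arithmetic):

```python
def no_collision_probability(m, n=2**32):
    """Exact probability of zero collisions after m insertions into n
    slots: the product (n-0)/n * (n-1)/n * ... * (n-(m-1))/n."""
    p = 1.0
    for i in range(m):
        p *= (n - i) / n
    return p

# Chance of at least one collision among ten thousand strings:
print(1 - no_collision_probability(10_000))  # ~0.0116
```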
And everybody is right: you can't assume zero collisions, so saying the probability is "low" may be true but doesn't let you assume there will be none. If you're looking at a hashtable, the usual rule of thumb is that you begin to have trouble with significant collisions when your hashtable is about two-thirds full.

The probability of a collision between two randomly chosen strings is 1 / 2^(bits in hash code), assuming a perfect hash - which is unlikely, or impossible, in practice.

-
Define "random" in the context of strings. I would contend that in real life, strings of fewer than 10 characters occur more often than strings of more than ten million characters. – Jon Skeet Nov 01 '11 at 15:55
-
@JonSkeet, fair enough, I've qualified that it's two randomly chosen strings, rather than truly random strings. – Sam Skuce Nov 01 '11 at 16:15
-
But now that means the hash has to be designed knowing what kinds of strings will be involved. I suspect that a line-of-business app and a DNA-profiling app will have very different sets of strings. I'm just trying to highlight that designing a hash to *really* have uniform distribution for every real life use case is extremely hard or impossible. Having a "pretty good" distribution is probably good enough for most people. – Jon Skeet Nov 01 '11 at 16:20