4

i want to cache some search results based on the object to search and some search settings.

However: this creates quite a long cache key, and i thought i'd create a shortcut for it, and i thought i'd use GetHashCode() for it.

So i was wondering, does GetHashCode() always generate a different number, even when i have very long strings or differ only by this: 'ä' in stead of 'a'

I tried some strings and it seemed the answer is yes, but not understanding the GetHashCode() behaviour doesn't give me the true feeling i am right.

And because it is one of those things which will pop up when you're not prepared (the client is looking at cached results for the wrong search) i want to be sure...

EDIT: if MD5 would work, i can change my code not to use the GetHashCode ofcourse, the goals is to get a short(er) string than the original (> 1000 chars)

Michel
  • 23,085
  • 46
  • 152
  • 242

5 Answers5

9

You CANNOT count on GetHashCode() being unique.

There is an excellent article which investigates the likelihood of collisions available at http://kenneththorman.blogspot.com/2010/09/c-net-equals-and-gethashcode.html . The findings were that "The smallest number of calls to GetHashCode() to return the same hashcode for a different string was after 565 iterations and the highest number of iterations before getting a hashcode collision was 296390 iterations. "

So that you can understand the contract for GetHashCode implementations, the following is an excerpt from MSDN documentation for Object.GetHashCode():

A hash function must have the following properties:

  • If two objects compare as equal, the GetHashCode method for each object must return the same value. However, if two objects do not compare as equal, the GetHashCode methods for the two object do not have to return different values.

  • The GetHashCode method for an object must consistently return the same hash code as long as there is no modification to the object state that determines the return value of the object's Equals method. Note that this is true only for the current execution of an application, and that a different hash code can be returned if the application is run again.

  • For the best performance, a hash function must generate a random distribution for all input.

Eric Lippert of the C# compiler team explains the rationale for the GetHashCode implementation rules on his blog at http://ericlippert.com/2011/02/28/guidelines-and-rules-for-gethashcode/ .

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
smartcaveman
  • 41,281
  • 29
  • 127
  • 212
  • But you can override GetHashCode and hash string with MD5 – opewix Sep 11 '12 at 09:41
  • hehe, that is a clear answer :-) (not what i want to hear though :-)) – Michel Sep 11 '12 at 09:41
  • @JesseJames using the MD5 hash will provide me the result i want? – Michel Sep 11 '12 at 09:42
  • 4
    @JesseJames and how do you return the 128 bit (64 byte) size of the MD5 hash through the `Int32` return value of `GetHashCode`. Apart from the fact that MD5 may not be unique as well, and calculating it could be worse than simply comparing the string itself. – Christian.K Sep 11 '12 at 09:43
  • @Christian.K, That's a question that will probably be helpful for a lot of other users to get clarification. Maybe you want to start a new question and link from here. That way it will be easier to find in search – smartcaveman Sep 11 '12 at 09:44
  • You will minimize collisions, but no unique keys are guaranteed until you will store original string. Read about MD5 or SHA-2 on wiki – opewix Sep 11 '12 at 09:44
  • @smartcaveman Actually that was a rhetorical question you cannot do it :-) – Christian.K Sep 11 '12 at 09:45
  • @Christian.K, I understand that. It doesn't make it less useful to know. – smartcaveman Sep 11 '12 at 09:50
8

Logically GetHashCode cannot be unique since there are only 2^32 ints and an infinite number of strings (see the pigeon hole principle).


As @Henk pointed out in the comment even though there are an infinite number of strings there are a finite number of System.Strings. However the pigeon hole principle still stands as the later is much larger than int.MaxValue.

Community
  • 1
  • 1
Motti
  • 110,860
  • 49
  • 189
  • 262
  • 2
    There is a limit on string-length so there is no inifinite number of strings either. But the number is a few order of magnitudes bigger than 2^32, yes. – H H Sep 11 '12 at 09:46
2

If one were store the hash code of each string along with the string itself, one could compare the hashcodes of strings as a "first step" to comparing them for equality. If two strings have different hashcodes, they're not equal, and one needn't bother doing anything else. If one expects to be comparing many pairs of strings which are of the same length, and which are "almost" but not quite equal, checking the hashcodes before checking the content may be a useful performance optimization. Note that this "optimization" would not be worthwhile if one did not have cached hashcodes, since computing the hashcodes of two strings would almost certainly be slower than comparing them. If, however, one has had to compute and cache the hashcodes for some other purpose, checking hash codes as a first step to comparing strings may be useful.

supercat
  • 77,689
  • 9
  • 166
  • 211
  • +1 for *computing the hashcodes of two strings would almost certainly be slower than comparing them.*, not many realise it. Can I edit that into bold? – nawfal Dec 15 '13 at 17:03
1

You always risk collisions when using GetHashCode() because you are operating within a limited number space, Int32, and this will also be exacerbated by the fact that hashing algorithms will not perfectly distribute within this space.

If you look at the implementation of HashTable or Dictionary you will see that GetHashCode is used to assign the keys into buckets to cut down the number of comparisons required, however, the equality comparisons are still necessary if there are multiple items in the same bucket.

Slugart
  • 4,535
  • 24
  • 32
0

No. GetHasCode just provides a hash code. There will be collisions. Having different hashes means the strings are different, but having the same hash does not mean the strings are the same.

Read these guidlelines by Eric Lippert for correct use of GetHashCode, they are quite instructing.

If you want to compare strings, just do so! stringA == stringB works fine. If you want to ensure a string is unique in a large set, using the power of hash code to do so, use a HashSet<string>.

Falanwe
  • 4,636
  • 22
  • 37
  • No you cannot just simply compare them because LINQ querying operations insist on calling getHascode first before equals. – shawn1874 Dec 18 '14 at 19:31
  • @shawn1874 : And your point is? Yes, some LINQ implementations call GetHashCode before calling equals, but that's not a problem as long as the 2 are compatible, and they should be for any custom type you create, and they certainly are for `string`. – Falanwe Dec 22 '14 at 11:47