2

Could anybody help me to understand, how the GetHashCode() method works for string values?

From MSDN I found:

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.

So, different strings can return the same hash code, hashcode is not unique for strings. Can it lead to bugs in the core of a program?

Warlock
  • 7,321
  • 10
  • 55
  • 75
  • I've written an answer in another post which should hopefully clarify the issue for you: https://stackoverflow.com/a/35217051/4880924 – BenKoshy Oct 06 '18 at 08:06

5 Answers5

3

It can lead to bugs if you assume that a matching hash code means a matching string. Typically, you use the hash code to sort strings into buckets for quick searching and selection. If you detect that two strings have a matching hash code, you would then typically compare the strings themselves for equality.

If that doesn't answer your question, then I don't understand the question.

David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • 1
    Thank you for your answer, David. As I understand, hashcode is required to identify, that objects are different, but since it is possible, that two objects have the same hashcode value, does it mean, that something wrong could be happened in the future? – Warlock Sep 17 '12 at 18:31
  • 1
    It's not required. It's good practice to override the hash values if you override the equality function as you would usually expect equal objects to return the same hash value. Also some things rely on this, like hash maps (Dictionary in .NET) – Mene Sep 17 '12 at 18:34
  • 2
    @Warlock: I don't understand the question. Two strings having the same hash code even if they're not identical does not indicate something wrong. It's like two people having the same initials but different names. – David Schwartz Sep 17 '12 at 18:42
  • Ok, David, you are right. I believe that now things are in their right places. Really hash code is used to speed up searching of objects. I missed this detail and it became as a misunderstanding. – Warlock Sep 17 '12 at 19:03
3

Well, this can lead to bugs, if your algorithms rely on each string to have a unique hash-value.

For example a hash map (Dictionary in .NET) might fail on collisions (i.e. two objects with the same hash that are not equal), or it doesn't if it handles collisions, that depends on the exact implementation. Fail in that case means: If you add a new object to the map and there is already an object in the map that has the same hash value as the new object, then the new object will override the old one instead of just being added. As far as I know the Dictionary class in .NET can handle collisions though.

If you need more concrete advice you'll need to ask a more concrete question: what are you trying to archive, how are you planning to do it etc.

As a side-note: hash values for strings are usually not unique as the size of a hash value is limited, whereas the length of a string is not. Think of it like this: Say the hash function is MD5 (it's not the default in .Net) and you have a string both consisting of hex-dec chars (0-9A-Z) and the string is 200 characters long: there are 200^16 possible values for the string, but only 32^16 possible values for its hash-value.

Mene
  • 3,739
  • 21
  • 40
  • 1
    Thank you, Mene. The question is not concrete, because I have not really faced with this problem, so it's just a theoretical problem that was interesting for me. Usually nobody cares about this issue. – Warlock Sep 18 '12 at 08:45
2

So, different strings can return the same hash code, hashcode is not unique for strings. Can it lead to bugs in the core of a program?

It should not lead to bugs, provided the hash values are used as intended. Hashes returned by GetHashCode() are not intended to provide unique hashes - this would be impossible, as there are only about 4 billion possible hash codes (as that method returns an Int32), but an infinite number of possible strings.

Hashes are meant to provide few collections, not no collisions. As such, you should never assume that a hash is a unique representation based on value. The only guarantee you get is that two different hash codes for two different strings mean that the strings are not equal, as two equal values should always produce the same hash. Two hash codes that are equal, however, do not necessarily mean the two string values are equal.

Reed Copsey
  • 554,122
  • 78
  • 1,158
  • 1,373
2

Hashcode is used to speed up searching of objects in Hash collections. Internally they store objects in many buckets. Objects being held are divided into buckets based on their hashcode. So when you call for example

var value = Dictionary["someKey"]

dictionary instead of searching in all internal buckets goes directly to the bucket that should contain value under that key. And dictionary searches only in that bucket.

Maybe this isn't exactly how it's implemented but it should be it more or less. So in this case it doesn't matter that different keys in dictionary have same hashcode. That only means that values under that keys will end up in same bucket.

Piotr Perak
  • 10,718
  • 9
  • 49
  • 86
  • 1
    Thank you for your answer, Peri. It is very helpful for understanding internal processes in .Net. I would like to know it deeper and I will be very excited if you provide references on articles or books :) – Warlock Sep 17 '12 at 18:47
2

The documentation is pretty accurate on the guarantees the method makes, actually. The hash code just follows the following two rules (a == b refers to a.Equals(b), #a refers to a.GetHashCode() for readability reasons):

  • If a == b Then #a == #b
  • If #a != #b Then a != b

Note that this is not an equivalence between Equals and a matching hash. If you rely on more than that, well, then your code has a bug, obviously. GetHashCode is intended for using objects as keys in dictionaries so that there is a quick mapping from objects to a number, but it doesn't need to be reversible. If you look at strings you can easily see that the number of possible strings quickly surpasses the number of possible hash codes, so you could have answered that question on your own. You're beyond 232 possible strings at a little more than two characters already.

Joey
  • 344,408
  • 85
  • 689
  • 683