C# - String.GetHashCode() -> don't use as unique identifier

Question

I am currently reading in Troelsen's book C# and the .NET 4.5 framework. There is a section in the book where he has an example of overriding

public virtual int GetHashCode(); // Defined in System.Object

He says (the following quotation is from Troelsen's book):

Given that the String class already has a solid hash code algorithm that is using the character data of the String to compute a hash value, if you can identify a piece of field data on your class that should be unique for all instances (such as a Social Security number), simply call GetHashCode() on that point of field data.

Basically what he says is that a certain class has a member (automatic read-only property)

public string SSN {get; }

and every instance of that class is going to have a unique string value. Now, under the assumption that

// s1 and s2 are strings
s1.GetHashCode() != s2.GetHashCode(); // Assumption: If this true then s1 == s2 is true

his reasoning would be valid. However, when I read on String.GetHashCode():

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.

I think you see where I am going with this. I guess it's me who's missing something, if so, please point me in the right direction.

Thanks!

I think he is saying "if you want a hash code for an object you can use the system supplied hash code of a unique string field (aka key)" He is not claiming that hash codes are unique (they never are), he is saying that you dont need to hash over the whole object — pm100, Nov 16 '15 at 23:07
I see, so the purpose of the hash code is NOT to end up with an integer that is unique for given string, i.e. we are not looking for s1 != s2 => s1.GetHashCode() != s2.GetHashCode(). The overriding of GetHashCode is just to leverage the GetHaseCode function of the String class, so that objects of his example class could be used in collection classes using hash tables. Thanks! — jensa, Nov 16 '15 at 23:10
Bad advice, because objects having the same SSN but different values in other members will have the same hash code and cause high amounts of collisions. A hash code really should include all state that determines equality. — usr, Nov 16 '15 at 23:13
Whenever there is a discrepancy between some book and MSDN - MSDN wins. MSDN clearly states that you can get hash collisions and not to rely on a string's hash uniqueness. The book got it wrong. — Sam Axe, Nov 16 '15 at 23:16
@SamAxe I think it was me who interpreted the book a bit wrong. He is indeed NOTclaiming the following (s1, s2 are strings): s1 != s2 => s1.GetHashCode() != s2.GetHashCode() — jensa, Nov 16 '15 at 23:17
@usr The book is assuming that there won't be any different `Person` instances with the same `SSN` values: "if you can identify a piece of field data on your class that should be unique for all instances (such as a Social Security number)". But yeah, if you don't actually define equality in terms of the SSN, your point is a very good one to keep in mind. — , Nov 16 '15 at 23:20
This is off-topic but dangerous enough to point out: "should be unique for all instances (such as a Social Security number)"--Unless the class is about SSNs, that field is unlikely to be non-null and unique. — Tom Blodget, Nov 16 '15 at 23:24

score 7 · Answer 1 · answered Nov 16 '15 at 23:35

The purpose of GetHashCode is not to generate a unique identifier for an object, but to implement data structures that are based on hash tables, such as Dictionary<K, V> or HashSet<T>.

A hash function is required to ensure that if x == y, then x.GetHashCode() == y.GetHashCode(), but the converse is not true: Two distinct objects can have the same hash code. This situation is called a hash collision.

Hash table structures will still work if there are collisions, but they run slower because your program has to spend time disambiguiting which of the collided object you're searching for. So, a good hash function will strive to minimize collisions. (Note that it's a mathematical impossibility to completely avoid collisions if there are more than 2³² possible values for a class, because of the pigeonhole principle.)

So how, then, do you write a good GetHashCode implementation for your class? Do some complicated math to convert each of your class's fields to int, and then profile it to determine optimal values for the coefficients therein?

According to Troelsen, no. Just take your “most unique” string field and call GetHashCode() on that. The developers who wrote System.String.GetHashCode knew what they were doing, so just use it, and you'll automatically take advantage of their “solid hash code algorithm”.

So using a `HashSet` reduces memory for large collection of strings more than a `string[]` for example, right? — FindOutIslamNow, Sep 05 '18 at 14:37
@FindOutIslamNow: No, it does not reduce memory usage. The advantage of a hash table over a simple array is speed: The average case for searching for an item in a properly-implemented hash table is O(1), compared to O(log n) for a sorted array or binary tree, or O(n) for an unsorted array. — dan04, Sep 05 '18 at 15:10

score 6 · Accepted Answer · answered Nov 16 '15 at 23:10

You're right that equal hash codes don't guarantee equal values.

You're wrong in thinking that that quote means otherwise.

This quote is specifically in the context of implementing a hash code calculation for a Person class containing an SSN property. Equal SSN values mean equal Person values. Different SSN values mean different Person values. (Note: this is not necessarily true in reality.)

Now, you need a hash code calculation for Person which guarantees that two equal Person instances have the same hash code, and ideally which makes it likely that two unequal Person instances have a different hash code, although the latter can never be guaranteed. Since equality is defined in terms of the SSN, that means re-using the SSN's hash code achieves that already.

score 6 · Answer 3 · edited May 23 '17 at 11:51

6

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.

There are an infinite number of strings, but only 2^32 possible hash code values. It is inevitable that two different strings will end up having the same hash value. This happens more often than one might think, due to the birthday problem.

don't use as unique identifier

That is good advice. Debug builds of .NET actually vary the hash code periodically to help catch this type of problem, and different framework versions provide no guarantee of generating the same hash. For more see this.

Have a look at Eric Lippert's blog on the topic.

edited May 23 '17 at 11:51

Community

1
1

answered Nov 16 '15 at 23:11

Eric J.

147,927
63
340
553

1

Thanks for the link; this one is also relevant: http://ericlippert.com/2010/03/22/socks-birthdays-and-hash-collisions/ – Eric Lippert Nov 19 '15 at 16:29
Interesting way to explain it... matching socks as a 2-bit hash algorithm. – Eric J. Nov 19 '15 at 18:01
1

Side note: When I worked at a company that created device fingerprints, we used a 64-bit hash (City Hash). With tens of millions of hashes, we had a few hash collisions in that 64-bit hash space. – Eric J. Nov 19 '15 at 18:05

C# - String.GetHashCode() -> don't use as unique identifier

3 Answers3

Linked