GetHashCode doesn't guarantee uniqueness of value/number it returns

Question

I'm developing a simple 2D environment and each object drawn e.g. line, rectangle and ... gets a unique id by calling GetHashCode()

Now, I noticed on MSDN page it doesn't guarantee its result would be unique:

The default implementation of the GetHashCode method does not guarantee unique return values for different objects. Furthermore, the .NET Framework does not guarantee the default implementation of the GetHashCode method, and the value it returns will be the same between different versions of the .NET Framework. Consequently, the default implementation of this method must not be used as a unique object identifier for hashing purposes.

Now, question is what other options do exist beside GetHashCode() method?

Thanks, Amit

Related thread : http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode — KV Prajapati, Sep 29 '12 at 01:25

score 3 · Answer 1 · answered Sep 29 '12 at 01:36

Perhaps it would be best to move away from hash codes altogether? GetHashCode is nice for a quick and easy fix, but if you need real IDs for objects, then you should create real IDs. Something like a 32/64 bit auto-incrementing integer would likely be plenty.

While the collision rate of a Hash code is tied to the length of the hash, you are still not guaranteed to reach the maximum number of unique hashes possible before getting a collision. If you manage the IDs yourself you can plan ahead to have enough IDs available.

Also - your comment about the GetHashCode() different between versions of the framework. I can only imagine this would matter in your situation if you were persisting the hashes to some sort of save file, and then trying to re-load them only to find out they don't match the hash of the running program because they were saved by a different version of the framework. If this is the case, I would suggest even more that you create and manage IDs on the objects yourself.

paparazzo · Answer 2 · 2012-09-29T21:28:37.687

You will need to generate you own unique ID

Some times can derive a unique ID from object properties if your object has a natural key.
If the object does not have a natural key then must generate a unique ID and you would typically pass the unique ID to the object in the constructor.

GetHashCode is poor unique ID as it is not guaranteed to be unique.
Internally .NET does not use GetHashCode for uniqueness.
Internally .NET uses GetHashCode to speed equality comparison and for HashBuckets.

If you are going to generate your own unique ID then you should override GetHashCode and Equals.
That way .NET can use your unique identifier for equality comparison.

.NET GetHashCode() is not required nor guaranteed to be unique.
.NET GetHashCode() is not just limited to Int32.
.NET GetHashCode() is Int32.

If the GetHashCode are not equal then two objects are not equal.
If GetHashCode is equal then two objects may or may not be equal. Equals is the tie breaker.
For speed first GetHashCode is compared. GetHashCode is also use for hashbuckets for speed of collections like HashSet and Dictionary.

If a hash is unique then it is considered a perfect hash.

Classic example

class Point: object 
{
   protected int x, y;

   public Point(int xValue, int yValue)
   {
        x = xValue;
        y = yValue;
   }
   public override bool Equals(Object obj) 
   {
      // Check for null values and compare run-time types.
      if (obj == null || GetType() != obj.GetType()) 
         return false;

      Point p = (Point)obj;
      return (x == p.x) && (y == p.y);
   }
   public override int GetHashCode() 
   {
      return x ^ y;
   }
}

Since Point has Int32 X Int32 possible values then obviously it cannot be uniquely identified with a single Int32. Still GetHashCode is of value and required. There is only 1/Int32 chance the more expensive Equals will be required and the GetHashCode is used for hash buckets.

Consider simple point

class Point: object 
{
   protected byte x, y;

   public Point(byte xValue, byte yValue)
   {
        x = xValue;
        y = yValue;
   }
   public override bool Equals(Object obj) 
   {
      // Check for null values and compare run-time types.
      if (obj == null || GetType() != obj.GetType()) 
         return false;

      Point p = (Point)obj;
      return (x == p.x) && (y == p.y);
   }
   public override int GetHashCode() 
   {
      return (x * 256) + y;
   }
}

In this simple point GetHashCode will uniquely identify the object. You cannot override one of the other. Must override neither or both.

Do you think this would be better than a simple factory design pattern that gives out a simple ID to each object as it is created? — Origin, Sep 30 '12 at 06:23
@Origin Second paragraph "If the object does not have a natural key then must generate a unique ID". This was just an example of using a natural key. If the object has a natural key consider P1 = new Point(1,2) P2 = new Point(1,2). Do you want P1.Equals(P2) to return true or false? If the natural key is already there then deriving the ID from the natural key doe not use up a variable. If the natural key is used for GetHashCode then then those values should not be changed. Agree a factory is the most common pattern. And most appropriate for most situations. — paparazzo, Sep 30 '12 at 18:33

score 2 · Answer 3 · answered Sep 29 '12 at 01:39

2

It depends on what you are using the unique Id for. It sounds like you are using to identify object instances, which might mean that Hash Codes are not what you want.

If two objects are .Equals() of each other, they are supposed to have the same hash code, but as you discovered, the reverse is not true (having the same hash code doesn't mean they are .Equals()).

What do you need the unique Id's for? If you aren't using hash codes to put the objects in a lookup you might be better off assigning them a unique id like a Guid (var uniqueId = Guid.NewGuid()).

answered Sep 29 '12 at 01:39

Giscard Biamby

4,569
1
22
24

1

GUID is still limited range, and shorter than SHA-1. – linquize Sep 29 '12 at 02:15
You're right. I'm always fooled by the name. An auto incrementing int might be better. – Giscard Biamby Sep 29 '12 at 02:32

score 1 · Answer 4 · answered Sep 29 '12 at 01:26

1

No hash function guarantees the uniqueness of value returned.

It depends how small the probability of collision it can be.

GetHashCode() returns a 32-bit integer, which may not be enough to assume uniqueness. Consider other algorithm such as SHA-1, SHA-2, which the length of hash is longer, probability of collision is much lower than a 32-bit integer.

answered Sep 29 '12 at 01:26

linquize

19,828
10
59
83

1

+1 for pointing out that hashing is not guaranteed to return unique values, but -1 for suggesting other hash routines which have the same issue. Hashing is the wrong answer for unique ids here. – Will Sep 29 '12 at 01:42
@Will: I disagree. Given a long enough hash you can pretty much treat the hash as a unique identifier for the input. Take a 256bit SHA-2 Hash (still pretty modest in length): If you generate 10^9 hashes per second for the next 14 billion years the probability of getting **one** collision in there is about 10^-20. Meaning the chance of that occuring is zero for all practical applications. Afterall one has consider that there is always a possibility of errors (e.g. due to radiation,...), which is much more likely then a hash collision. – Grizzly Sep 29 '12 at 02:28
@Grizzly Pretty much? Overrite Equals an take you own changes. – paparazzo Sep 29 '12 at 02:32
@Blam: What are you taking about? – Grizzly Sep 29 '12 at 02:33
@Grizzly why would you want to use a *32-byte* identifier that has an infinitesimal probability of collision, when you could use a 4-byte identifier (or maybe even 1 or 2 bytes) that has *zero* probability of collision? Zero is better than infinitesimal. – phoog Sep 29 '12 at 06:10
@phoog: Looking at my comment I should have made myself clearer. It isn't that I think that hashing is a good solution in this specific case. What I disagreed on was the part about other hashes still having collision problems, since that is obviously a non issue for practical applications. Though the reasons for prefering other methods are more of a performance matter then one of probability (there is always an infinitesimal risk, since computational errors aren't impossible and probably more likely then a collision). – Grizzly Sep 29 '12 at 10:53
@Blam: Your point being? Even a significant chance of collision would be perfectly fine for a conforming `Equals` method (if it's fine for the use case depends on the scenario obviously). – Grizzly Sep 29 '12 at 10:55
@Grizzly But use case is .NET. That is how .NET uses GetHashCode and this is tagged .NET. – paparazzo Sep 29 '12 at 13:15
@Blam: .Net is not the use case, its the plattform the software is developed on. The use case for the `Equals` method is the behaviour one needs it to exhibit, which obviously depends on the class and the usage of the class one is working on. The only requirements .Net puts on the `Equals` method is that it is consistant. A `Equals` method, which always returns true would be perfectly valid for example, as is one which has the theoretical possibility of false positives. The only connection to `GetHashcode` is that the Hash must be identical, if the objects compare identical. – Grizzly Sep 29 '12 at 14:27
@Grizzly What you call use case I call implementation. Equals should return true if equals and false if not. If equals = true then GetHashCode must be the same. GetHashCode is used for HashBuckets. That is what I call a use case. I do not get to define how .NET uses those values. That is a use case defined by .NET. How the object generates those values I call an implementation. Yes .NET is a platform. – paparazzo Sep 29 '12 at 17:58

GetHashCode doesn't guarantee uniqueness of value/number it returns

4 Answers4