5

Before I begin, all code samples here I tested on Mono environment and there is one noticeable difference in the GetHashCode implementations:

string.Empty.GetHashCode(); // returns 0 in Mono 3.10
string.Empty.GetHashCode(); // returns 757602046 in .NET 4.5.1

I made my implementation based on this SO Answer by @JonSkeet and in the comments he also suggests to use 0 hash code for NULL values (wasn't sure how should I hash them).

I usually use 0 as the effective hash code for null - which isn't the same as ignoring the field.

So having following implementation (Mono 3.10):

public class Entity {
    public int EntityID { get; set; }
    public string EntityName { get; set; }

    public override int GetHashCode() {
        unchecked {
            int hash = 15485863;       // prime number
            int multiplier = 1299709;  // another prime number

            hash = hash * multiplier + EntityID.GetHashCode();
            hash = hash * multiplier + (EntityName != null ? EntityName.GetHashCode() : 0);

            return hash;
        }
    }
}

It is quite easy to find collision e.g.

var hash1 = new Entity { EntityID = 1337, EntityName = "" }.GetHashCode();
var hash2 = new Entity { EntityID = 1337, EntityName = null }.GetHashCode();

bool equals = hash1 == hash2; // true

I could replace null-value 0 with some other number, however it won't fix the problem as there still is a chance that some hash(string) output will generate such number and I'll get another collision.

My question: How should I handle null values while using algorithm from example above?

Community
  • 1
  • 1
Andriy Horen
  • 2,861
  • 4
  • 18
  • 38
  • 4
    You will always have collisions in hash codes. That just is the nature of hash codes. for that reason you **always** implement a equals function for a class that generates a hash code to verify that two objects returning the same hash code are really equal. – Nitram Jul 03 '15 at 12:35
  • Objects from my example have different state with same hash codes. Am i missing something? – Andriy Horen Jul 03 '15 at 12:37
  • Ignore the suggestion and do not use `0` for `null` instead use something else as `String.Empty` is also giving you `0` on mono, Also as @Nitram said you can't always avoid collision. – Habib Jul 03 '15 at 12:37
  • @Nitram Isn't it a main purpose of hashcodes to represent each unique state with unique value? – Andriy Horen Jul 03 '15 at 12:38
  • 1
    You should be implementing `IEquatable`. A hashcode should not be used to define an objects uniqueness. – Yuval Itzchakov Jul 03 '15 at 12:38
  • No. The purpose of `GetHashCode` is to provide a good distribution of your data when using structures that divide entities into buckets, such as a `Dictionary`. – Yuval Itzchakov Jul 03 '15 at 12:39
  • 1
    @AndriyHoren Short answer : no. How would you generate a unique int Hashcode from strings containing potentially thousands of characters, not to mention more complex objects ? – xlecoustillier Jul 03 '15 at 12:39
  • @AndriyHoren No it is not. The hash code is basically there to tell you that two objects could be equal. Those are used for the storage system in `HashSet`s and `Dictionary`s for example to speed up the lookup performance. – Nitram Jul 03 '15 at 12:40
  • `GetHashCode` **needs** to return the same value for the lifetime of the object. Dictionaries, for example, rely on the value not changing. So you **can't** compute hash code on properties that change value. You must only ever use read-only properties or fields. – Enigmativity Jul 03 '15 at 12:55

2 Answers2

4

My question: How should I handle null values while using algorithm from example above?

I don't think the problem is with null per-se. The problem lays in the fact you're using GetHashCode for equality, which it isn't meant for. GetHashCode should provide such hashes that aspire to normal distribution.

The docs say:

Two objects that are equal return hash codes that are equal. However, the reverse is not true: equal hash codes do not imply object equality, because different (unequal) objects can have identical hash codes.

And then goes on to specify the purpose of GetHashCode:

A hash code is intended for efficient insertion and lookup in collections that are based on a hash table.

You should be implementing IEquatable<Entity>, where you actually define the equivalence relation of two entities. And override != and == while you're at it.

An approximation:

public class Entity : IEquatable<Entity>
{
    public int EntityId { get; set; }
    public string EntityName { get; set; }

    public bool Equals(Entity other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return EntityId == other.EntityId && 
               string.Equals(EntityName, other.EntityName);
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != this.GetType()) return false;
        return Equals((Entity) obj);
    }

    public static bool operator ==(Entity left, Entity right)
    {
        return Equals(left, right);
    }

    public static bool operator !=(Entity left, Entity right)
    {
        return !Equals(left, right);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            return (EntityId*397) ^ (EntityName != null ? EntityName.GetHashCode() : 0);
        }
    }
}
Yuval Itzchakov
  • 146,575
  • 32
  • 257
  • 321
4

Your "problem" here is that you are trying the get collision free hash codes. While this is perfect for the lookup performance of collection implementations that use the hash code for lookup (e.g. HashSet and Dictionary) in the most cases this will not work.

The reason for that is that the hash code is just a 32-bit integer value and it represents data that is usually a lot bigger (multiple integer values, strings, etc.).

So the hash code is only there to define that two objects could be equal. The collection classes use the hash code to refine the area where the object is stored and use the equals function to find if two objects are really the same. For that reason you should always implement the Equals function for classes you implemented the hash code for. While those classes will fall back to the equals function of object, is it also a good idea to implement the IEquatable<T> interface to avoid typing problems of any kind (still overwrite the default equals method of Object!)

Mike Chamberlain
  • 39,692
  • 27
  • 110
  • 158
Nitram
  • 6,486
  • 2
  • 21
  • 32
  • 1
    Now I see, for some reason I used to think that it is good idea to implement `Equals` in the way `obj.GetHashCode() == this.GetHashCode()` (+ type/null checks) and that is why I was looking for collision free algorithm, but actually, as you said 32-bit integer hash will not guarantee uniquness as it too small for most cases. Thank you – Andriy Horen Jul 03 '15 at 13:07