Short answer: Yes.
But short answers are no fun, are they?
When you are implementing GetHashCode() you have to make the following guarantee:
When GetHashCode() is called on another object that should be considered equal to this, in this App Domain, the same value will be returned.
That's it. There are some things you really need to try to do (spread the bits around for non-equal objects as much as possible, but don't take so long about it that it outweighs all the benefits of hashing in the first place), and your code will suck if you don't, but it won't actually break. It will break if you don't go as far as the guarantee itself, because then e.g.:
    dict[myObj] = 3;
    int x = dict[myObj]; // KeyNotFoundException
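To make that failure concrete, here is a small sketch using a hypothetical BrokenKey type (made up for illustration, not from any library). Its Equals compares by name while its GetHashCode returns a per-instance number, so two equal keys hash differently:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration: equality is by Name only.
public sealed class BrokenKey
{
    private static int counter;
    private readonly int instanceNumber = ++counter;

    public string Name { get; }

    public BrokenKey(string name) { Name = name; }

    public override bool Equals(object obj) =>
        obj is BrokenKey other && other.Name == Name;

    // Bug: two objects that are Equals() return different hash codes.
    public override int GetHashCode() => instanceNumber;
}

public static class BrokenKeyDemo
{
    // Stores a value under one key, then looks it up with an equal key.
    public static bool LookupSucceeds()
    {
        var dict = new Dictionary<BrokenKey, int>();
        dict[new BrokenKey("a")] = 3;
        return dict.TryGetValue(new BrokenKey("a"), out _);
    }
}
```

BrokenKeyDemo.LookupSucceeds() returns false here: the dictionary only calls Equals on entries whose stored hash code matches, so it never reaches the equal key.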
Okay. If I'm implementing GetHashCode(), why might I go further than that, and why might I not?
First, why might I not?
Maybe it's a slightly different version of the assembly and I improved the algorithm (or at least attempted to) between builds.
Maybe one is 32-bit and one is 64-bit and I was going nuts for efficiency and chose a different algorithm for each to make use of the different word sizes (this is not unheard of, especially when hashing objects like collections or strings).
Maybe some element I consider when deciding what constitutes "equal" objects itself varies from system to system in this sort of way.
Maybe I actually deliberately introduce a different seed with different builds to catch any case where a colleague is mistakenly depending upon my hash code! (I've heard MS do this with their implementation for string.GetHashCode(), but can't remember whether I heard that from a credible or credulous source).
Mainly though, it'll be one of the first two reasons.
Now, why might I give such a guarantee?
Most likely if I do, it'll be by chance. If an element can be compared for equality on the basis of a single integer id alone, then that's what I'm going to use as my hash-code. Anything else will be more work for a less good hash. I'm not likely to change this, so I might as well give the guarantee.
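A minimal sketch of that "by chance" case, assuming a hypothetical Customer type whose equality rests on its integer id alone (the type and its members are made up for illustration):

```csharp
using System;

// Hypothetical entity: two customers are equal when their ids match.
public sealed class Customer : IEquatable<Customer>
{
    public int Id { get; }
    public string DisplayName { get; }

    public Customer(int id, string displayName)
    {
        Id = id;
        DisplayName = displayName;
    }

    public bool Equals(Customer other) => other != null && other.Id == Id;

    public override bool Equals(object obj) => Equals(obj as Customer);

    // The id itself is the hash code, so the value does not depend on
    // the machine, the process, or the build.
    public override int GetHashCode() => Id;
}
```

Nothing here was done for the sake of consistency across machines; it simply falls out of using the cheapest hash that matches the equality rule.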
The other reason why I might, is that I want that guarantee myself. There's nothing to say I can't provide it, just that I don't have to.
Okay, let's get to something practical. There are cases where you may want a machine-independent guarantee. There are cases where you may want the opposite, which I'll come to in a bit.
First, check your logic. Can you handle collisions? Good, then we'll begin.
If it's your own class, then implement GetHashCode() so as to provide such a guarantee, document it, and you're done.
If it's not your class, then implement IEqualityComparer<T> in such a way as to provide it. For example:
    public class ConsistentGuaranteedComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y)
        {
            return x == y;
        }
        public int GetHashCode(string obj)
        {
            if(obj == null)
                return 0;
            // unchecked: overflow must wrap rather than throw, whatever the compiler settings
            unchecked
            {
                int hash = obj.Length;
                for(int i = 0; i != obj.Length; ++i)
                    hash = (hash << 5) - hash + obj[i]; // hash * 31 + character
                return hash;
            }
        }
    }
Then use this instead of the built-in hash-code.
There's an interesting case where we may want the opposite. If I can control the set of strings you are hashing, then I can pick a bunch of strings with the same hash-code. Your hash-based collection's performance will hit the worst case and be pretty atrocious. Chances are I can keep doing this faster than you can deal with it, so it can be a denial of service attack. There are not many cases where this happens, but an important one is if you're handling XML documents I send and you can't just rule out some elements (a lot of formats allow for freedom of elements within them). Then the NameTable inside your parser will be hurt. In this case we create a new hash mechanism each time:
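    public class RandomComparer : IEqualityComparer<string>
    {
        private int hashSeed = Environment.TickCount;
        public bool Equals(string x, string y)
        {
            return x == y;
        }
        public int GetHashCode(string obj)
        {
            if(obj == null)
                return 0;
            // uint arithmetic gives logical right shifts (the >>> operator only exists from C# 11)
            // and unchecked makes overflow wrap instead of throwing.
            unchecked
            {
                uint hash = (uint)(hashSeed + obj.Length);
                for(int i = 0; i != obj.Length; ++i)
                    hash = (hash << 5) - hash + obj[i]; // parentheses needed: << binds looser than - and +
                // final mixing of the bits
                hash += (hash << 15) ^ 0xffffcd7d;
                hash ^= (hash >> 10);
                hash += (hash << 3);
                hash ^= (hash >> 6);
                hash += (hash << 2) + (hash << 14);
                return (int)(hash ^ (hash >> 16));
            }
        }
    }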
This will be consistent within a given use, but not consistent from use to use, so an attacker can't construct input to force it to be DoSsed. Incidentally, NameTable doesn't use an IEqualityComparer<T> because it wants to deal with char-arrays with indices and lengths without constructing a string unless necessary, but it does do something similar.
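To see how cheaply colliding input can be produced for an unseeded hash, here is a sketch against the same multiply-by-31 arithmetic as ConsistentGuaranteedComparer. CollisionDemo and its members are hypothetical names for illustration. It relies on the fact that 'A' * 31 + 'a' and 'B' * 31 + 'B' both equal 2112, so the two-character blocks "Aa" and "BB" can be swapped freely without changing the hash:

```csharp
using System;
using System.Collections.Generic;

public static class CollisionDemo
{
    // Same arithmetic as the unseeded multiply-by-31 string hash.
    public static int Hash(string s)
    {
        unchecked
        {
            int hash = s.Length;
            for (int i = 0; i != s.Length; ++i)
                hash = (hash << 5) - hash + s[i];
            return hash;
        }
    }

    // Builds 2^blocks distinct strings of equal length from "Aa" and "BB".
    // Every string returned has the same Hash() value.
    public static List<string> CollidingStrings(int blocks)
    {
        var results = new List<string> { "" };
        for (int b = 0; b < blocks; ++b)
        {
            var next = new List<string>(results.Count * 2);
            foreach (string prefix in results)
            {
                next.Add(prefix + "Aa");
                next.Add(prefix + "BB");
            }
            results = next;
        }
        return results;
    }
}
```

Sixteen blocks already give 65,536 keys that all land in one bucket. One caveat worth checking in your own code: these collisions come from the per-character arithmetic alone, so a seed that is only added before the loop leaves them intact. The seed has to influence the mixing of every character before this kind of precomputed input stops working.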
Incidentally, in Java the hash-code for String is specified and won't change, but this may not be the case for other classes.
Edit: Having done some research into the overall quality of the approach taken in ConsistentGuaranteedComparer above, I'm no longer happy with having such algorithms in my answers; while it serves to describe the concept, it doesn't have as good a distribution as one might like. Of course, if one has already implemented such a thing, then one can't change it without breaking the guarantee, but I'd now recommend using this library of mine, written after said research, as follows:
    public class ConsistentGuaranteedComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y)
        {
            return x == y;
        }
        public int GetHashCode(string obj)
        {
            return obj.SpookyHash32();
        }
    }
That for RandomComparer above isn't as bad, but can also be improved:
    public class RandomComparer : IEqualityComparer<string>
    {
        private int hashSeed = Environment.TickCount;
        public bool Equals(string x, string y)
        {
            return x == y;
        }
        public int GetHashCode(string obj)
        {
            return obj.SpookyHash32(hashSeed);
        }
    }
Or for even harder predictability:
    public class RandomComparer : IEqualityComparer<string>
    {
        private long seed0 = Environment.TickCount;
        private long seed1 = DateTime.Now.Ticks;
        public bool Equals(string x, string y)
        {
            return x == y;
        }
        public int GetHashCode(string obj)
        {
            return obj.SpookyHash128(seed0, seed1).GetHashCode();
        }
    }