I am having a problem with hash collisions using short strings in .NET4.
EDIT: I am using the built-in string hashing function in .NET.
I'm implementing a cache using objects that store the direction of a conversion like this
public class MyClass
{
private string _from;
private string _to;
// More code here....
public MyClass(string from, string to)
{
this._from = from;
this._to = to;
}
public override int GetHashCode()
{
return string.Concat(this._from, this._to).GetHashCode();
}
public bool Equals(MyClass other)
{
return this.To == other.To && this.From == other.From;
}
public override bool Equals(object obj)
{
if (obj == null) return false;
if (this.GetType() != obj.GetType()) return false;
return Equals(obj as MyClass);
}
}
This is direction dependent and the from
and to
are represented by short strings like "AAB" and "ABA".
I am getting sparse hash collisions with these small strings, I have tried something simple like adding a salt (did not work).
The problem is that too many of my small strings like "AABABA" collides its hash with the reverse of "ABAAAB" (Note that these are not real examples, I have no idea if AAB and ABA actually cause collisions!)
and I have gone heavy duty like implementing MD5 (which works, but is MUCH slower)
I have also implemented the suggestion from Jon Skeet here:
Should I use a concatenation of my string fields as a hash code?
This works but I don't know how dependable it is with my various 3-character strings.
How can I improve and stabilize the hashing of small strings without adding too much overhead like MD5?
EDIT: In response to a few of the answers posted... the cache is implemented using concurrent dictionaries keyed from MyClass
as stubbed out above. If I replace the GetHashCode
in the code above with something simple like @JonSkeet 's code from the link I posted:
int hash = 17;
hash = hash * 23 + this._from.GetHashCode();
hash = hash * 23 + this._to.GetHashCode();
return hash;
Everything functions as expected. It's also worth noting that in this particular use-case the cache is not used in a multi-threaded environment so there is no race condition.
EDIT: I should also note that this misbehavior is platform dependant. It works as intended on my fully updated Win7x64 machine but does not behave properly on a non-updated Win7x64 machine. I don't know the extend of what updates are missing but I know it doesn't have Win7 SP1... so I would assume there may also be a framework SP or update it's missing as well.
EDIT: As susggested, my issue was not caused by a problem with the hashing function. I had an elusive race condition, which is why it worked on some computers but not others and also why a "slower" hashing method made things work properly. The answer I selected was the most useful in understanding why my problem was not hash collisions in the dictionary.