
I am having a problem with hash collisions using short strings in .NET 4.
EDIT: I am using the built-in string hashing function in .NET.

I'm implementing a cache using objects that store the direction of a conversion, like this:

public class MyClass
{
    private string _from;
    private string _to;

    // More code here....

    public MyClass(string from, string to)
    {
        this._from = from;
        this._to = to;
    }

    public override int GetHashCode()
    {
        return string.Concat(this._from, this._to).GetHashCode();
    }

    public bool Equals(MyClass other)
    {
        if (other == null) return false;
        return this._to == other._to && this._from == other._from;
    }

    public override bool Equals(object obj)
    {
        if (obj == null) return false;
        if (this.GetType() != obj.GetType()) return false;
        return Equals(obj as MyClass);
    }
}

The conversion is direction-dependent, and the from and to values are represented by short strings like "AAB" and "ABA".

I am getting scattered hash collisions with these small strings. I have tried something simple like adding a salt (which did not work).

The problem is that too many of my small strings, like "AABABA", collide with their reverses, like "ABAAAB" (note that these are not real examples; I have no idea whether "AAB" and "ABA" actually cause collisions!).

I have also gone heavy-duty and implemented MD5, which works but is MUCH slower.

I have also implemented the suggestion from Jon Skeet here:
Should I use a concatenation of my string fields as a hash code? This works, but I don't know how dependable it is with my various 3-character strings.

How can I improve and stabilize the hashing of small strings without adding too much overhead like MD5?

EDIT: In response to a few of the answers posted... the cache is implemented using concurrent dictionaries keyed on MyClass as stubbed out above. If I replace the GetHashCode in the code above with something simple like @JonSkeet's code from the link I posted:

unchecked // overflow is expected and fine when combining hash codes
{
    int hash = 17;
    hash = hash * 23 + this._from.GetHashCode();
    hash = hash * 23 + this._to.GetHashCode();
    return hash;
}

Everything functions as expected. It's also worth noting that in this particular use case the cache is not used in a multi-threaded environment, so there is no race condition.
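
For illustration, here is a minimal sketch of how such a cache can be wired up (the ConversionCache wrapper and DoConversion stand-in are simplified placeholders, not my real code):

using System.Collections.Concurrent;

public class ConversionCache
{
    private readonly ConcurrentDictionary<MyClass, string> _cache =
        new ConcurrentDictionary<MyClass, string>();

    // Hypothetical stand-in for the real (expensive) conversion.
    private string DoConversion(MyClass key) { return "..."; }

    public string GetOrConvert(string from, string to)
    {
        // GetOrAdd hashes the key via GetHashCode and, on a hash match,
        // confirms with Equals before reusing a cached value.
        return _cache.GetOrAdd(new MyClass(from, to), DoConversion);
    }
}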

EDIT: I should also note that this misbehavior is platform-dependent. It works as intended on my fully updated Win7 x64 machine but does not behave properly on a non-updated Win7 x64 machine. I don't know the extent of the missing updates, but I know it doesn't have Win7 SP1... so I would assume it may be missing a framework SP or update as well.

EDIT: As suggested, my issue was not caused by a problem with the hashing function. I had an elusive race condition, which is why it worked on some computers but not others, and also why a "slower" hashing method made things work properly. The answer I selected was the most useful in understanding why my problem was not hash collisions in the dictionary.

Matthew
    You're getting collisions with 3-character strings? Care to post some of them? I suspect something other than the hash function. – Dour High Arch Dec 22 '11 at 00:57
  • Can you show how the cache is implemented? – saus Dec 22 '11 at 04:42
  • In the code you've shown, Equals is not an override of "public virtual bool Equals(object obj)"... And which updates are you referring to below? – ttil Dec 22 '11 at 19:10
  • I ran a few simple tests on generated 3 + 3 pairs. I got a collision rate of 1.7 for the first case (with Concat) and a rate of 4 for the second. As expected, the second one gives more collisions (but is still better in this case, because it doesn't create new objects). And I don't see how collisions could cause your issues. Collisions are expected and handled correctly in Dictionary. It's more about Equals, but that looks fine, and MyClass looks good. I would say the problem is somewhere outside of this class: how the dictionary is used, how MyClass instances are created, and so on... – ttil Dec 22 '11 at 19:56
  • @Downvoter, care to explain? I may be on the wrong track for a solution here but I am certainly putting effort into the question.... – Matthew Dec 22 '11 at 22:01

2 Answers


Are you sure that collisions are what causes your problems? When you say

I finally discovered what was causing this bug

do you mean some slowness in your code, or something else? If not, I'm curious what kind of problem it is, because any hash function (except a "perfect" hash function on a limited domain) will produce collisions.

I put together a quick piece of code to check for collisions among 3-letter words, and it doesn't report a single collision. You see what I mean? It looks like the built-in hash algorithm is not so bad.

// Track every hash value seen; report any value produced twice.
HashSet<int> set = new HashSet<int>();
char[] buffer = new char[3];
int count = 0;
// Note: 'A'..'z' also spans a few punctuation characters between
// 'Z' and 'a', which only widens the test domain.
for (int c1 = (int)'A'; c1 <= (int)'z'; c1++)
{
    buffer[0] = (char)c1;
    for (int c2 = (int)'A'; c2 <= (int)'z'; c2++)
    {
        buffer[1] = (char)c2;
        for (int c3 = (int)'A'; c3 <= (int)'z'; c3++)
        {
            buffer[2] = (char)c3;
            string str = new string(buffer);
            count++;
            // HashSet<int>.Add returns false if the hash was seen before.
            if (!set.Add(str.GetHashCode()))
            {
                Console.WriteLine("Collision for {0}", str);
            }
        }
    }
}

Console.WriteLine("Generated {0} of {1} hashes", set.Count, count);

While you could pick almost any well-known hash function (as David mentioned), or even choose a "perfect" hash since your domain looks limited (for example, a minimal perfect hash), it would first be great to understand whether the source of your problems really is collisions.
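
For example, here is a minimal sketch of a perfect hash, assuming the domain really is 3-character strings of uppercase 'A'..'Z' (an assumption; widen the base if the real domain is larger). Each string is encoded as a base-26 number, so distinct inputs can never collide:

// Perfect hash: maps each 3-letter uppercase string to a unique
// value in [0, 17575], so collisions are impossible by construction.
static int PerfectHash(string s)
{
    return (s[0] - 'A') * 26 * 26
         + (s[1] - 'A') * 26
         + (s[2] - 'A');
}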

Update

What I want to say is that the built-in .NET hash function for strings is not so bad. It doesn't produce so many collisions that you would need to write your own algorithm in regular scenarios. And this doesn't depend on the length of the strings: the output is a fixed 32 bits either way, so if you have a lot of 6-symbol strings, that doesn't imply your chances of seeing a collision are higher than with 1000-symbol strings. This is one of the basic properties of hash functions.

And again, another question: what kind of problems do you experience because of collisions? All built-in hashtables and dictionaries support collision resolution. So I would say all you could see is just... probably some slowness. Is this your problem?

As for your code

return string.Concat(this._from, this._to).GetHashCode(); 

This can cause problems, because every hash code calculation creates a new string. Maybe this is what causes your issues?

int hash = 17; 
hash = hash * 23 + this._from.GetHashCode(); 
hash = hash * 23 + this._to.GetHashCode();         
return hash; 

This is a much better approach, simply because you don't create new objects on the heap. That is actually one of the main points of this approach: getting a good hash code for an object with a composite "key" without creating new objects. So if you don't have a single-value key, this should work for you. BTW, this is not a new hash function; it's just a way of combining existing hash values without compromising the main properties of hash functions.
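
And since _from and _to never change after construction, the combined hash can even be computed once and cached (a sketch, not code from the question; Equals omitted for brevity):

public class MyClass
{
    private readonly string _from;
    private readonly string _to;
    private readonly int _hash; // safe to cache: the fields are immutable

    public MyClass(string from, string to)
    {
        _from = from;
        _to = to;
        unchecked
        {
            int hash = 17;
            hash = hash * 23 + from.GetHashCode();
            hash = hash * 23 + to.GetHashCode();
            _hash = hash;
        }
    }

    public override int GetHashCode()
    {
        return _hash; // no per-call allocation or recomputation
    }
}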

ttil
  • Yes, I too am curious why a colliding value might cause a bug. At the very least, it seems that hashing has been misunderstood because collisions are expected and should be dealt with regardless (other than in the case of a perfect hash such as you describe here) – spender Dec 22 '11 at 01:09
  • +1 I find it hard to believe Microsoft's hashing implementation is that bad. I wonder if he's perhaps truncating the hash or otherwise mishandling it, and that's causing the collisions. And, of course, if your code has bugs on a hash collision, you're doing something very wrong, since collisions should be expected. – David Schwartz Dec 22 '11 at 02:52
  • My description was misleading. They are actually six-character strings, made by concatenating two three-character strings. – Matthew Dec 22 '11 at 15:17
  • @Matthew PK I've updated my post a bit. But it's still unclear what kind of problems collisions cause for you... – ttil Dec 22 '11 at 18:25
  • @ttil I am using `MyClass` as posted as the key in a `ConcurrentDictionary`. The defect I'm seeing is that when I create a new `MyClass` with the `to` and `from` reversed, the `ConcurrentDictionary` does not recognize it as a new key. – Matthew Dec 22 '11 at 18:37
  • @Matthew PK Here is a possible problem. The dictionary doesn't rely on the hash code only. If the hash codes are different, all right, the objects are different. But if the hash codes are equal, then the dictionary uses the Equals method to actually compare the objects (see http://msdn.microsoft.com/en-us/library/system.object.gethashcode.aspx). So first of all, you need to override the Equals method as well as GetHashCode. And second, make sure all of that is implemented correctly, because what you just described doesn't sound right. Actual code will help. – ttil Dec 22 '11 at 18:46
  • @MatthewPK Just a small clarification. By "doesn't sound right" I mean: if you do have two different objects, why would the dictionary indicate that they are the same? Either they really are the same, or MyClass tells it that they are the same. And a collision can't cause this. I would say this is either the Equals implementation or some other mistake in MyClass (like fields being assigned the wrong way...). Those are my speculations :) – ttil Dec 22 '11 at 19:01
  • @ttil I have correctly overridden the equality as well. I edited my question to demonstrate this. The code I posted is a pretty good representation of my actual code. – Matthew Dec 22 '11 at 19:03
  • @ttil I might also add that this behavior is platform-dependent. As it stands now, it behaves properly on a fully updated machine but not on an un-updated machine. – Matthew Dec 22 '11 at 19:04
  • I marked this the answer because it was most instrumental in locating my defect. – Matthew Dec 27 '11 at 15:44

Any common hash function should be suitable for this purpose. If you're getting collisions on short strings like that, I'd say you're using an unusually bad hash function. You can use Jenkins or Knuth's with no issues.

Here's a very simple hash function that should be adequate. (Implemented in C, but should easily port to any similar language.)

/* Shift-and-xor mixing in the style of Jenkins' one-at-a-time hash. */
unsigned int hash(const char *it)
{
    unsigned int hval = 0;
    while (*it != 0)
    {
        hval += *it++;
        hval += (hval << 10);
        hval ^= (hval >> 6);
        hval += (hval << 3);
        hval ^= (hval >> 11);
        hval += (hval << 15);
    }
    return hval;
}

Note that if you want to trim the bits of the output of this function, you must use the least significant bits; you can also use mod to reduce the output range. The last character of the string tends to affect only the low-order bits, so if you need a more even distribution, change return hval; to return hval * 2654435761U;.
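
A rough C# port, just to illustrate the trimming advice (Hash and Bucket are illustrative names, not part of the answer above):

static uint Hash(string s)
{
    // Hashes UTF-16 code units rather than raw bytes; close enough
    // for illustration.
    uint hval = 0;
    unchecked // wraparound is intended
    {
        foreach (char c in s)
        {
            hval += c;
            hval += hval << 10;
            hval ^= hval >> 6;
            hval += hval << 3;
            hval ^= hval >> 11;
            hval += hval << 15;
        }
    }
    return hval;
}

static int Bucket(string s, int tableSize)
{
    // Multiply first to spread entropy into the high bits, then
    // reduce to the table range with mod, per the note above.
    return (int)(unchecked(Hash(s) * 2654435761U) % (uint)tableSize);
}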

Update:

public override int GetHashCode()
{
    return string.Concat(this._from, this._to).GetHashCode();
}

This is broken. It treats from="foot",to="ar" as the same as from="foo",to="tar". Since your Equals function doesn't consider those equal, your hash function should not. Possible fixes include:

1) Form the string from + "XXX" + to and hash that. (This assumes the string "XXX" almost never appears in your input strings.)

2) Combine the hash of 'from' with the hash of 'to'. You'll have to use a clever combining function. For example, XOR or sum will cause from="foo",to="bar" to hash the same as from="bar",to="foo". Unfortunately, choosing the right combining function is not easy without knowing the internals of the hashing function. You can try:

int hc1=from.GetHashCode();
int hc2=to.GetHashCode();
return (hc1<<7)^(hc2>>25)^(hc1>>21)^(hc2<<11);
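
Dropped into MyClass, that combiner might look like this (a sketch; the unsigned casts mimic C's logical shifts, since C#'s >> on a negative int is arithmetic):

public override int GetHashCode()
{
    unchecked
    {
        uint hc1 = (uint)_from.GetHashCode();
        uint hc2 = (uint)_to.GetHashCode();
        // Asymmetric mix: swapping _from and _to changes the result,
        // so (from, to) and (to, from) no longer hash alike.
        return (int)((hc1 << 7) ^ (hc2 >> 25) ^ (hc1 >> 21) ^ (hc2 << 11));
    }
}
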
David Schwartz