
Feel free to mark this as a duplicate, but most questions I've found so far are too specific or more complex than what I'm looking for. E.g., in "What is a good hash function?", the accepted answer seems to be oriented toward hashing strings.

I've recently started programming in .NET, and I find it unfortunate that the built-in collection classes lack the ability to do some basic things like check for value equality or compute a content-based hash. I'm sure there are design reasons for that; no need to defend .NET. I just want to know how to avoid a significant sidetrack when I need to use a collection as a key in a dictionary. For example, I want two different List objects containing the same values to map to the same entry in the dictionary. Out of the box, they don't: the default behavior for List is that a List is not equal to anything but itself, so another instance of a list with the same values is a different key.
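To make the default behavior concrete, here's a minimal sketch (the names and values are just for illustration):

```csharp
using System;
using System.Collections.Generic;

var a = new List<int> { 1, 2, 3 };
var b = new List<int> { 1, 2, 3 };

var dict = new Dictionary<List<int>, string>();
dict[a] = "first";
dict[b] = "second";   // does NOT overwrite "first": a and b are distinct keys

Console.WriteLine(dict.Count);    // 2 -- reference equality by default
Console.WriteLine(a.Equals(b));   // False, even though the contents match
```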

Implementing Equals is straightforward. It's the hash function that I am unsure of.

Is there just something provided that I can call in my implementation of GetHashCode?

If I have to write it from scratch, what's a really simple but good enough hash algorithm? I could use SHA1 but I think it would be overkill. I could just xor all the hashes of the items, but I think that would have some nasty collision properties. I don't care if computing hashes is blazingly fast, but I don't want my hash table to slow to linear on data sets with some particular distribution. What I'd like is something so simple that I can memorize it. Bonus if you can explain (or link to) why it works.

morningstar

3 Answers


Be very careful here. If you create a GetHashCode method for a List<T> (or similar collection), then presumably it'll do something like this:

public override int GetHashCode()
{
    int hash = 13;
    foreach (var t in this)
    {
        // Combine the previous hash value and the item's hash value.
        // Multiply-and-add with a small prime is just one common choice
        // of combining operation; better mixers exist.
        unchecked { hash = hash * 31 + (t?.GetHashCode() ?? 0); }
    }
    return hash;
}

(I would suggest something like the Jenkins hash for computing the hash code. Also look into the Wang hash (or bit mixer).)
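Here is a sketch of Jenkins's one-at-a-time function used as the combining step. Feeding each item's 32-bit hash code in byte by byte is my own adaptation, not something prescribed by the algorithm itself:

```csharp
using System;
using System.Collections.Generic;

// Jenkins one-at-a-time: mix in one byte at a time, then finalize.
int CombineHashes(IEnumerable<int> itemHashes)
{
    uint hash = 0;
    foreach (int h in itemHashes)
    {
        for (int shift = 0; shift < 32; shift += 8)
        {
            hash += (uint)((h >> shift) & 0xFF);
            hash += hash << 10;
            hash ^= hash >> 6;
        }
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return unchecked((int)hash);
}

int h1 = CombineHashes(new[] { 1, 2, 3 });
int h2 = CombineHashes(new[] { 3, 2, 1 });
Console.WriteLine(h1 == h2); // almost certainly False: the mix is order-sensitive
```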

Unless you compute that value the first time and then cache it, you will end up iterating over all of the items every time GetHashCode is called.

So you've created a GetHashCode and Equals for your collection and you put an instance into a Dictionary. Now you have to be very careful not to change the collection (i.e. don't add or remove any items) or any of the items inside the collection. Otherwise the value of GetHashCode will change, and the dictionary won't work anymore.

I strongly suggest that if you want to use a collection as the key to a dictionary, you make very sure that the collection is immutable.
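One way to follow that advice is an immutable wrapper type; ListKey is a hypothetical name of mine, and the multiply-by-31 combiner is just one common choice. It snapshots the items and computes the hash exactly once, in the constructor:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var dict = new Dictionary<ListKey<int>, string>();
dict[new ListKey<int>(new[] { 1, 2, 3 })] = "found";
Console.WriteLine(dict.ContainsKey(new ListKey<int>(new[] { 1, 2, 3 }))); // True

sealed class ListKey<T> : IEquatable<ListKey<T>>
{
    private readonly T[] items;   // private copy: callers can't mutate it
    private readonly int hash;    // computed once, never recomputed

    public ListKey(IEnumerable<T> source)
    {
        items = source.ToArray();
        int h = 13;
        foreach (var t in items)
            unchecked { h = h * 31 + (t?.GetHashCode() ?? 0); }
        hash = h;
    }

    public bool Equals(ListKey<T> other) =>
        other != null && items.SequenceEqual(other.items);

    public override bool Equals(object obj) => Equals(obj as ListKey<T>);
    public override int GetHashCode() => hash;
}
```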

One other thing to consider: the concept of list equality isn't as simple as you indicate. For example, are the lists [1, 2, 3, 4, 5] and [5, 1, 3, 4, 2] equal? It rather depends on your definition of equality. Certainly the two lists contain the same distinct items (as sets, the union of A and B equals their intersection), which means that they're equal if your definition of equality is "contain the same items." But if order matters, then the lists aren't equal.

If your definition is "contain the same items," then the hash code calculation I showed above isn't going to work because hash code computations are order dependent. So if you wanted to compute the hash code of those lists, you'd have to sort them first.
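For the "contain the same items" definition, a minimal sketch of that sort-first idea (sorting the hash codes rather than the items themselves, so T needn't be comparable; the multiply-by-31 combiner is just one common choice):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int OrderInsensitiveHash<T>(IEnumerable<T> items)
{
    // Hash each item, sort the hash codes, then combine in sorted order,
    // so the order of elements in the collection no longer matters.
    var hashes = items.Select(t => t?.GetHashCode() ?? 0).ToList();
    hashes.Sort();
    int hash = 13;
    foreach (int h in hashes)
        unchecked { hash = hash * 31 + h; }
    return hash;
}

Console.WriteLine(OrderInsensitiveHash(new[] { 1, 2, 3, 4, 5 }) ==
                  OrderInsensitiveHash(new[] { 5, 1, 3, 4, 2 })); // True
```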

If the lists cannot contain duplicates, then computing equality is a matter of creating a hash set of one list and looking up each item from the other list in that hash set. If the lists can contain duplicates, then you either have to sort them to determine equality, or use some kind of dictionary with a count. And both of those imply that the objects contained in the list will implement some form of equality comparer, etc.
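The "dictionary with a count" approach for lists that may contain duplicates can be sketched like this (assumes the items are non-null and have sensible Equals/GetHashCode):

```csharp
using System;
using System.Collections.Generic;

bool MultisetEquals<T>(IEnumerable<T> a, IEnumerable<T> b)
{
    // Count each distinct item in a, then consume the counts with b.
    var counts = new Dictionary<T, int>();
    foreach (var t in a)
        counts[t] = counts.TryGetValue(t, out var c) ? c + 1 : 1;
    foreach (var t in b)
    {
        if (!counts.TryGetValue(t, out var c))
            return false;             // b has an item a doesn't (or too many of it)
        if (c == 1) counts.Remove(t);
        else counts[t] = c - 1;
    }
    return counts.Count == 0;         // anything left over means a had extras
}

Console.WriteLine(MultisetEquals(new[] { 1, 2, 2, 3 }, new[] { 2, 1, 3, 2 })); // True
Console.WriteLine(MultisetEquals(new[] { 1, 2, 2 }, new[] { 1, 2 }));          // False
```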

And some definitions of equality don't take duplicates into account at all. That is, [1, 2, 3] would be equal to [3, 3, 3, 2, 1, 1].

Considering the varying definitions of equality and the effort it would have taken to support them all in defining the behavior of List<T>, I can understand why whoever designed the collection classes didn't implement value equality. Especially considering that it's pretty uncommon to use a List<T> or similar collection as the key in a dictionary or hash table.

Jim Mischel
  • I'm not sure why someone downvoted this answer but my upvote was for a) the excellent point that if a list is to be used as a key in a collection, it would have to be immutable, and b) the point that the hash code for the collection should be cached to avoid recomputation cost (which is, again, easier if the collection is immutable). I'd upvote again, if I could, for the recently-added discussion of what it might mean in different cases for two lists to be equal, which is another important point to bear in mind. – Simon Aug 14 '13 at 04:39
  • A lot of unsolicited information I already knew, not much on the actual hash algorithm. – morningstar Aug 14 '13 at 15:59
  • @morningstar: Jenkins is simple and much better than "good enough." That you apparently already knew the unsolicited information was not evident from your question, and I tend to err on the side of too much information. – Jim Mischel Aug 14 '13 at 18:20
  • Jenkins is still oriented to 8-bit data. At least from the Wikipedia link I deduce that it treats all 8 bits equally. Could treat each item's GetHashCode result as a sequence of 4 8-bit values. I guess that will work. I'll give best answer just for the link. – morningstar Aug 15 '13 at 01:07

In my experience, if you have a collection of things and you want to compute their hash, it is best to compute the hash for each individual object separately, collect all of those hash values into an array, and finally compute the hash of that array of hash values.

All of the simpler techniques break down relatively quickly. (Like XORing the values together or multiplying by magic numbers and summing--these have all sorts of pathological failure cases.) The one extra array hash you compute at the end is a small cost and pays off overall.
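A sketch of that two-step scheme; 32-bit FNV-1a is my stand-in for the final "hash of the hash array" step (the answer's author used Jenkins's lookup3, but any decent block hash works):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int HashOfHashes<T>(IEnumerable<T> items)
{
    // Step 1: hash each object separately into an array of hash values.
    int[] itemHashes = items.Select(t => t?.GetHashCode() ?? 0).ToArray();

    // Step 2: hash that array, here with 32-bit FNV-1a over its bytes.
    uint hash = 2166136261;
    foreach (int h in itemHashes)
    {
        for (int shift = 0; shift < 32; shift += 8)
        {
            hash ^= (uint)((h >> shift) & 0xFF);
            hash = unchecked(hash * 16777619);
        }
    }
    return unchecked((int)hash);
}

int h1 = HashOfHashes(new[] { 1, 2, 3 });
int h2 = HashOfHashes(new[] { 1, 2, 3 });
Console.WriteLine(h1 == h2); // True: deterministic for the same inputs
```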

StilesCrisis
  • "compute the hash of your array of hash values" - explain how this is an obvious step. – morningstar Aug 14 '13 at 04:14
  • In my case, I used Bob Jenkins' "lookup3" algorithm for hashing a block of memory. http://burtleburtle.net/bob/hash/doobs.html You can use any approach you prefer--CRC32, MD5, Adler-32, anything that takes an opaque block of memory in and returns a hash value out. – StilesCrisis Aug 14 '13 at 04:43

A good hash function will work equally well on any string of bits, not just characters. However, since a collection may:

  1. Not necessarily be in a contiguous block of memory, and
  2. Include portions that you wouldn't want to include in the hash (e.g. pointers from one element of a linked list to another, which would be different for different linked lists that have the same content but which, for this case, you would want to have the same hash value).

... it seems to me that the key question here may be "what is the best way to combine a set of individual hash values to produce a hash value for a collection?".

XORing the hash values of the individual elements in the collection would be a reasonable approach, in my view. The only problem I can immediately see is that it would lead to two collections with the same elements, but included in different orders, hashing to the same value. An algorithm to avoid this problem could look like this:

  1. Find the hash values of the items in the collection.
  2. Create a bitstring by concatenating those hash values in the order the elements appear in the collection.
  3. Use any reasonable hashing algorithm to generate a hash value for that bitstring of hash values.
  4. Use the hash value calculated in the last step as the hash value for the collection.
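Those four steps might look like the sketch below. MD5 is just a placeholder for "any reasonable hashing algorithm" in step 3 (cryptographic strength isn't needed), and folding the digest down to an int is there to fit GetHashCode's signature:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

int CollectionHash<T>(IEnumerable<T> items)
{
    // Steps 1-2: hash each element, then concatenate the hash values,
    // in collection order, into one bitstring.
    byte[] bitstring = items
        .SelectMany(t => BitConverter.GetBytes(t?.GetHashCode() ?? 0))
        .ToArray();

    // Step 3: hash the bitstring (MD5 purely as an example).
    byte[] digest = MD5.HashData(bitstring);

    // Step 4: fold the digest down to an int to use as the hash value.
    return BitConverter.ToInt32(digest, 0);
}

Console.WriteLine(CollectionHash(new[] { 1, 2, 3 }) ==
                  CollectionHash(new[] { 1, 2, 3 })); // True
```

Note that MD5.HashData requires .NET 5 or later; on older frameworks, MD5.Create().ComputeHash(...) does the same job.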
Simon
  • XORing the values together does not work well in the general case. I've been down that road and watched it fail too often. Your second approach (append hash results into one bitstring, then hash the bitstring) works well. That's basically a reword of what I recommended in my answer :) – StilesCrisis Aug 13 '13 at 23:56
  • I've found an earlier question and answer set in which methods for combining hash values were discussed. See [Why is XOR the default way to combine hashes?](http://stackoverflow.com/questions/5889238/why-is-xor-the-default-way-to-combine-hashes). Briefly: XOR is good because it maintains entropy, but XORing identical values gives a zero result (and identical values are common so this is bad) and XORing is commutative (which is a point I made in my answer), which can also be undesirable. Hence hashing the bitstring of individual hash values, as suggested by @StilesCrisis, is preferable. – Simon Aug 14 '13 at 01:37
  • That sounds alright, but I've come across references that some string hash functions are optimized for ASCII characters. E.g. the bottom 4 bits are expected to be the most significant. – morningstar Aug 14 '13 at 04:19
  • From personal experience, I've seen that XORing works extremely fast even though it causes a few more collisions than the benchmark. Collisions are unavoidable; any hash function is bound to cause them. I would recommend making a small test suite: choose multiple hashing functions and compare them on your own. I did this, and the speed vs. collision tradeoff made me go with the XOR. – Kshitij Banerjee Aug 14 '13 at 10:51
  • Kshitij, I did the same as you, and made the same choice, then years later I wondered why my hash table was colliding so much. Changed it and suddenly my hash tables were working great again, at the cost of calling the hash function once more on a tiny blob of opaque data. – StilesCrisis Aug 14 '13 at 21:06