I'm having the following problem:
- Given an unordered, arbitrarily large set of IDs (e.g. 1, 2, 5, 6, 8 in a 32-bit space)
- Calculate a hash code in a larger space (e.g. 64-bit).
The easy way is to calculate a hash for each individual ID and then XOR everything together. However, if you have a 32-bit space for IDs and a 64-bit space for the hash, that might not be the best approach (collisions and so on...).
I've been thinking about running each ID through the Murmur3 finalizer and then XOR'ing the results together, but I suspect that won't work either, for the same reason (I'm not sure, to be honest). Similarly, simply multiplying the values together would be order-independent (because ab = ba), but I'm not sure how 'good' that hash function would be.
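For concreteness, here's the finalizer-then-XOR idea sketched in Python (for brevity); the constants are the ones from Murmur3's fmix64 finalizer:

```python
def fmix64(x):
    # MurmurHash3's 64-bit finalizer ("fmix64"): avalanches the bits of x.
    x ^= x >> 33
    x = (x * 0xFF51AFD7ED558CCD) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 33
    x = (x * 0xC4CEB9FE1A85EC53) & 0xFFFFFFFFFFFFFFFF
    x ^= x >> 33
    return x

def hash_id_set(ids):
    # XOR is commutative and associative, so the input order doesn't matter.
    h = 0
    for i in ids:
        h ^= fmix64(i)
    return h
```

Since fmix64 is a bijection on 64-bit values, two sets that differ in one element get different per-element hashes; my worry is about what XOR does to the combined distribution, not the per-element step.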
Obviously, sorting the IDs comes to mind, after which Murmur3 will do nicely. Still, I'd rather not sort if I can avoid it.
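For reference, the sort-then-hash variant I'd like to avoid looks something like this (Python sketch; the initial constant and multiplier are arbitrary placeholders, not anything vetted):

```python
def hash_sorted(ids):
    # Sorting first makes an order-sensitive polynomial hash
    # independent of the order the IDs arrive in.
    h = 0x9E3779B97F4A7C15  # arbitrary starting constant (assumption)
    for i in sorted(ids):
        h = (h * 0x100000001B3 + i) & 0xFFFFFFFFFFFFFFFF
    return h
```

This costs an O(n log n) sort per hash, which is exactly the overhead I'm trying to avoid.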
What's a good algorithm for such a hash function?
Update
OK, I guess I may have been a bit confusing.
The second answer on Why is XOR the default way to combine hashes? actually explains combining hash functions. In the case presented there, XOR is argued to be a bad way to combine hashes, because "dab" produces the same code as "abd". In my case, I want those inputs to generate the same hash value, but I also want to minimize the chance that, say, "abc" also generates the same hash value as "abd".
The whole purpose of most hash functions is that they have a high probability of using the complete key space when you feed them data. In general, these hash functions exploit the fact that data is sequential, multiplying by a large constant to shuffle the bits around. In simple terms:
var hash = SomeInitialConstant;
foreach (var id in ids) {
    hash = hash * SomeConstant + id.GetHashCode();
}
// ... optionally shuffle bits around in a finalizer
return hash;
Now, this works fine if the IDs always arrive in the same order. However, if the IDs are unordered, this won't work, because x * constant + y is not commutative in x and y.
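To make the order-sensitivity concrete, here's a tiny Python version of the loop above (the constants 17 and 31 are arbitrary placeholders):

```python
def poly_hash(ids):
    # Order-sensitive polynomial hash: each step multiplies the running
    # hash by a constant before adding the next ID.
    h = 17  # arbitrary initial constant (assumption)
    for i in ids:
        h = (h * 31 + i) & 0xFFFFFFFFFFFFFFFF
    return h
```

Feeding it [1, 2] gives 17*31 + 1 = 528, then 528*31 + 2 = 16370, while [2, 1] gives 17*31 + 2 = 529, then 529*31 + 1 = 16400, so the two orderings hash differently.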
If you square the IDs, I don't think you will end up using the entire hash space. Consider what happens with large numbers, say 100000, 100001, etc. Their squares are 10000000000, 10000200001, etc. A square can never produce a number like 900000 (simply because sqrt(900000) is not an integer).
In more general terms, it's likely that all the hash space between 10000000000 and 10000200001 will be lost. Meanwhile, the space between, say, 0 and 10 will get a lot of collisions, because the available hash space between the squares of small numbers is small as well.
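A quick Python check of the two claims above, for my own sanity:

```python
import math

# The gap between consecutive squares grows linearly with n:
# (n+1)^2 - n^2 = 2n + 1, so around n = 100000 each step skips
# over 200001 values that no square can ever hit.
gap = 100001**2 - 100000**2

# 900000 is not a perfect square, so squaring can never produce it:
is_square = math.isqrt(900000) ** 2 == 900000
```

So large parts of the output space are unreachable, and the reachable values cluster near the squares of small inputs.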
The whole purpose of using a large key space is obviously to have few collisions. I would like a pretty large hash space (say, 256 bits) to ensure collisions are virtually non-existent in real-life scenarios.
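For what it's worth, one sketch of the kind of thing I'm after (the choice of SHA-256 and the byte encoding are just placeholders, not a vetted design): hash each ID with a 256-bit hash and combine with addition mod 2^256, which is commutative, so no sorting is needed:

```python
import hashlib

MASK = (1 << 256) - 1

def hash_id_set_256(ids):
    # Hash each ID individually with SHA-256, then combine the digests
    # with addition mod 2^256. Addition is commutative, so the result
    # is independent of the order of the IDs.
    h = 0
    for i in ids:
        digest = hashlib.sha256(i.to_bytes(8, "little")).digest()
        h = (h + int.from_bytes(digest, "little")) & MASK
    return h
```

Whether addition combines the per-element hashes better than XOR in practice is exactly the part I'm unsure about.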