The way to hash a list is to combine the hashes of its items. Java's standard lists (see AbstractList) implement hashCode() like so:
public int hashCode() {
    int hashCode = 1;
    for (E e : this)
        hashCode = 31*hashCode + (e==null ? 0 : e.hashCode());
    return hashCode;
}
Notable properties:
- The hash code for every empty list is exactly 1.
- The hash codes for lists with different numbers of elements are very likely to differ.
- Hash codes for lists with the same number of elements collide more often. Note that the 31 multiplier does make element order matter: [1,2] and [2,1] hash to 994 and 1024 respectively. Same-length collisions still exist, though; for example, [0,31] and [1,0] both hash to 992, because raising one element by 1 and lowering a later one by 31 cancel out.
- This is a bit of a drawback, but not as big as you might think at first. Hash tables fall back to a full equality check when hash codes match: they compare hash codes first and element-by-element equality second. If the colliding lists differ near the front, that equality check bails out quickly; in the worst case it takes a number of comparisons equal to the length of the lists.
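To make these properties concrete, here is a quick check of List.hashCode() on a few small lists (my own demo, not part of the original answer):

```java
import java.util.Arrays;
import java.util.Collections;

public class ListHashDemo {
    public static void main(String[] args) {
        // Empty lists always hash to 1.
        System.out.println(Collections.emptyList().hashCode());  // 1

        // The 31 multiplier makes order matter:
        System.out.println(Arrays.asList(1, 2).hashCode());      // 994
        System.out.println(Arrays.asList(2, 1).hashCode());      // 1024

        // ...but same-length collisions still exist: +1 on one element
        // and -31 on the next cancel out under the multiplier.
        System.out.println(Arrays.asList(0, 31).hashCode());     // 992
        System.out.println(Arrays.asList(1, 0).hashCode());      // 992
    }
}
```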
All in all, this would be a pretty good hash function for your use case, even if you use the numeric value of each histogram entry as its hash code. The problem you really want to avoid with hash functions is common divisibility: when the outputs share factors with the table size, they pile up in a few buckets instead of spreading across all of them. The Wikipedia article on hash functions covers the properties of a good one if you want more information.
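Here is a tiny sketch of why common divisibility hurts; bucket() is a hypothetical helper of my own, not a real java.util method. If every hash code is a multiple of 4 and the table has 8 buckets, only buckets 0 and 4 are ever used:

```java
public class BucketDemo {
    // Hypothetical helper: map a hash code to one of n buckets,
    // masking off the sign bit so negative hash codes still work.
    static int bucket(int hashCode, int n) {
        return (hashCode & 0x7fffffff) % n;
    }

    public static void main(String[] args) {
        int n = 8;
        // Hash codes all divisible by 4 land in only buckets 0 and 4.
        for (int h = 0; h <= 40; h += 4) {
            System.out.print(bucket(h, n) + " ");  // prints 0 4 0 4 ...
        }
        System.out.println();
    }
}
```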
To obtain a better hash code for a list of numbers, start from a better hash code for an individual number, specifically the one in this answer.
unsigned int hash(const unsigned int *list, size_t length) {
    unsigned int hashCode = 0;
    for (size_t i = 0; i < length; i++) {  /* start at 0, not 1, or the first element is skipped */
        hashCode = hashCode + list[i];
        /* xorshift-multiply finalizer: two multiply rounds, then a final xorshift */
        hashCode = ((hashCode >> 16) ^ hashCode) * 0x45d9f3bu;
        hashCode = ((hashCode >> 16) ^ hashCode) * 0x45d9f3bu;
        hashCode = (hashCode >> 16) ^ hashCode;
    }
    return hashCode;
}
I think that's a good adaptation, but I'm not an expert.
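For completeness, the same adaptation ported back to Java might look like this (my own sketch, not from the linked answer): >>> is Java's unsigned right shift, and int overflow simply wraps, which is exactly what this mixing wants. One quirk worth knowing: this finalizer maps 0 to 0, so a list of zeros hashes to 0.

```java
public class MixHash {
    // Port of the C sketch above: keep a running sum, re-mix after each
    // element with the 0x45d9f3b xorshift-multiply finalizer.
    static int hash(int[] list) {
        int h = 0;
        for (int x : list) {
            h += x;                            // overflow wraps; that's fine here
            h = ((h >>> 16) ^ h) * 0x45d9f3b;  // >>> is the unsigned shift
            h = ((h >>> 16) ^ h) * 0x45d9f3b;
            h = (h >>> 16) ^ h;
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash(new int[] {1, 2}));
        System.out.println(hash(new int[] {2, 1}));
    }
}
```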
Regarding the efficiency of overflow: it's not a slowdown unless the language raises exceptions for it. Java arithmetic never throws on overflow; an int simply wraps around modulo 2^32 (two's-complement arithmetic). There is no real drawback to a negative hash code, as long as your hash table implementation supports it.
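A quick illustration of the wrap-around behavior:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // int arithmetic wraps modulo 2^32; no exception is thrown.
        int h = Integer.MAX_VALUE;
        h = h + 1;
        System.out.println(h == Integer.MIN_VALUE);  // true
        System.out.println(h);                       // -2147483648
    }
}
```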