
I'm looking for a function that maps a multi-set of integers to an integer, hopefully with some kind of guarantee like pairwise independence.

Ideally, memory usage would be constant, and the hash value could be updated in O(1) time after an insert/delete. (This forbids doing something like sorting the integers and using a hash function like h(x) = h_1(x_1, h_2(x_2, h_3(x_3, x_4))).)

XORing hashes together doesn't work because h({1,1,2}) = h({2}).

I think multiplying hashes together modulo a prime might work if the underlying hash function had an unrealistically strong guarantee, such as n-independence.

jonderry

5 Answers


I asked this same question on cstheory.stackexchange.com and got a good answer:

https://cstheory.stackexchange.com/questions/3390/is-there-a-hash-function-for-a-collection-i-e-multi-set-of-integers-that-has

jonderry

I agree with Dzmitry on using the arithmetic SUM of hashes, but I'd recommend using a hash function with a good output distribution for the input integers instead of just reversing the bits of the integer. Reversing bits doesn't improve the output distribution; it can even worsen it, since the probability that the high-order bits will be lost to sum overflow is much higher than the probability that the low-order bits will be lost. Here is an example of a fast hash function with good output distribution: http://burtleburtle.net/bob/c/lookup3.c . See also the paper describing how hash functions should be constructed: http://burtleburtle.net/bob/hash/evahash.html .

Using the SUM of the hash values of each element in the set satisfies the requirements in the question:

  • Memory usage is constant: we store one ordinary integer containing the hash value per set. This integer is used for O(1) updating of the hash when adding/removing elements from the set.
  • Adding a new element requires only adding the element's hash value to the existing hash value, i.e. the operation is O(1).
  • Removing an existing element requires only subtracting the element's hash value from the existing hash value, i.e. the operation is O(1).
  • The hash will differ for sets that differ only by pairs of identical elements (unlike XOR).

SUM and SUB are safe operations in the face of integer overflow, since they are reversible in modular arithmetic, where the modulus is 2^32 or 2^64 for integers in Java.
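For concreteness, here is a minimal sketch of this scheme in Java. The class name and the mixing function (the SplitMix64 finalizer) are my assumptions, not part of the answer; any well-distributed integer hash would do in place of mix:

class SumMultisetHash {
    // Running hash of the multiset. Overflow simply wraps modulo 2^64,
    // which is safe because + and - remain inverses in modular arithmetic.
    private long hash = 0;

    // A strong 64-bit mixer (the SplitMix64 finalizer constants).
    private static long mix(long x) {
        x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
        x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    void add(int x)    { hash += mix(x); }   // O(1) insert
    void remove(int x) { hash -= mix(x); }   // O(1) delete
    long value()       { return hash; }
}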

valyala
  • This may work in practice, though it's easy to raise the odds of a collision fairly significantly. Consider the odds that {1, 1,..., 1} collides with {2, 2,..., 2} if there are, say, 2^16 copies in each set. This can be helped by taking the sum modulo a prime. Still, there's no theoretical guarantee, which is something else I'm curious about (whether one can get a good theoretical guarantee on the distribution of a hash function meeting the above specs for a collection). – jonderry Nov 15 '10 at 16:20
  • > Reversing bits doesn't improve output distribution. It does. If the majority of input values are close to zero, as I stated, then it does so in the best way. – Dzmitry Lazerka Nov 16 '10 at 03:21
  • "... high order bits will be lost" No bits are lost while doing SUM. You have proved that yourself in the last sentence :) – Dzmitry Lazerka Nov 16 '10 at 03:32
  • But your proposal is also good, I think. It is tolerant of cases where the input data is all-even or all-odd numbers (or multiples of greater powers of 2). – Dzmitry Lazerka Nov 16 '10 at 03:56
  • But if the input data is free of such patterns, reverse-bits is the best function for general hashtables. That's because changes in the lower bits of the input numbers place the hashes in different buckets of the hashtable, while your proposal does not distinguish whether the input data is close to zero (as is usually the case) or not. – Dzmitry Lazerka Nov 16 '10 at 04:02

Reverse-bits.

For example, 00001011 becomes 11010000. Then just SUM all the reversed set elements.


If we need O(1) insert/delete, the usual SUM will work (that is how Set.hashCode() is specified in Java), though it is not well distributed over sets of small integers.

If our set's elements are not uniformly distributed (as is usually the case), we need a mapping N -> f(N) such that f(N) is uniformly distributed for the expected data sample. Usually, a data sample contains many more close-to-zero numbers than close-to-maximum numbers. In this case, the reverse-bits hash distributes them uniformly.

Example in Scala:

def hash(v: Int): Int = {
  // Reverse the bits of v (equivalent to java.lang.Integer.reverse(v)).
  var h = v & 1                // start with the lowest bit of v
  for (i <- 1 to 31) {
    h <<= 1                    // make room for the next bit
    h |= (v >>> i) & 1         // append bit i of v
  }
  h
}

def hash(a: Set[Int]): Int = {
  // SUM the reversed-bit hashes of all elements.
  var h = 0
  for (e <- a) {
    h += hash(e)
  }
  h
}

But the hash of our multi-set will still not be uniform, though it is much better than a simple SUM.

Dzmitry Lazerka

I once asked a similar question, "Good hash function for permutations?", and got a hash that worked very well for my use case; I have very few collisions in my working code. It might work well for you too. Calculate something like this:

struct MultisetHash {
    unsigned int hash = 1;  // initialize the hash with 1 (the empty multiset)

    void add(int x) {
        // Multiply by an odd constant derived from x. Multiplication is
        // commutative, so the insertion order does not matter.
        hash *= 1779033703u + 2u * (unsigned int)x;
    }
};

So whenever you add a number x, update your hash code with the above formula. The order of the values is not important; you will always get the same hash value.

When you want to merge two sets, just multiply their hash values.

The only thing I am not sure about is whether it is possible to remove a value in O(1).
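As the first comment below points out, removal becomes possible if the hash is kept modulo a prime, by multiplying with the modular inverse of the removed element's hash. A minimal Java sketch of that idea, reusing the multiplier constant above; the class name, the choice of prime, and the non-zero adjustment are my assumptions:

import java.math.BigInteger;

class ProductMultisetHash {
    static final long P = 2147483647L;   // the Mersenne prime 2^31 - 1
    long hash = 1;                       // hash of the empty multiset

    // Per-element hash; must be non-zero modulo P so that an inverse exists.
    static long elementHash(int x) {
        long h = (1779033703L + 2L * x) % P;
        if (h < 0) h += P;
        return h == 0 ? 1 : h;
    }

    void add(int x) { hash = hash * elementHash(x) % P; }

    // Remove by multiplying with the modular inverse of the element's hash.
    void remove(int x) {
        long inv = BigInteger.valueOf(elementHash(x))
                .modInverse(BigInteger.valueOf(P)).longValue();
        hash = hash * inv % P;
    }
}

Note that computing the inverse costs O(log P) rather than strictly O(1), though for a fixed-width prime this is a constant number of word operations.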

martinus
  • This is similar to what I suggested in the question. You can delete by multiplying by the modular inverse of the hash of the element to be removed. This requires keeping the hash of the collection modulo a prime (and ensuring that the hash of an element is non-zero). The question I have about this method is how to get a theoretical guarantee. – jonderry Nov 15 '10 at 16:02
  • Is there a way this hash function can be modified to support order dependence? i.e. `[a, b, c]` and `[b, c, a]` producing different hash values? – jeffreyveon Sep 06 '20 at 07:17
  • @jeffreyveon if you need order dependence then you can use any hash. – martinus Sep 08 '20 at 12:30
  • I have a list of numbers. Each number in the list can be hashed independently, but how do I then combine those hashes to a single hash that signifies the list as a whole? One approach I have taken is to use the index position as part of the hash update, like this: `this->hash *= (1779033703 + 2*x*i);` where `i` is the index of element x in the list. – jeffreyveon Sep 08 '20 at 13:20
  • You could use something like boost::hash_combine, but I don't see a reason for using anything special; why not just use e.g. murmurhash to hash the whole data block? – martinus Sep 14 '20 at 13:08

Min-hashing should work here. Apply a permutation, maintain a small multiset of the n minimal elements, and pick the biggest.

Elaborating: this is a simple way to work in O(1) time and space. You need something like a priority queue, without making the link to the initial values too obvious. So you order your priority queue according to some elaborate key, which is equivalent to running the priority queue on a permutation of the normal sort order. Make the queue keep track of multiplicity so that the selected elements also form a multiset.

That said, I'm not sure this disperses well enough (and running multiple permutations might become costly), so maybe build on Bradley's answer instead. Here is a tweak so that repeated elements don't cancel out:

xor(int_hash(x_n, multiplicity_n) foreach n)
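One possible Java rendering of that tweak; my sketch, with an assumed pair-mixing function. The multiplicity map is the multiset itself, so the only extra state beyond it is a single long, and each update stays O(1):

import java.util.HashMap;
import java.util.Map;

class XorMultisetHash {
    private final Map<Integer, Integer> count = new HashMap<>(); // the multiset itself
    private long hash = 0;

    // Hash of the pair (element, multiplicity); the mixer is an assumption.
    private static long pairHash(int x, int m) {
        long z = x * 0x9e3779b97f4a7c15L + m;
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        return z ^ (z >>> 31);
    }

    void add(int x) {
        int m = count.merge(x, 1, Integer::sum);   // new multiplicity of x
        if (m > 1) hash ^= pairHash(x, m - 1);     // retract the old (x, m-1) term
        hash ^= pairHash(x, m);                    // contribute the new (x, m) term
    }

    void remove(int x) {                           // assumes x is present
        int m = count.merge(x, -1, Integer::sum);  // new multiplicity of x
        hash ^= pairHash(x, m + 1);                // retract the old (x, m+1) term
        if (m > 0) hash ^= pairHash(x, m); else count.remove(x);
    }
}

Because each element contributes a term keyed by its current multiplicity, duplicates no longer cancel: {1,1,2} contributes pairHash(1,2) ^ pairHash(2,1), which differs from {2}'s pairHash(2,1).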
Tobu