
I'm having the following problem:

  • Given an unordered, arbitrarily large set of IDs (e.g. 1, 2, 5, 6, 8 in a 32-bit space)
  • Calculate a hash code in a larger space (e.g. 64-bit).

The easy way is to calculate a hash for each individual ID and then XOR everything together. However, if you have a 32-bit space for IDs and a 64-bit space for the hash function, that might not be the best way to approach this (collisions and so on...).

I've been thinking about using the Murmur3 finalizer and then XOR'ing the results together, but I guess that won't work either for the same reason (I'm not sure, to be honest). Similarly, simply multiplying values should also work (because ab = ba), but I'm not sure how 'good' that hash function will be.
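
To make that concrete, here is roughly what I have in mind - a sketch only, where Fmix64 is the standard 64-bit Murmur3 finalizer and the method names are just for illustration:

static ulong Fmix64(ulong k)
{
    // standard Murmur3 64-bit finalizer: mixes the bits of a single value
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdUL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53UL;
    k ^= k >> 33;
    return k;
}

static ulong HashIdSet(uint[] ids)
{
    ulong hash = 0;
    foreach (var id in ids)
        hash ^= Fmix64(id);   // XOR is commutative, so the order of the IDs doesn't matter
    return hash;
}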

Obviously, sorting the IDs comes to mind, after which Murmur3 will do nicely. Still, I don't want to sort if I can avoid it.

What's a good algorithm for such a hash function?

update

OK, I guess I might have been a bit confusing.

The second answer to Why is XOR the default way to combine hashes? actually explains combining hash functions. In the case presented there, XOR is argued to be a bad way to combine hashes, because "dab" produces the same code as "abd". In my case, I want these things to generate the same hash value - but I also want to minimize the chance that -say- "abc" also generates the same hash value as -say- "abd".

The whole purpose of most hash functions is that they have a high probability of using the complete key space if you feed them data. In general, these hash functions exploit the fact that data is sequential, and multiply by a large number to shuffle bits around. So in simple terms:

var hash = SomeInitialConstant;
foreach (var id in ids) {
  hash = hash * SomeConstant + hashCode(id);
}
// ... optionally shuffle bits around as finalizer
return hash;

Now, this works fine if the IDs are always in the same order. However, if the IDs are unordered, this won't work, because x * constant + y is not commutative.

If you square the IDs, I don't think that you will end up using the entire hash space. Consider what happens if you have large numbers, say 100000, 100001, etc. The squares of those are 10000000000, 10000200001, etc. There's no way a square will ever generate a number like 900000 (simply because sqrt(900000) is a number with a fraction).

In more general terms, it's likely that all the hash space between 10000000000 and 10000200001 will get lost. However, the space between -say- 0 and 10 will get a lot of collisions, because the available hash space between the squares of small numbers is small as well.

The whole purpose of using a large key space is obviously having few collisions. I would like to have a pretty large hash space (say, 256 bits) to ensure collisions are virtually non-existent in real-life scenarios.

atlaste
  • Do you want your output to be a 32-bit hash code, or a 64-bit hash code? – Jim Mischel Dec 16 '16 at 16:24
  • @JimMischel I want my output to be in the larger domain (e.g. 64-bit). I just noted 64-bit as example here btw; eventually I want to move to f.ex. 256 bits to avoid possible collisions. – atlaste Dec 16 '16 at 16:26
  • Can the downvoter explain please? – atlaste Dec 16 '16 at 16:32
  • I didn't downvote, but I did consider it for a moment: the question is a bit confusing, had to take a moment to wrap my head around it. Anyhow, [about combining hashes](http://stackoverflow.com/q/5889238/477453). The nice thing is that you don't need to worry about "a xor a = 0" (if the IDs are unique), and that commutativity is actually your friend there. – SáT Dec 16 '16 at 16:35
  • For a start, you could square the numbers and add the squares (all modulo 64 bits) – joop Dec 16 '16 at 17:46
  • @SáT I see your point; I've tried explaining it a bit more. Yes, you're right, commutativity is my friend, which is why I've been wondering about xor. However, that still gives issues about how to "blow up" the 32-bit hash code to something bigger. With regards to that, XOR is actually a bad option, because it doesn't 'generate' bits, and ADD and MUL which are also commutative actually do. Still, because of the reasons I describe in the additional info, it might not be that simple... – atlaste Dec 16 '16 at 20:53
  • @joop, squaring will basically ruin your key space. I've added an explanation in the question. – atlaste Dec 16 '16 at 20:55
  • @atlaste well: squaring + adding + modulo 64 bits does have the property of commutativity working in your advantage, so all the permutations of your set will hash to the same value. Instead of squaring you could use Zobrist hashing, and instead of summing you could use XOR. – joop Dec 19 '16 at 09:59
  • BTW : 900000 is the *sum* of two squares: `(900*900 + 300*300)` – joop Dec 19 '16 at 14:13
  • @joop Yes, I guess you can create just about any number using the sum of squares. I understand the algorithm you explain... What I'm mostly worried about is the number of collisions I will get using methods like this. I'm not sure if it's clear from the description, but I intend to use the hash code as a checksum. – atlaste Dec 19 '16 at 14:17

3 Answers


I just checked:

  • using 32-bit hashes
  • in a 64K table of arrays
  • with 64K items (load factor = 100%)
  • of 8-bit values (unsigned characters)
  • (array size 4..63)
  • hash function := cnt + sum(cube(arr[i]))
  • or := sum(square(zobrist[arr[i]]))
  • Zobrist works a bit better (but the array needs further randomisation),
  • and collisions were not more than expected for an optimal hash function.
  • to avoid recomputation (space-time tradeoff), I actually store the hash values within the objects
  • since collisions are a fact of life, you could postpone the sorting to the moment when you actually need it for the final comparison (when the chain length starts to grow above 1)

#include <stdio.h>
#include <stdlib.h>

struct list {
        struct list *next;
        unsigned hash;
        unsigned short cnt;
        unsigned char *data;
        };

struct list *hashtab[1<<16] = {NULL, };
#define COUNTOF(a) (sizeof a / sizeof a[0])
unsigned zobrist[256] = {0,};
/*************************/
unsigned hash_it(unsigned char *cp, unsigned cnt)
{
unsigned idx;
unsigned long long hash = 0;

for(idx=0; idx < cnt; idx++) {
#if 0   /* cube */
        hash += (cp[idx] * cp[idx] * cp[idx]);
#else
        unsigned val;
        val = zobrist[cp[idx]];
        hash += (val * val);
#endif
        }
#if 0   /* as a tie-breaker: add the count (this avoids pythagorean triplets but *not* taxi-numbers) */
hash += cnt;
#endif
return hash;
}
/*************************/
struct list *list_new(unsigned cnt){
struct list *p;
unsigned idx;

p = malloc( sizeof *p + cnt);
p->data = (unsigned char*)(p+1);
p->cnt = cnt;
p->next = NULL;

for(idx=0; idx < cnt; idx++) {
        p->data[idx] = 0xff & rand();
        }
p->hash = hash_it(p->data, p->cnt);
return p;
}
/*************************/
void do_insert(struct list *this)
{
struct list **pp;
unsigned slot;

slot  = this->hash % COUNTOF(hashtab);
for (pp = &hashtab[slot]; *pp; pp = &(*pp)->next) {;}
*pp = this;
}
/*************************/
void list_print(struct list *this)
{
unsigned idx;
if (!this) return;

printf("%lx data[%u] = ", (unsigned long) this->hash, this->cnt);

for (idx=0; idx < this->cnt; idx++) {
        printf("%c%u"
        , idx ? ',' : '{' , (unsigned int) this->data[idx] );
        }
printf("}\n" );
}
/*************************/
unsigned list_cnt(struct list *this)
{
unsigned cnt;
for(cnt=0; this; this=this->next) { cnt++; }
return cnt;
}
/*************************/
unsigned list_cnt_collisions(struct list *this)
{
unsigned cnt;
for(cnt=0; this; this=this->next) {
        struct list *that;
        for(that=this->next; that; that=that->next) {
                if (that->cnt != this->cnt) continue;
                if (that->hash == this->hash) cnt++;
                }
        }
return cnt;
}
/*************************/
int main(void)
{
unsigned idx, val;
struct list *p;
unsigned hist[300] = {0,};

        /* NOTE: you need a better_than_default random generator
        ** , the zobrist array should **not** contain any duplicates
        */
for (idx = 0; idx < COUNTOF(zobrist); idx++) {
        do { val = random(); } while(!val);
        zobrist[idx] = val;
        }

        /* a second pass will increase the randomness ... just a bit ... */
for (idx = 0; idx < COUNTOF(zobrist); idx++) {
        do { val = random(); } while(!val);
        zobrist[idx] ^= val;
        }
        /* load-factor = 100 % */
for (idx = 0; idx < COUNTOF(hashtab); idx++) {
        do {
          val = random();
          val %= 0x40;
        } while(val < 4); /* array size 4..63 */
        p = list_new(val);
        do_insert(p);
        }

for (idx = 0; idx < COUNTOF(hashtab); idx++) {
        val = list_cnt( hashtab[idx]);
        hist[val] += 1;
        val = list_cnt_collisions(hashtab[idx]);
        if (!val) continue;
        printf("[%u] : %u\n", idx, val);
        for (val=0,p = hashtab[idx]; p; p= p->next) {
                printf("[%u]: ", val++);
                list_print(p);
                }
        }

for (idx = 0; idx < COUNTOF(hist); idx++) {
        if (!hist[idx]) continue;
        printf("[%u] = %u\n", idx, hist[idx]);
        }

return 0;
}
/*************************/

Output histogram (chain lengths, 0 := empty slot):

$ ./a.out
[0] = 24192
[1] = 23972
[2] = 12043
[3] = 4107
[4] = 1001
[5] = 181
[6] = 34
[7] = 4
[8] = 2

Final note: instead of the sum of squares of the Zobrist[] entries you could just as well XOR them together (assuming the entries are unique).
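
A minimal sketch of that XOR-of-Zobrist variant (shown in C# for brevity; the table setup and the names are illustrative only):

static readonly uint[] zobrist = BuildZobristTable();   // one random 32-bit word per element value

static uint[] BuildZobristTable()
{
    // ideally all entries are distinct and nonzero (this sketch does not enforce that)
    var rng = new System.Random(12345);
    var table = new uint[256];
    for (int i = 0; i < table.Length; ++i)
        table[i] = (uint)rng.Next(1, int.MaxValue) ^ ((uint)rng.Next() << 16);  // also fill the high bits
    return table;
}

static uint ZobristChecksum(byte[] elements)
{
    uint h = 0;
    foreach (var e in elements)
        h ^= zobrist[e];   // XOR is commutative, so element order does not matter
    return h;
}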

Extra final note: the C stdlib rand() function can be unusable. RAND_MAX could be only 15 bits: 0x7fff (32767). For populating the zobrist table you need more values. This can be done by XORing some extra (rand() << shift) into the higher bits.


New results, using (a sample from) a very large source domain (32 elements * 8 bits), hashing it to 32-bit hash keys and inserting into a hash table of 1<<20 slots.

Number of elements 1048576 number of slots 1048576
Element size = 8bits, Min setsize=0, max set size=32
(using Cubes, plus adding size) Histogram of chain lengths:
[0] = 386124 (0.36824)
[1] = 385263 (0.36742)
[2] = 192884 (0.18395)
[3] = 64340 (0.06136)
[4] = 16058 (0.01531)
[5] = 3245 (0.00309)
[6] = 575 (0.00055)
[7] = 78 (0.00007)
[8] = 9 (0.00001)

This is very close to optimal: for a 100% loaded hash table, the first two entries in the histogram (empty slots and slots with exactly one element) should be equal, and in the perfect case both are 1/e.
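
Those "perfect case" numbers can be computed directly: with n items hashed uniformly into n slots, the chain-length distribution approaches a Poisson distribution with mean 1, so the expected fraction of slots holding k items is e^-1 / k!. A quick C# check of those reference values:

using System;

class PoissonCheck
{
    static void Main()
    {
        double p = Math.Exp(-1.0);               // P(0) = 1/e ≈ 0.36788
        for (int k = 0; k <= 8; ++k)
        {
            Console.WriteLine("[{0}] = {1:F5}", k, p);
            p /= k + 1;                          // P(k+1) = P(k) / (k + 1)
        }
    }
}

The printed values (0.36788, 0.36788, 0.18394, 0.06131, 0.01533, ...) match the measured fractions above to within rounding.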

joop
  • Thank you so far for the extensive answer. I'll need to take some time to study it in more detail; will get back to you. – atlaste Dec 20 '16 at 08:31
  • The way I understand this implementation, that would imply having a zobrist table of 2^32 entries, since that is my key space... And in truth, my indices actually do range in the billions. Still, the idea of testing it with fewer bits is a good one (I should have thought about that...); I did some experiments of my own on 8 bits and found that about 10% of the cases will have a collision for the x^4 case using a XOR (testing all series of 4 integers). I guess that's still too much, but this does give me a start to work with. It also shows that ADD outperforms XOR. – atlaste Dec 20 '16 at 09:21
  • In the case of 2^32 entries the zobrist table would probably become too large (32GB). Squaring or cubing is the way to go. (I guess) – joop Dec 20 '16 at 09:47
  • Yes. I've just been considering using a multiply-by-constant as well. So instead of using `hash=[x*x]`, I can use `y=x*[some number]` and then `hash = (x*x)%(2^32) | ((y*y) % 2^32)<<32` . In my experiments, it works about as good as the simple squaring function, but enables me to use more bits (e.g. it scales better with larger keys). I've tested it on 16 bits, and it works just as good as Xor and Add. While that's all great, I still don't have a complete understanding of the math behind the collisions; getting there... – atlaste Dec 20 '16 at 10:16
  • Cubing works better than squaring, empirically. The math is too difficult to explain to a mathematician anyway. – joop Dec 20 '16 at 10:32
  • yes, marginally. The difference is 33% according to my tests. I'll post my test code in an answer along with some of my findings. – atlaste Dec 20 '16 at 10:38

In my case, I want these things to generate the same hash value - but I also want to minimize the chance that -say- "abc" also generates the same hash value as -say- "abd".

Bitwise-XOR actually guarantees that: if two sets of the same size are the same except for one element, then they will necessarily have different bitwise-XORs, since the two XORs differ by exactly a XOR b, where a and b are the elements in which the sets differ (and a XOR b is nonzero whenever a ≠ b). (Incidentally, the same is true of summation with wraparound: if two sets of the same size are the same except for one element, then they will necessarily have different sums-with-wraparound.)

So if you use bitwise-XOR for the bottom 32 bits, then you essentially have 32 "extra" bits to try to reduce collisions further: to reduce cases where two sets of different sizes have the same checksum, or cases where two sets that differ by two or more elements have the same checksum. A relatively simple approach is to select a function f that maps from 32-bit integers to 32-bit integers, and then apply bitwise-XOR to the result of applying f to each element. Major things you'd want from f:

  • it should be cheap and simple to implement.
  • it should map zero to a nonzero value (so that { 1, 2, 3 } and { 0, 1, 2, 3 } will have different checksums).
  • the mapping should not involve reorganizing bits in a constant way (e.g. bit-shifts): reorganize_bits(a) XOR reorganize_bits(b) is equivalent to reorganize_bits(a XOR b), so it doesn't add any independent information to your checksum.
  • the mapping shouldn't involve XOR-ing with constants, for the same reason.

Above, joop suggests f(a) = a^2 mod 2^32, which seems decent to me, except for the issue with zero. Maybe f(a) = (a + 1)^2 mod 2^32?
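
For instance, a minimal sketch (the name Checksum and the exact layout are illustrative): the bottom 32 bits are the plain XOR of the IDs, and the top 32 bits are the XOR of f applied to each ID.

static ulong Checksum(uint[] ids)
{
    uint low = 0, high = 0;
    foreach (var id in ids)
    {
        low ^= id;                                    // plain XOR of the elements
        high ^= unchecked((id + 1) * (id + 1));       // f(a) = (a + 1)^2, wrapping mod 2^32
    }
    return ((ulong)high << 32) | low;
}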

ruakh

This answer is just for completeness.

From the solution by @joop, I noticed that he uses fewer bits than I do. He also proposed using x^3 instead of x^2, which makes a huge difference.

In my code, I use 8-bit IDs for the test, because of the resulting small key space. This means we can simply test all chains with a length of up to 4 or 5 IDs. The hash space is 32 bits. The (C#) code is very simple:

static void Main(string[] args)
{
    for (int index = 0; index < 256; ++index)
    {
        CreateHashChain(index, 4, 0);
    }

    // Create collision histogram:
    Dictionary<int, int> histogram = new Dictionary<int, int>();
    foreach (var item in collisions)
    {
        int val;
        histogram.TryGetValue(item.Value, out val);
        histogram[item.Value] = val + 1;
    }

    foreach (var item in histogram.OrderBy((a) => a.Key))
    {
        Console.WriteLine("{0}: {1}", item.Key, item.Value);
    }
    Console.ReadLine();
}

private static void CreateHashChain(int index, int size, uint code)
{
    uint current = (uint)index;

    // hash
    uint v = current * current;
    code = code ^ v;

    // recurse for the rest of the chain:
    if (size == 1)
    {
        int val;
        collisions.TryGetValue(code, out val);
        collisions[code] = val + 1;
    }
    else
    {
        for (int i = index + 1; i < 256 - size; ++i)
        {
            CreateHashChain(i, size - 1, code);
        }
    }
}

private static Dictionary<uint, int> collisions = new Dictionary<uint, int>();

Now, it's all about the hash function. I'll just write down some of the things I tried, along with the findings:

x^2

Code:

// hash
uint v = current * current;
code = code ^ v;

Results: lots and lots and lots of collisions. In fact, there isn't a single case that collides fewer than 3612 times. Obviously we're using only 16 bits, so this is easy to explain. Anyway, the result is Pretty Bad.

x^3

Code:

// hash
uint v = current * current * current;
code = code ^ v;

Results:

1: 20991
2: 85556
3: 235878
4: 492362
5: 841527
6: 1220619
7: 1548920
[...]

Still pretty bad, but again, we're only using 24 bits of the key space so collisions are bound to happen. Also, it's much better than using x^2.

x^4

Code:

// hash
uint v = current * current;
v = v * v;
code = code ^ v;

Results:

1: 118795055
2: 20402127
3: 2740658
4: 329621
5: 38453
6: 4420
7: 495
8: 47
9: 12

As expected, this is much better, and obviously this is because we're now using the full 32 bits.

Introducing y

Another way to introduce a larger key space is to introduce another variable -say- y which is a function of x. The idea behind this is that x^n for small values of x will result in small numbers, which therefore have a high chance of colliding; we can offset this by ensuring that y is a large number when x is small, and by using bit arithmetic to combine the two hash functions. The easiest way to do this is to flip all bits:

// hash
uint x = current;
uint y = (255 ^ current);

uint v1 = (UInt16)(x * x * x);
uint v2 = (UInt16)(y * y * y);
code = code ^ v1 ^ (v2 << 16);

This will result in the following:

1: 154971022
2: 6827322
3: 235081
4: 7554
5: 263
6: 9
7: 1

Interestingly, this immediately gives much better results than all the previous approaches. It also raises the question whether the 16-bit cast makes any sense. After all, x^3 will result in a 24-bit space with large gaps for small values of x. Combining that with another 24-bit space that's shifted will make better use of the available 32 bits. Note that we should still shift by 16 (and not 8!) for the same reason.
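
The code for that variant would look roughly like this (the same as above, but with the UInt16 casts dropped and the shift kept at 16):

// hash
uint x = current;
uint y = (255 ^ current);

uint v1 = x * x * x;
uint v2 = y * y * y;
code = code ^ v1 ^ (v2 << 16);

This gives: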

1: 162671251
2: 3276751
3: 45277
4: 473
5: 5

Multiply by constant (final result)

Another way to blow up the key-space for y is to multiply-and-add. The code now becomes:

uint x = current;
uint y = (255 ^ current);
y = (y + 7577) * 0x85ebca6b;

uint v1 = (x * x * x);
uint v2 = (y * y * y);
code = code ^ v1 ^ (v2 << 8);

While this might not seem like an improvement, it has the advantage that we can easily scale 8-bit sequences to any arbitrary n-bit sequence using this trick. I shift by 8 because I don't want the bits of v1 to interfere too much with the bits of v2. This gives the following result:

1: 162668435
2: 3277904
3: 45459
4: 464
5: 5

This is actually pretty good! We only have a 2% chance to collide, given all possible chains of 4 IDs. Also, if we have larger chains, we can add more bits by using the same trick we performed with v2 (adding 8 bits for each additional hash code, so a 256-bit hash should be able to accommodate chains of about 29 8-bit IDs).

The only question that remains is: how can we test this? As @joop pointed out, the math is actually quite complex; random sampling might actually prove to be a solution for larger numbers of bits and larger chains.

atlaste
  • Note: instead of XORing you could sum the terms in the hash function. This is a more *natural* way to populate the higher bits. Shifting or multiplying the terms only causes the higher bits to be a function of the lower bits. Note: I don't understand your histograms. Why is entry `0: xxxx` absent? – joop Dec 20 '16 at 11:37
  • The histogram is the number of occurrences, so 1 is the case without collisions, 2 with 1 collision, etc. As for XOR/ADD, I tested summing and was surprised to find it was actually worse than XOR. – atlaste Dec 20 '16 at 11:39