Hash function to map unique array of int to an index of range 0..n

Question

I need to map an array of sorted integers, with the length varying from 1 to 4 at max, to an index of a global array. Like [13,24,32] becoming a number in range 0..n, and no other array mapping to that same number. The quantity of arrays is a few millions, and the mapping has to be "unique" (or at least with very few collisions for smaller arrays), because these arrays represent itemsets and I use the k-1 smaller itemset to build the one of size k.

My current implementation uses a efficient hash function that produces a double between 0..1 for an array, and I store the itemsets in STL Map, with the double as the key. Got from this article:

N. D. Atreas, C. Karanikas “A faster pattern matching algorithm based on prime numbers and hashing approximation “, 2007

I'm going to implement a parallel version of this in CUDA, so I can't use something like STL Map. I could create myself easily a self-balancing binary search tree as a map in the GPU global memory, but that would be really slow. So in order to reduce global memory access to a minimum, I need to map the itemset to a huge array in the global memory.

I've tried to cast the double as a long integer and hash it with a 64 bit hash function, but it produces some collision, as expected.

So, I ask if there's a "unique" hash function for doubles between 0..1, or for array of integers from size 1..4, that gives an unique index for a table of size N.

You're looking for a [minimal perfect hash function](http://en.wikipedia.org/wiki/Perfect_hash_function#Minimal_perfect_hash_function). — j_random_hacker, Oct 04 '14 at 20:22
But don't I need to provide an input to build the minimal hash function ? My arrays will vary, and they are combinations of size 1 to 4 from an array of single items which can also vary. — liwing, Oct 04 '14 at 21:15
I don't know much more about them than the name, I'm afraid. I expect it's very time-consuming (perhaps impossible) to build such a function for millions of elements, though I could be wrong. — j_random_hacker, Oct 04 '14 at 21:59

score 1 · Answer 1 · answered Oct 05 '14 at 02:17

1

If I make this an assumption about your arrays:

each item of your arrays (such as 13) are 32-bit integers.

Then what you ask is impossible.

You have at least 2^(32*4) possible values, or 128 bits. And you are trying to pack those into an array of much smaller size (20 bits for one million entries). You cannot do this without collisions (or there being some agreement amongst the elements, such as each element choosing the "next available index", but that's not a hash).

answered Oct 05 '14 at 02:17

Ian

841
5
10

Those are possible values, but there will be only some tens of millions. And actually, those array elements can be a short int instead, because the values aren't going to surpass 10 thousand. I'm not really looking for a perfect hash, because that needs input from the data, but something like the algorithm in the article I mentioned, which only produces collisions after many hashes. For my test data, it works like a perfect hash, because I tested it with many test cases and there were no collisions. The problem is that this hash is a double, and I was hoping for one who'd give me an index. – liwing Oct 05 '14 at 03:48

Hash function to map unique array of int to an index of range 0..n

1 Answers1