I have many integers in range [0; 2^63-1]. There is only 10^8 integers, however. There is no duplicates. Full list is known at compile-time but it is just unique random numbers. These numbers never changes.
To store one integer explicitly, 8 bytes required, and there is associated 1-byte values, so explicit storing requires about 860 MB.
So I want to find minimal perfect hash function to map each of 10^8 integers from [0;2^63-1] to [0;10^8-1]. I should find this function only once, data never changes, and function can be complicated. But it should be minimal, perfect, and calculating should be fast. How I can do this better? Maybe it is possible to find and use some subsequences if they happens?
Thanks.

- 163
- 1
- 4
-
Full list known at compile-time? My advice then would be to 'manually' assign the numbers yourself, and then write a script to spit out a static declaration of a map in your desired programming language. If it never, ever changes, using a static data structure to perfectly map the values would be your ideal solution. I say 'manually' with inverted commas because you clearly aren't going to do it by hand. See other comments and answers for ideas of what tools can do the assigning for you. – Jul 19 '11 at 07:05
2 Answers
Let your computer do the work for you:
http://www.gnu.org/software/gperf/
Quote: "GNU gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only. "

- 1,801
- 3
- 20
- 32
-
3but for this, [CMPH](http://cmph.sourceforge.net/) would be better as it was conceived to create minimal perfect hash functions for very large sets of keys. – Dan D. Jul 19 '11 at 07:00
-
I'm working on an algorithm and Java implementation that needs less than 1.6 bits per key.
Previously, I have implemented a minimal perfect hash function tool in Java that needs less than 2.0 bits per key.
Other algorithms are implemented in CMPH. For example CHD needs about 2.06 bits per key by default. It can be configured to use less space, but generation is then slower.
Update: There is now a paper about my invention called "RecSplit: Minimal Perfect Hashing via Recursive Splitting"

- 48,905
- 14
- 116
- 132
-
I'm working on an improved algorithm that needs less than 1.58 bits per entry. – Thomas Mueller Dec 12 '15 at 11:44
-
Do you have anywrite up for your code. I was trying to implement it for Long data types but was getting indexoutofbounds error – sss999 Nov 22 '17 at 01:54
-
@sss999 there is not much documentation currently; you could read the test cases. Maybe create a [issue](https://github.com/thomasmueller/minperf/issues) with a test case and exception, so I can look into what the problem could be – Thomas Mueller Nov 22 '17 at 07:10
-
I was looking at the LongCollection test case since i need i have long values as input. Once I have the buff ref, can you tell me where are you storing the hash values for the input and how can i read them? – sss999 Nov 22 '17 at 08:23
-
@sss999 Hash values are not stored. I think it makes sense to ask a new question, and provide the code you have used and what you need. You can then add a comment for me, so I see the question. – Thomas Mueller Nov 22 '17 at 13:31