1

I have a set (static, known in compile time) of about 2 million values, 20 bytes each. What I need is a fast O(1) way to check if a given value is in this set. It seems that perfect hash function with a bit array is ideal for this, but I can't find a simple way to create it. There are some utilities such as gperf, but they are too complicated. Also, in my case it's not necessary to have a close to 100% load factor, even 10% is enough, but with guarantee of no collisions. Another requirement for this function is simplicity, without many conditions: it will run on GPU. What would you advice for this case?

aplavin
  • 2,199
  • 5
  • 32
  • 53
  • N=2 million is quite big. You will not find a perfect hash function, unless you know something about the contents of the 20 bytes. If the table is static you could construct the hashtable at startup time, like I showed in another question here. (need to dig that up) – wildplasser Jun 24 '12 at 11:21
  • I know the set in compile-time (it almost never changes), and need the way to create a simple perfect hash function. – aplavin Jun 24 '12 at 11:27
  • Why must it be perfect? What if it is *impossible* to find a perfect hash for 2M values? – wildplasser Jun 24 '12 at 11:31
  • Perfect hash is always possible (if I've understood its idea correctly...) – aplavin Jun 24 '12 at 12:51

2 Answers2

2

See my answer here. The problem is a bit different, but the solution could be tailored to suit your needs. The original uses a 100% load factor, but that could be easily changed. It works by shuffling the array in-place at startup-time (this could be done at compile time, but that would imply compiling generated code).

WRT the hashfunction: if you don't know anything about the contents of 20byte objects, any reasonable hashfunction (FNV, Jenkins, or mine) will be good enough.

Community
  • 1
  • 1
wildplasser
  • 43,142
  • 8
  • 66
  • 109
0

After reading more information about perfect hashing, I've decided not to try implementing it, and instead used cuckoo hashtable. It's much simpler and requires at most 2 accesses to the table (or any other number >1, the most used are 2..5) instead of 1 for perfect hashing.

aplavin
  • 2,199
  • 5
  • 32
  • 53