I'm going to be using a key:value store and would like to create non-collidable hashes in Perl. Is there a Perl module or function I can use to generate a non-collidable (perfect) hash function or table, something like gperf? I already know my range of input values.
- Ah. Reading comprehension fail. Sorry about that... – Amadan Oct 21 '11 at 01:23
- Nah, totally cool. Thanks for giving me a better understanding of how quickly one can build hashes in Perl :-) I might just end up using gperf with XS. – EhevuTov Oct 21 '11 at 01:28
2 Answers
I can't find a pure Perl solution; the closest is Reini Urban's examination of using perfect hashes with a type system. If you were to do it in XS, CMPH (the C Minimal Perfect Hashing Library) might be more apropos than gperf. CMPH seems to be optimized for non-trivial key sizes and run-time generation.
The cost of generating a perfect hash function at runtime in Perl might swamp the value of using it. To gain any benefit, you'd want it compiled and cached. So again, writing an XS module that generates the function from a fixed key list at XS compile time might be the best way to go.
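For a sense of what run-time generation involves, here is a toy pure-Perl sketch of the "hash and displace" construction that CMPH's CHD algorithm builds on. The key set, the h() mixing function, and the brute-force seed search are all simplified stand-ins for illustration, not CMPH's actual implementation:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy "hash and displace" minimal perfect hash over a fixed key set.
# The mixing function and the linear seed search are deliberately naive.
sub h {
    my ($key, $seed) = @_;
    my $v = $seed;
    $v = ($v * 31 + $_) % 4294967296 for unpack 'C*', $key;
    return $v;
}

my @keys = qw(alpha bravo charlie delta echo);  # stand-in key set
my $m    = @keys;                               # slots == keys, i.e. minimal

# Level 1: group keys into buckets with a fixed first-level seed.
my @bucket;
push @{ $bucket[ h($_, 1) % $m ] }, $_ for @keys;

# Level 2: per bucket, largest first, find a seed whose second-level
# hash drops every key in the bucket into a still-free slot.
my @seed = (0) x $m;
my @slot = (0) x $m;   # 0 marks a free slot; example keys are never "0"

for my $i (sort { @{ $bucket[$b] // [] } <=> @{ $bucket[$a] // [] } } 0 .. $m - 1) {
    my $ks = $bucket[$i] or next;
    SEED: for (my $s = 2; ; $s++) {
        my %trial;
        for my $k (@$ks) {
            my $pos = h($k, $s) % $m;
            next SEED if $slot[$pos] or $trial{$pos};
            $trial{$pos} = $k;
        }
        $seed[$i] = $s;                          # this seed works: commit it
        $slot[$_] = $trial{$_} for keys %trial;
        last SEED;
    }
}

# Lookup is two hash evaluations plus one equality check (the check
# only matters for keys outside the original set).
sub lookup {
    my ($key) = @_;
    my $pos = h($key, $seed[ h($key, 1) % $m ]) % $m;
    return $slot[$pos] eq $key ? $pos : undef;
}

printf "%-8s -> slot %d\n", $_, lookup($_) for @keys;
```

The point of the sketch is the cost model: the seed search is the expensive part, and it is pure construction-time work, which is why you'd want to run it once at build time and cache or compile the resulting @seed table rather than redo it on every start-up.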
Out of curiosity, how big is your data and how many keys does the set contain?

- I'm JUST getting into how hashing works, so at the moment this seems like the way to go. I'm going to be using it as a key in a key:value store, possibly LevelDB. Basically, I need a key:value or multi-key:value store to do simple de-duplication counts (aggregation) on a key in a high-write-rate real-time system (a sketch of this follows the comments). The key is counted for a 24-hour period, then that aggregate is dumped to a CSV file and the store is deleted for that day. – EhevuTov Oct 21 '11 at 03:30
- The data I want to store is about 1 KB per record, and the records total over 2 GB a day. The keys are pretty long: about 30 characters plus a few integers. I don't know if that's doable. – EhevuTov Oct 21 '11 at 04:10
- @EhevuTov I would highly recommend that you first profile your system's performance with your stock database before tackling a perfect hash algorithm. Unless your data is pathological (and I suspect the stock hash algorithm for LevelDB is pretty good), hash collisions are unlikely to be your bottleneck. – Schwern Oct 21 '11 at 08:16
- @EhevuTov Reading up on LevelDB, the most glaring issue with it from a performance perspective is that "only a single process (possibly multi-threaded) can access a particular database at a time", which greatly limits your access to the data and your ability to work in parallel or throw more hardware at the problem. You may wish to start with a less bare-bones database. – Schwern Oct 21 '11 at 08:22
- Well, it's a heavy write-I/O system, so if I throw more threads at the I/O, I'd expect that to decrease rather than increase performance. I would thread the hashing with some kind of hash pool. I'm looking into SQLite and MongoDB as alternatives, but I doubt they'd be as fast. – EhevuTov Oct 21 '11 at 18:30
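For what it's worth, the daily aggregation described in these comments needs nothing more exotic than a plain Perl hash. A minimal sketch, with the input format (key as the first CSV field) and the output file name assumed for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Daily de-duplication counts with a plain Perl hash.
my %count;

while (my $line = <STDIN>) {           # one record per line
    chomp $line;
    my ($key) = split /,/, $line, 2;   # take the key, ignore the rest
    next unless defined $key && length $key;
    $count{$key}++;                    # aggregate by key
}

# End of the 24-hour window: dump the aggregate, then start fresh.
open my $csv, '>', 'counts-today.csv' or die "open: $!";
print {$csv} "key,count\n";
printf {$csv} "%s,%d\n", $_, $count{$_} for sort keys %count;
close $csv or die "close: $!";
```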
You might be interested in Judy. It's not a hash table implementation, but it's supposedly a very efficient associative array implementation.
Mind you, Perl's hashes are very well tuned, and they automatically get rehashed when a bucket starts growing large.
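On the rehash concern: if the approximate key count is known up front, Perl can pre-size a hash by assigning to keys() as an lvalue, which allocates the buckets once instead of splitting them during inserts. A small sketch, with 200_000 as an assumed daily key volume:

```perl
use strict;
use warnings;

my %count;
keys(%count) = 200_000;   # assumed daily key volume; Perl rounds the
                          # bucket count up to the next power of two

# Inserts now proceed without incremental bucket growth on the hot path.
$count{$_}++ for qw(alpha bravo alpha charlie bravo alpha);
```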

- Thanks for the heads-up. I'll maybe use Judy for another project. I'm working on a real-time system, so a rehash isn't good for me. It looks like I might need to make worker processes for hashing as well; not too sure how to do that yet. – EhevuTov Nov 04 '11 at 00:05
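On the worker-process idea in that last comment: one common shape is to shard the input and fork one counter per shard, each writing a partial CSV that a later step merges. A hypothetical sketch using the CPAN module Parallel::ForkManager, with the file layout invented for illustration:

```perl
use strict;
use warnings;
use Parallel::ForkManager;    # CPAN module; assumed installed

my @shards = glob 'incoming/*.log';        # one input file per shard
my $pm = Parallel::ForkManager->new(4);    # at most 4 workers at once

for my $file (@shards) {
    $pm->start and next;      # parent: spawn a child, move to next shard

    # Child: count keys in its own shard only, no shared state.
    my %count;
    open my $in, '<', $file or die "$file: $!";
    while (<$in>) {
        my ($key) = split /,/, $_, 2;
        $count{$key}++ if defined $key && length $key;
    }
    close $in;

    # Each worker writes a partial CSV; a later pass merges them.
    open my $out, '>', "$file.counts.csv" or die "$file.counts.csv: $!";
    print {$out} "$_,$count{$_}\n" for sort keys %count;
    close $out or die "close: $!";

    $pm->finish;              # child exits here
}
$pm->wait_all_children;
```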