5

I am looking for a map whose keys are fixed during initialization and which offers faster look-ups. It does not need to support adding or updating elements later. Is there some algorithm which looks at the list of keys and formulates a function so that look-ups are faster later? In my case, the keys are strings.

Update:

The keys are not known at compile time, only at initialization time of the application. There won't be any further insertions later, but there will be lots of look-ups, so I want the look-ups to be optimized.

balki
  • 26,394
  • 30
  • 105
  • 151
  • 3
    Look at [gperf](http://www.gnu.org/s/gperf/), it facilitates perfect hashing at compile time when all the keys for the hash table are known. – Seth Carnegie Dec 08 '11 at 09:50

4 Answers

3

CMPH may be what you're looking for. Basically this is gperf without requiring the key set at compile time.
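
For a rough idea of the shape of the API, here is a sketch loosely based on the example in the CMPH documentation; the adapter and algorithm names used here (e.g. `CMPH_CHD`) should be checked against the version you actually have:

```cpp
#include <cmph.h>     // CMPH is a C library; it links fine from C++
#include <cstring>
#include <vector>

int main() {
    // The fixed key set, known at initialization time.
    std::vector<const char*> keys = {"alpha", "beta", "gamma", "delta"};

    // Build a minimal perfect hash function over exactly these keys.
    cmph_io_adapter_t* source =
        cmph_io_vector_adapter(const_cast<char**>(keys.data()),
                               static_cast<cmph_uint32>(keys.size()));
    cmph_config_t* config = cmph_config_new(source);
    cmph_config_set_algo(config, CMPH_CHD);   // one of several available algorithms
    cmph_t* mph = cmph_new(config);
    cmph_config_destroy(config);

    // Lookup: returns an index in [0, keys.size()); map it to your values via an
    // array. Unknown keys also map somewhere, so keep the keys stored and do one
    // final comparison to reject non-members.
    const char* probe = "gamma";
    unsigned id = cmph_search(mph, probe,
                              static_cast<cmph_uint32>(std::strlen(probe)));

    cmph_destroy(mph);
    cmph_io_vector_adapter_destroy(source);
    return static_cast<int>(id);
}
```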

Though of course std::unordered_map, as standardized in C++11, might do just as well, though possibly with a few collisions.
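
A minimal sketch of that "fill once at initialization, then only look up" usage, with purely illustrative names:

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Build the table once at application start-up.
std::unordered_map<std::string, int>
build_table(const std::vector<std::pair<std::string, int>>& entries) {
    std::unordered_map<std::string, int> table;
    table.reserve(entries.size());   // avoid rehashing while filling the table
    for (const auto& e : entries) table.insert(e);
    return table;
}

// Read-only lookups afterwards: average O(1) hashing plus one string compare.
const int* lookup(const std::unordered_map<std::string, int>& table,
                  const std::string& key) {
    auto it = table.find(key);
    return it == table.end() ? nullptr : &it->second;
}
```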

Since you look up strings, a trie (any of the various trie flavours: crit-bit trees or whatever funky names they have) may also be worth looking into, especially if you have many keys. There are plenty of free trie implementations available.
The advantage of tries is that they can compress their keys (shared prefixes are stored only once), so they use less memory, which makes it more likely that the data is in cache. The access pattern is also less random, which is again cache-friendly. A hash table must store the value plus the hash, and it indexes more or less randomly (not randomly, but unpredictably) into memory. A trie/trie-like structure ideally only needs one extra bit per node to distinguish a key from its common prefix.
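
To make the structure concrete, here is a deliberately naive trie sketch (illustrative only; a real crit-bit or radix trie compresses single-child chains and would not use a per-node std::map):

```cpp
#include <map>
#include <memory>
#include <string>

// Each node maps the next character to a child node.
struct TrieNode {
    bool is_key = false;
    std::map<char, std::unique_ptr<TrieNode>> children;
};

void trie_insert(TrieNode& root, const std::string& key) {
    TrieNode* node = &root;
    for (char c : key) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->is_key = true;
}

bool trie_contains(const TrieNode& root, const std::string& key) {
    const TrieNode* node = &root;
    for (char c : key) {
        auto it = node->children.find(c);
        if (it == node->children.end()) return false;  // diverged before the end
        node = it->second.get();
    }
    return node->is_key;   // the full key must end at a marked node
}
```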

(Note, by the way, that O(log(N)) may quite possibly be faster than O(1) in such a case, because big-O says nothing about constant factors or cache behaviour.)

Damon
  • 67,688
  • 20
  • 135
  • 185
  • A trie is much slower than std::unordered_map for strings (i.e. std::string, aka std::basic_string). I have tested this with different optimization flags, and there are many reports on the Internet about it. – cppist May 01 '13 at 17:39
  • @cppist: This depends on the implementation and on the dataset (both its size and the actual data). `std::unordered_map` is a hash map. It is `O(1)` with respect to the actual lookup, but `O(N)` with respect to the string length, and it must do an additional `O(N)` comparison. A crit-bit tree or trie is `O(log(N))` with respect to both the key length and the number of keys. It needs no final compare, it need not touch data after the first differing byte, and it is more cache-friendly, touching fewer pages. So the answer isn't all that easy; a hash _may_ indeed not be the fastest tool. – Damon May 10 '13 at 21:42
  • 1
    N is the number of words, C the number of collisions, and S the string length. A trie searches for a string in T = O1(S). A hash set searches in H = O2(S) + O3(C). But O1(S) is much bigger than O2(S): a hash set uses simple arithmetic operations over sequential data, whereas a trie uses multiple dereferences and if-branches. Even if dereferencing and branching were faster than simple arithmetic, common processors work better with sequential data than with scattered data. A well-made, straightforward trie really is slower than unordered_map (i.e. a hash set), at least for strings of char. – cppist May 12 '13 at 04:00
1

Note that these are distinct goals: do you need an upper bound on lookup time, do you need fast typical lookups, or do you need the fastest possible lookup, no questions asked? The last one will cost you, and the first two may be conflicting goals.

You could attempt to create a perfect hash function based on the input (i.e. one that has no collisions within the input set). This is a more or less solved problem (e.g. this, this). However, such tools usually generate source code and may spend significant time generating the hash function.

A modification of this would be to use a generic hash function (e.g. shift-multiply-add) and do a brute-force search over suitable parameters.

This has to be traded off against the cost of a few string comparisons (which aren't terribly expensive if you don't have to collate).
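
As a hedged sketch of that brute-force parameter search (all names and constants here are made up for illustration, and the keys are assumed to be distinct):

```cpp
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// A simple multiply-add string hash parameterized by a multiplier and mask.
struct ParamHash {
    std::uint64_t mult;
    std::size_t   mask;   // table_size - 1, table_size is a power of two
    std::size_t operator()(const std::string& s) const {
        std::uint64_t h = 0;
        for (unsigned char c : s) h = h * mult + c;             // multiply-add
        return static_cast<std::size_t>(h ^ (h >> 29)) & mask;  // shift-mix + mask
    }
};

// Brute-force search: try random odd multipliers until no two keys collide.
ParamHash find_collision_free_hash(const std::vector<std::string>& keys) {
    std::size_t size = 1;
    while (size < keys.size() * 4) size <<= 1;   // keep the table sparse
    std::mt19937_64 rng(12345);
    for (;;) {
        ParamHash h{static_cast<std::uint64_t>(rng()) | 1, size - 1};
        std::vector<char> used(size, 0);
        bool ok = true;
        for (const auto& k : keys) {
            std::size_t slot = h(k);
            if (used[slot]) { ok = false; break; }   // collision: new parameters
            used[slot] = 1;
        }
        if (ok) return h;   // collision-free (perfect) for this key set
    }
}
```

A lookup is then one hash, one array index, and a single string comparison to reject keys that were never part of the set.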

Another option is to use two distinct hash functions; this increases the cost of a single lookup but makes degradation slightly less likely than aliens stealing your clock cycles. It is rather unlikely that this would be a problem with typical strings and a decent hash function.
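
A sketch of the lookup side of such a two-hash-function table, which is essentially the cuckoo-hashing idea mentioned in the comment below; the hash functions and constants are illustrative only, and the build/insertion step is omitted:

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Every key lives in one of exactly two candidate slots, so a lookup costs at
// most two probes, no matter how unlucky the input is. Building the table
// (moving keys between their two slots until everything fits) is the
// cuckoo-hashing part and is not shown here.
struct TwoChoiceTable {
    struct Slot { std::string key; int value = 0; bool used = false; };
    std::vector<Slot> slots;   // assumed to be sized and filled at build time

    // Two simple FNV-style hashes that differ only in their multiplier; a real
    // implementation would use two properly seeded hash functions.
    static std::uint64_t hash(const std::string& k, std::uint64_t mult) {
        std::uint64_t h = 1469598103934665603ULL;
        for (unsigned char c : k) h = (h ^ c) * mult;
        return h;
    }

    std::optional<int> find(const std::string& k) const {
        for (std::uint64_t mult : {1099511628211ULL, 1000003ULL}) { // two probes max
            const Slot& s = slots[hash(k, mult) % slots.size()];
            if (s.used && s.key == k) return s.value;
        }
        return std::nullopt;   // not in either candidate slot
    }
};
```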

peterchen
  • 40,917
  • 20
  • 104
  • 186
  • 1
    +1 for considering to ask the "do you need an upper limit" question, plus your last paragraph. What you describe in the last paragraph is basically cuckoo hashing. It's slower for the individual lookup as you said (and for inserts, too), but it has a guaranteed upper bound on the worst case, which, if one has that requirement, is super cool. – Damon Dec 08 '11 at 11:48
0

Try google-sparsehash: http://code.google.com/p/google-sparsehash/

> An extremely memory-efficient hash_map implementation. 2 bits/entry overhead!
> The SparseHash library contains several hash-map implementations, including
> implementations that optimize for space or speed.
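
A small usage sketch, assuming a reasonably recent sparsehash release; the include path and the default hash for std::string have varied between versions, so treat those details as assumptions to verify:

```cpp
#include <sparsehash/sparse_hash_map>  // older releases used <google/sparse_hash_map>
#include <string>

int main() {
    // sparse_hash_map trades some lookup speed for very low memory overhead;
    // its sibling dense_hash_map trades memory for speed and additionally
    // requires set_empty_key() before any insert.
    google::sparse_hash_map<std::string, int> table;
    table["alpha"] = 1;
    table["beta"]  = 2;

    auto it = table.find("alpha");
    return it != table.end() ? it->second : -1;
}
```
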
Lei Mou
  • 2,562
  • 1
  • 21
  • 29
0

In a similar topic ((number of) items known at compile time), I produced this one: Lookups on known set of integer keys. Low overhead, no need for a perfect hash. Fortunately, it is in C ;-)

wildplasser
  • 43,142
  • 8
  • 66
  • 109