1

I am building a C++ open addressing Hash Table. It consists of an array of:

struct KeyValue {
    K key;
    V value;
}

with the type Key having two special elements: empty and tombstone. The first one is used to note that the slot is free, and the second one is used to note that the slot has been used but later deleted (it is necessary for probing).

The main challenge is to design an efficient API for this structure. I want to minimize the number of times a key is hashed and a slot is looked for.

So far, I have the following API which I find unsafe:

// Return the slot index if the key is in the table
// or a slot index where I can construct the KeyValue
// if the key is not here (or -1 if there is no slot
// available and the insertion of such a key would
// need to grow the hash table)
int search(const K& key)

// Tells if the slot is empy (or if i == -1)
bool empty(int i)

// Construct a KeyValue in the HashTable in the slot i
// which has been found by search. The i might be changed
// if the table needs to grow.
void insert(const K& key, const V& value, int& i)

// Accessors for a slot i which is occupied
const V& value(int i);

Note that the table also have classic functions such as

void insert(const K& key, const V& value)

which computes the hash, search for a slot, and insert the pair into the table. But I want to concentrate here on the interface that allows a programmer to make a very efficient use of the table.

For instance, here is a function that return the value of f(key) if it has never been computed or give back its value from a HashTable if is has already been computed.

const V& compute(const K& key, HashTable<K, V>& table) {
    int i = table.search(key);
    if (table.empty(i)) {
        table.insert(key, f(key), i);
    }
    return table.value(i);
 }

I am not completely keen on the interface for this HashTable as the method insert(const K&, const V&, int&) feels really unsafe to me.

Do you have any suggestion for a better API?

PS: The talk "Performance with algorithms, efficiency with data structures" by Chandler Carruth, especially after 23:50 is really nice to understand the problems with std::unordered_map

Dev2017
  • 857
  • 9
  • 31
InsideLoop
  • 6,063
  • 2
  • 28
  • 55
  • 1
    What is your goal? Performance? Memory usage? In any case, can you elaborate on why `std::unordered_map` is not sufficient? – selbie Dec 27 '16 at 10:17
  • The goal is to get a perfomant HashTable. One of the problem with std::unordered_map is that solves collision with linked list which is bad for a performance point of view. Also, try to write "compute" with a std::unordered_map. – InsideLoop Dec 27 '16 at 10:21
  • Got it. I'm also confused why you want to expose the "slot index" back to the caller who inserts. Because the whole point of a hash table is that you can insert something by "key" and look it up later by "key". If you are going to return a slot index to the caller that he maintains for subsequent lookup, then your implementation might as well just be a flat array with an incremental index. But I think what you would really want is to just expose three methods: `void Insert(k,v)`, `void Remove(k)`, and `v Lookup(k)`. – selbie Dec 27 '16 at 10:33
  • Also, I wrote my own hash table class a few years ago with the goal to never allocate memory after the instance was constructed. Because on the server side, memory allocations were hurting performance. You are welcome to reference it or use it. [It's here on GitHub](https://github.com/jselbie/stunserver/blob/master/common/fasthash.h) with [unit tests](https://github.com/jselbie/stunserver/blob/master/testcode/testfasthash.cpp) – selbie Dec 27 '16 at 10:41
  • selbie: You need to return the i if you want to write a "compute" function which hash the key only once. Obviously, the regular usage of an hash table would be to ask for the value later and do it through the hash function. Thanks for your reference on github. – InsideLoop Dec 27 '16 at 10:50
  • why does the method insert() feel unsafe?? you can call it by other name or signature, but eventually key and value will have to be provided to container – basav Dec 27 '16 at 11:00
  • @basav: It uses the index i coming from a previous function. If you use it with the wrong index, you mess up completely your hash table. – InsideLoop Dec 27 '16 at 11:01
  • ok, i was looking at the other signature(below), slots should not be exposed to the end user. – basav Dec 27 '16 at 11:03
  • infact, keep the same api as stl unorderd_map has, just switch collision handling strategy to double hashing, if that is the intended end result – basav Dec 27 '16 at 11:04
  • @basav: Yes it would be better not to expose slots to theend user. But I can't find another way if I want to write "compute" without hashing the key twice. – InsideLoop Dec 27 '16 at 11:04
  • @basav: How do you write "compute" with std::unordered_map without hashing the key twice ? – InsideLoop Dec 27 '16 at 11:05
  • how about using linear probing and monitoring loadfactor so that sufficient buckets are available?? this is just a suggestion, but there are other ways to avoid double hash, and gets pretty convoluted – basav Dec 27 '16 at 11:18
  • basav: I use quadratic probing (but one could also use linear probing). You need to be able to grow the table because if you keep calling i = search(key), and insert(key, value, i) there will be a time when the table is full if you don't allow it to grow. – InsideLoop Dec 27 '16 at 11:21
  • closing a question if you problem is solved is always a good idea :) – Dev2017 Mar 12 '17 at 10:11

2 Answers2

3

I think you should try super fast hashing functions.

Check this out https://github.com/Cyan4973/xxHash. I quote from its description: "xxHash is an Extremely fast Hash algorithm, running at RAM speed limits. It successfully completes the SMHasher test suite which evaluates collision, dispersion and randomness qualities of hash functions. Code is highly portable, and hashes are identical on all platforms (little / big endian)."

Also this thread from another question on this site: Fast Cross-Platform C/C++ Hashing Library. FNV, Jenkins and MurmurHash are known to be fast.

Take a look at this post where I posted the same answer I did here, there are other answers too there: Are there faster hash functions for unordered_map/set in C++?

Community
  • 1
  • 1
Dev2017
  • 857
  • 9
  • 31
0

You can make a get_or_insert function template that accepts an arbitrary functor instead of the value. Which you can then call with a lambda:

template <class K, class V>
class HashTable {
private:
    int search(const K& key);
    bool empty(int i);
    void insert(const K& key, const V& value, int& i);
    const V& value(int i);

public:    
    template <class F>
    const V& get_or_insert(const K& key, F&& f) {
        int i = search(key);
        if (empty(i)) {
            insert(key, f(), i);
        }
        return value(i);
    }
};

double expensive_computation(int key);

void foo() {
    HashTable<int, double> ht;
    int key = 42;
    double value = ht.get_or_insert(key, [key]{ return expensive_computation(key); });
}

If get_or_insert is inlined and you don't need to capture a lot, this should be just as efficient as the code you showed. In doubt compare the generated code using Godbolt's Compiler Explorer or similar. (And if it doesn't get inlined it will still be OK, unless you have to capture a lot of different variables. Assuming that you capture smart - i.e. capture stuff by reference if it's expensive to copy.)

Note: The "standard" way of passing a functor in C++ seems to be by value, but I think passing by reference makes more sense. In case everything gets inlined it shouldn't make a difference (and didn't in the example that I checked with GCC, Clang and MSVC), and in case the get_or_insert call doesn't get inlined, you really don't want to copy the functor if it captures more than 1 or 2 small and trivial variables.

The only downside of using an universal reference that I can imagine is if you have a functor that mutates its state in operator(). And with such functors, at least in the examples that I can think of, I want the original functor to be mutated. So not really a downside IMO.


Or a modified version of the above, suitable if the values are expensive to create/assign/destroy (like std::string): Call the functor with a mutable reference to the value in the slot. Then the functor can directly assign/mutate the value in the hash table -> no need to construct and destroy a temporary.

Paul Groke
  • 6,259
  • 2
  • 31
  • 32