2

A typical hash table lookup algorithm - including at least one claiming to be the fastest in the world - is structured something like this.

while (true) {
  if (currentSlot.isEmpty) return null;                   // reached an empty slot: key absent
  if (currentSlot.key == key) return currentSlot.value;   // found it
  currentSlot = GetNextSlot();                            // probe/chain to the next slot
}

The important point is that it checks one slot at a time, stopping if it finds the right key or reaches the end of the probe sequence, and otherwise continuing to the next slot. This is pseudocode illustrating the pattern, not a real implementation.
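To make that concrete, here is a minimal sketch in C of the open-addressing/linear-probing variant (the Slot struct and the lookup signature are invented purely for illustration; a real table would also handle deletions, resizing, tombstones, and so on):

#include <stdbool.h>
#include <stddef.h>

/* Illustrative only: a slot in an open-addressing table. */
typedef struct {
    bool occupied;
    int  key;
    int  value;
} Slot;

/* Returns a pointer to the value, or NULL if the key is absent.
   Assumes the table is never completely full, so an empty slot
   always terminates the probe sequence. */
int *lookup(Slot *table, size_t table_size, size_t hash, int key) {
    size_t i = hash % table_size;          /* the hash picks the first slot */
    while (true) {
        if (!table[i].occupied) return NULL;              /* empty: not present */
        if (table[i].key == key) return &table[i].value;  /* hit */
        i = (i + 1) % table_size;          /* GetNextSlot() for linear probing */
    }
}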

This looks like it should be a branch-prediction nightmare. When the table is very full or very empty, prediction should be quite reliable, but under normal use I would expect the branching during the search to be quite random, as it depends on the stored data.

I expected to find that high-performance hash tables would optimise with tricks like checking batches of four keys at once between branches to reduce mispredictions, but this doesn't seem to be the case.

Are branch mispredictions a significant cost in hash table lookups? If they are, why don't implementations typically try to avoid them? If they're not, why aren't they?

Sam
  • where is the "hash" part in the snippet that you showed? I mean `currentBucket` is supposed to hold `keys`, not a single `key`? A hash algorithm assumes that buckets group keys together: you "hash" the needed bucket and _then_ loop to find the needed key. How these keys are stored inside a bucket plays a role too – Eugene Jul 20 '20 at 14:14
  • @Eugene the "hash" part would go above the snippet. I skipped it because it's not relevant to the part I'm asking about. The hash would be used to find the first slot, then the next slots would be found depending on the type of implementation - either by following pointers or open addressing. I mixed up my language between "bucket" and "slot", sorry. I've corrected. – Sam Jul 20 '20 at 14:20
  • What does GetNextSlot do? – Seabizkit Jul 20 '20 at 16:20
  • @Seabizkit Gets the next slot by whatever means appropriate to the table implementation. That was an attempt to abstract away the differences between the open and closed addressing, and the various probing techniques in the open addressing case. – Sam Jul 20 '20 at 16:31
  • 1
    Performance?? What are you timing? It seems silly to do null checks on the first loop, but if this is about performance then surely you need to be able to see what the bulk of the work is. Is GetNextSlot some sort of built-in method I am unaware of? – Seabizkit Jul 20 '20 at 20:54
  • @Seabizkit That's not real code, don't worry about the details. It was only intended to be illustrative of the loop and branch structure. I'm not timing anything because it was a question about hash table implementation theory, not about a specific implementation. – Sam Jul 20 '20 at 21:41
  • I don't understand why this question has received a couple of down votes. Is it related to what @Seabizkit said about `GetNextSlot()`, which received an up vote on the comment? I don't understand what the confusion is. I thought that "A typical hash table lookup algorithm [...] is structured something a little bit like this." made clear that the snippet isn't real implementation code, and that anybody answering would know about hash table implementations and recognise this pattern and what `GetNextSlot()` is abstracting over. The answerers - particularly @Peter-Cordes - understood me well. – Sam Jul 21 '20 at 01:05
  • Thanks Sam for clearing that up!!! I understand now what you're getting at, I think. Hmm, I think adding the code snippet confuses things; SO isn't really about such issues/questions, and "Software Engineering" is probably a better fit. To me, if you had read that entire link and had some understanding of how it works, then it's hard to understand unless you suggest an alternative. Branch mispredictions will affect performance, but without branches you could have really slow lookups, if my basic understanding is correct. So their role is that they speed things up hugely, and you take a hit for misses. – Seabizkit Jul 21 '20 at 13:53
  • But I would argue they're not misses, they still find... unless I'm completely wrong, could be #tag – Seabizkit Jul 21 '20 at 13:54

3 Answers

2

Are branch mispredictions a significant cost in hash table lookups?

This is highly dependent on the test case. When the hash table is too big to fit in the CPU caches, the latency of main memory (usually 60-120 ns) is much bigger than the cost of a branch misprediction (usually 10-15 cycles, which is only a few nanoseconds on a multi-GHz core). The same applies to the L3 cache, although the effect is less visible.

If they are, why don't implementations typically try to avoid them?

The main reason is that this is hard, and I guess not always possible (especially when the key and the value are non-POD objects).

Jérôme Richard
1

My understanding is that for good performance, you want to make your hash table large enough that chaining to another bucket is rare. The rarer it is, the less it matters that the branch will mispredict when chaining does happen (because branch prediction learns from the common case, in which the search ends without chaining to the next bucket).

A good hash table will choose its chaining algorithm to distribute well, avoiding big buildups of long chains until the hash table is very nearly full. (If we're talking about "stealing" another bucket in the main table, rather than building a linked list for this entry. Linked lists are relatively slow to traverse / linear search because of the data dependency for following the next pointer.)


Also note that a cache miss (or even an L3 cache hit) may cost more than a branch miss, on modern CPUs with fast recovery that doesn't have to wait for the out-of-order back-end to drain before starting recovery. (See What exactly happens when a skylake CPU mispredicts a branch?)

Chaining to the adjacent bucket might not be optimal; I forget what the state of the art is. IIRC it's not optimal because one chain creates a dense group of used-up buckets, so any other entries that hash to those buckets would also have to chain, making the problem worse. You don't want to do more scattered memory fetches than you need.
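For instance, quadratic probing is one classic way to avoid that primary clustering. A minimal sketch of the probe-step difference (hypothetical helper functions, not from any particular implementation):

#include <stddef.h>

/* Linear probing: always step to the adjacent bucket. Cache-friendly,
   but collisions pile up into dense runs of occupied buckets
   ("primary clustering"). */
size_t probe_linear(size_t i, size_t table_size) {
    return (i + 1) % table_size;
}

/* Quadratic probing with triangular steps (offsets 1, 3, 6, 10, ... from
   the home bucket): a chain spreads across the table instead of filling a
   contiguous run. With a power-of-two table_size, triangular steps are
   guaranteed to visit every slot. */
size_t probe_quadratic(size_t i, size_t step, size_t table_size) {
    return (i + step) % table_size;  /* caller increments step after each probe */
}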

Otherwise, if you were just chaining to adjacent entries in the array you're using as a hash table, then yes, branchlessly fetching some adjacent keys (from the same or next cache line) could be fine. If the entries are small enough, checking them in parallel with SIMD could possibly even be worth it. At least for integer keys or something else that can be compared efficiently; it wouldn't be worth doing 4 string-compares in parallel.
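As a rough sketch of what that batched check could look like for 32-bit integer keys with SSE2 (assuming the keys of a group of slots sit contiguously in memory; the function name and layout are invented for illustration):

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Branchlessly compare 4 consecutive 32-bit keys against `key`.
   Returns a 16-bit mask with 4 bits set per matching lane (0 if no match),
   so one usually-well-predicted branch on the result replaces four
   data-dependent per-key branches. */
int match_mask4(const int32_t *keys, int32_t key) {
    __m128i needle = _mm_set1_epi32(key);                     /* broadcast key */
    __m128i batch  = _mm_loadu_si128((const __m128i *)keys);  /* load 4 keys */
    __m128i eq     = _mm_cmpeq_epi32(batch, needle);          /* lane-wise == */
    return _mm_movemask_epi8(eq);                             /* pack to bitmask */
}

The caller tests the mask once per group of four; on a hit, a trailing-zero count on the mask (divided by 4) recovers which of the four keys matched.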


So yes, cache misses are bad. But if most of your lookups find their hit in the first bucket, then that's an easy pattern for branch prediction. In many use-cases for hash tables, the majority of lookups will be for existing entries that get used repeatedly.

More sophisticated branch predictors (like IT-TAGE in Intel Haswell and later) use recent branch history to form an index into the table of branch-prediction entries, so different branching leading up to different kinds of lookups can use different predictions for the same branch in the same hash table code. So e.g. one function that typically looks up entries that are not present might have its hash lookups predict correctly (if they don't chain), while another function that looks up known stuff can also predict that correctly.


Still, in the case where your hash table is starting to fill up and you have a non-negligible amount of chaining, branch mispredicts would be something to worry about. (And measure with hardware performance counters, e.g. perf stat or perf record -e branch-misses ./my_hashtable_benchmark ; perf report on GNU/Linux.)

Peter Cordes
  • This is an excellent answer with a lot of good information. I think your main point that addressed my confusion was actually the first sentence. I had a misapprehension from another source that long chains (4+) were statistically common, but I realise now that's not true in good implementations, so the branch prediction is less important than I thought. Long chains can occur in open addressing tables with linear probing because of bunching, but if that's avoided then it's rarer than I thought at reasonable loads. – Sam Jul 20 '20 at 16:29
  • @Sam: Branch prediction is still *important*, just not as hard for the CPU to get right because the pattern is simple (mostly or rarely taken). :P But yes, agreed, that's the key point of the answer. – Peter Cordes Jul 20 '20 at 22:33
0

Yes, it might hurt if there are multiple keys in a slot; that is why hash table lookup is only O(1) on average (expected), not guaranteed.

Usually, the thing you are looking for will be in the very first slot, and there are code patterns for that. For example, Java's HashMap has this piece of code:

 final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        // ... otherwise walk the rest of the chain (or tree) ...

Notice the always check first node comment. Finding the key in the first node is a very common case, so it is treated specially.

Eugene
  • Are you sure the "always check first node" is a performance optimization, not just correctness / how they implement chaining? It might be an optimistic early check for the common case, but it's hard to tell from just this code. (And you're not definitively stating that.) – Peter Cordes Jul 20 '20 at 14:50
  • @PeterCordes I always thought it was both, sort of. But now that you asked and I looked at the code again, it looks more like an implementation pattern. – Eugene Jul 20 '20 at 15:03