0

Consider the following two alternative pieces of code:

Alternative 1:

if (variable != new_val) // (1)
    variable = new_val;

f(); // This function reads `variable`.

Alternative 2:

variable = new_val; // (2)
f(); // This function reads `variable`.

Which alternative is "statistically" faster? Assume variable is in the L1 cache before (1) or (2).

My guess is that alternative (1) is faster even if the branch-misprediction rate is high, but I don't really know the cost of an "if". That guess rests on the assumption that cache misses are far more expensive than branch mispredictions, but I'm not sure about that either.

What if variable wasn't in cache before (1) or (2)? Does it change the situation much?

NOTE: Since the situation could change a lot among different CPUs, you can base your answer on an architecture you are familiar with, although answers for widely used CPUs, such as modern Intel architectures, are preferred. The goal of my question is really to learn a bit more about how CPUs work.

Peter Cordes
ABu
  • 3
    No way to tell without benchmarking. – Barmar Oct 25 '21 at 16:39
  • 1
    It almost certainly depends on the specific CPU. – Barmar Oct 25 '21 at 16:39
  • 2
    Alt 1 can include alternative 2, via out-of-order execution, in which case the result is just discarded when the predicate does not hold. Based on this, I'd say that Alternative 2 is almost always more efficient. Efficiency is hard to pinpoint at this fine a grain even with micro-benchmarks, since you'd also have to consider the side effects on the rest of the program, e.g., the mere act of prefetching assigns more workload to the prefetcher. Another point is that when doing the comparison you've already placed your variables in registers, which would be a big part of the assignment alternative. – Lorah Attkins Oct 25 '21 at 16:45
  • Why do you think that version 1 could have less cache misses than version 2? – Mike Vine Oct 25 '21 at 16:47
  • 2
    (1) is dependent on the previous value of `variable`, which will require fetching it from cache if needed, whereas the compiler is allowed to completely disregard previous values in (2). I'd be surprised if (1) is faster unless the type of `variable` has a large `sizeof()` or has some side-effect-producing assignment operator. But as always: don't assume, benchmark. –  Oct 25 '21 at 16:48
  • 1
    In a multi-threading environment where `variable` represents an atomic variable of shared data between several threads with high thread contention, I would guess that alternative #1 is faster, because writes to shared data are very cache-unfriendly. However, in most situations, I would say that alternative #2 is faster. Either way, I recommend that you use benchmarking to determine which method is best in your particular situation. – Andreas Wenzel Oct 25 '21 at 16:51
  • I added a note based on @Barmar suggestion. – ABu Oct 25 '21 at 16:58
  • 1
    I don't understand. They are functionally not the same. The first example involves a branch, versus the second, which doesn't. The second version should be faster because branch prediction isn't initiated; it's a data instruction and processors love data instructions. Invoking branch prediction takes time, whereas not invoking branch prediction adds no time. – Thomas Matthews Oct 25 '21 at 17:00
  • 2
    @Peregring-lk the cost of misprediction can be very high. Take the pipeline flush into consideration. – 0___________ Oct 25 '21 at 17:04
  • Print out the assembly language. Note the difference in instructions. – Thomas Matthews Oct 25 '21 at 17:04
  • 2
    Remember that `variable` can be placed into a register, and that affects whether the *variable* is cached or not. In my understanding, registers don't involve using the cache, except to load and store values. Thus there is a possibility that `f()` doesn't use the cache because the value is still in a register. Depends on *when* the `variable` is used in `f()` and how the compiler generated the instructions. – Thomas Matthews Oct 25 '21 at 17:07
  • There are some processors which can conditionally execute instructions (like the ARM). In example 1, the assignment instruction could be a conditional instruction, either executed or ignored based on the status codes and the condition on the data instruction. – Thomas Matthews Oct 25 '21 at 17:10
  • 1
    @ThomasMatthews How would the assembly code help? Branch prediction happens in hardware, not instructions. – Barmar Oct 25 '21 at 17:10
  • @Barmar: Assembly language shows whether or not to invoke branch prediction. Usually branch prediction can be started using a compare instruction or a conditional branch instruction. The assembly language of the first example should show a branch instruction; whereas the second example should be void of the branch instruction (before the function call). This is what I mean by examining the assembly language. – Thomas Matthews Oct 25 '21 at 17:14
  • 1
    @ThomasMatthews Obviously there will be a conditional branch in the first version. The question is whether the speed-up from correct branch prediction makes up for the time it takes to do the comparison. – Barmar Oct 25 '21 at 17:16
  • 1
    Note well that a C compiler would be free in many cases to compile both examples to the same machine code. And that common machine code might or might not include a conditional branch. – John Bollinger Oct 25 '21 at 18:20
  • Does this answer your question? [Strange optimization? in `libuv`. Please explain](https://stackoverflow.com/questions/41067315/strange-optimization-in-libuv-please-explain) – EOF Oct 25 '21 at 18:48
  • Could also be considered a duplicate of [this](https://stackoverflow.com/q/22717010/3185968) – EOF Oct 25 '21 at 18:50
  • What types are the variables? – Jeremy Friesner Oct 26 '21 at 03:56
  • @JeremyFriesner Assume any primitive type. For example `int` or `uint64_t`, whatever it typedefs to. – ABu Oct 26 '21 at 06:55

1 Answer

3

Normally, alternative 2 is faster because it's less machine code executing, and the store buffer will decouple unconditional stores from other parts of the core, even if they miss in cache.
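
To make the "less machine code" point concrete, here is a rough sketch of the two alternatives as standalone functions. The commented x86-64 sequences are only my guess at what a typical optimizing compiler might emit for plain int globals with a tail call to f(); check your own compiler's output rather than trusting this sketch.

int variable, new_val; // plain int globals, as in the question
void f();              // defined elsewhere; reads `variable`

void alt1() {
    // roughly:  mov eax, [new_val]
    //           cmp [variable], eax
    //           je  .skip              ; possible branch miss
    //           mov [variable], eax    ; store only if the value changed
    // .skip:    jmp f                  ; tail call
    if (variable != new_val)
        variable = new_val;
    f();
}

void alt2() {
    // roughly:  mov eax, [new_val]
    //           mov [variable], eax    ; store buffer absorbs a possible miss
    //           jmp f                  ; tail call
    variable = new_val;
    f();
}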

If alternative 1 were consistently faster, compilers would make asm that did that, but it isn't, so they don't. It introduces a possible branch miss and a load that can cache-miss. There are plausible circumstances under which it could be better (e.g. false sharing with other threads, or breaking a data dependency), but those are special cases that you'd have to confirm with performance experiments and perf counters.


Reading variable in the first place already touches memory for both variables (if neither is in registers). If you expect new_val to almost always be the same (so the branch predicts well), and you expect that load to miss in cache, then branch prediction + speculative execution can be helpful in decoupling later reads of variable from that cache-miss load. But it's still a cache-miss load that eventually has to be waited for, before the branch condition can be verified, so the total miss penalty could end up being quite large if the branch predicts wrong. Otherwise you're hiding a lot of the cache-miss load penalty by making more later work independent of it, allowing OoO exec up to the limit of the ROB size.

Other than breaking the data dependency, there would be no point in branching if f() inlines and variable optimizes into a register. Otherwise, a store that misses in L1d but hits in L2 cache is still pretty cheap, and is decoupled from execution by the store buffer. (Can a speculatively executed CPU branch contain opcodes that access RAM?) Even hitting in L3 is not too bad for a store, unless other threads have the line in shared state and dirtying it would interfere with them reading values of other global vars (false sharing).

Note that later reloads of variable can use the newly stored value even while the store is waiting to commit from the store buffer to L1d cache (store forwarding), so even if f() didn't inline and use the new_val load result directly, its use of variable still doesn't have to wait for a possible store miss on variable.


Avoiding false-sharing is one of the few reasons it could be worth branching to avoid a single store of a value that fits in a register.
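
As a concrete (hypothetical) illustration of that case: suppose variable shares a cache line with an unrelated global that another thread reads in a hot loop. A writer that unconditionally re-stores the same value keeps flipping the line to Modified and invalidating the reader's copy; checking first leaves the line in shared state. The sketch below uses made-up names and relaxed atomics only to keep it free of data races; it is not a benchmark.

#include <atomic>
#include <cstdio>
#include <thread>

struct SharedLine {
    std::atomic<int> variable{0};   // (possibly) written by the writer thread
    std::atomic<int> other_var{0};  // read in a hot loop by the reader thread
};                                  // both members very likely share one 64-byte line

SharedLine line;

void writer(int new_val, long iters) {
    for (long i = 0; i < iters; ++i) {
        // Alternative 1: skip the store when the value is already current,
        // so the reader's copy of the cache line is not repeatedly invalidated.
        if (line.variable.load(std::memory_order_relaxed) != new_val)
            line.variable.store(new_val, std::memory_order_relaxed);
    }
}

void reader(long iters) {
    long sum = 0;
    for (long i = 0; i < iters; ++i)
        sum += line.other_var.load(std::memory_order_relaxed);
    std::printf("sum = %ld\n", sum);
}

int main() {
    std::thread w(writer, 42, 100000000L);
    std::thread r(reader, 100000000L);
    w.join();
    r.join();
}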

Two questions linked in comments by @EOF discuss a case of this possible optimization (or possible pessimization) to avoid writes. It's sometimes done with std::atomic variables because false sharing is an even bigger deal. (And stores with the default mo_seq_cst memory order are slow on most ISAs other than AArch64, draining the store buffer.)
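
For the std::atomic case, the idiom those questions discuss looks roughly like the following sketch (with a made-up flag name; this is my illustration, not the actual libuv code):

#include <atomic>

std::atomic<int> stop_flag{0};  // hypothetical flag shared between threads

void request_stop() {
    // A plain `stop_flag = 1;` is a seq_cst store: on most ISAs (AArch64 being
    // the notable exception) it effectively drains the store buffer, and it
    // dirties the cache line for every reader. Checking first skips that cost
    // on the common path where the flag already holds the value.
    if (stop_flag.load(std::memory_order_relaxed) != 1)
        stop_flag = 1;  // seq_cst store, done only when the value actually changes
}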

Peter Cordes