Normally, alternative 2 is faster: it executes less machine code, and the store buffer decouples unconditional stores from the rest of the core, even if they miss in cache.
If alternative 1 were consistently faster, compilers would emit asm that did that; it isn't, so they don't. It introduces a possible branch miss and a load that can cache-miss. There are plausible circumstances under which it could be better (e.g. false sharing with other threads, or breaking a data dependency), but those are special cases that you'd have to confirm with performance experiments and perf counters.
Reading `variable` in the first place already touches memory for both variables (if neither is in registers). If you expect `new_val` to almost always be the same (so the branch predicts well), and for that load to miss in cache, branch prediction + speculative execution can be helpful in decoupling later reads of `variable` from that cache-miss load. But it's still a cache-miss load that has to be waited for before the branch condition can be checked, so the total miss penalty could end up being quite large if the branch predicts wrong. Otherwise you're hiding a lot of the cache-miss load penalty by making more later work independent of it, allowing OoO exec up to the limit of the ROB size.
Other than breaking the data dependency, if `f()` inlines and `variable` optimizes into a register, it would be pointless to branch. Otherwise, a store that misses in L1d but hits in L2 cache is still pretty cheap, and decoupled from execution by the store buffer. (Can a speculatively executed CPU branch contain opcodes that access RAM?) Even hitting in L3 is not too bad for a store, unless other threads have the line in Shared state and dirtying it would interfere with them reading values of other global vars. (False sharing)
Note that later reloads of `variable` can use the newly-stored value even while the store is waiting to commit from the store buffer to L1d cache (store forwarding), so even if `f()` didn't inline and use the `new_val` load result directly, its use of `variable` still doesn't have to wait for a possible store miss on `variable`.
Avoiding false-sharing is one of the few reasons it could be worth branching to avoid a single store of a value that fits in a register.
Two questions linked in comments by @EOF discuss a case of this possible optimization (or possible pessimization) to avoid writes. It's sometimes done with `std::atomic` variables because false sharing is an even bigger deal. (And stores with the default `mo_seq_cst` memory order are slow on most ISAs other than AArch64, draining the store buffer.)