> what does operations like the following do behind the scenes
As far as how it's implemented internally, CPU hardware arbitrates which core gets ownership of the cache line when there's contention. See Can num++ be atomic for 'int num'? for a C++ and x86 asm / cpu-architecture explanation of the details.
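As a minimal sketch, assuming x86 and a typical compiler (`num` is just an illustrative name):

```cpp
#include <atomic>

std::atomic<int> num{0};

void inc() {
    // Compilers typically emit a single `lock add dword ptr [num], 1` here on x86;
    // the lock prefix keeps the cache line owned by this core for the whole
    // read-modify-write, which is the hardware arbitration described above.
    num++;  // same as num.fetch_add(1, std::memory_order_seq_cst)
}
```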
Re: why CPUs and compilers want to load early and store late: see Java instruction reordering and CPU memory reordering.
Atomic RMWs prevent that, as do seq_cst store semantics on most ISAs, where you do a plain store and then a full barrier. (AArch64 has a special interaction between `stlr` and `ldar` to prevent StoreLoad reordering of seq_cst operations, while still allowing reordering with other operations.)
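As a hedged sketch of what that looks like from C++ (the variable name is illustrative; exact code-gen varies by compiler):

```cpp
#include <atomic>

std::atomic<int> ready{0};

void publish() {
    // On x86, compilers typically implement this with `xchg` (or `mov` + `mfence`),
    // i.e. effectively a plain store plus a full barrier, to block StoreLoad
    // reordering. On AArch64 it's a single `stlr`, which only has to stay
    // ordered with respect to `ldar` loads.
    ready.store(1, std::memory_order_seq_cst);
}
```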
> Isn't it obvious that when a programmer write something like the following inside a multi-thread method [...]
What does that even mean? It's not running the same method in multiple threads that's a problem; it's accessing shared data. How is the compiler supposed to know which data will be accessed non-read-only from multiple threads at the same time, outside a critical section?
There's no reasonable way to prove this in general, only in some simplistic cases. If a compiler were to try, it would have to be conservative, erring on the side of making more things atomic, at a huge cost in performance. The other kind of mistake would be a correctness problem, and if that could just happen when the compiler guesses wrong based on some undocumented heuristics, it would make the language unusable for multi-threaded programs.
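For example, consider this hypothetical function; nothing in it tells the compiler whether the pointed-to object is shared:

```cpp
// *p might be a local variable of the caller, or a field reachable from
// many threads. The compiler can't tell from here, so promoting this to an
// atomic RMW "just in case" would penalize every single-threaded caller.
void bump(int* p) {
    (*p)++;
}
```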
Besides that, not all multi-threaded code needs sequential consistency all the time; often acquire/release or relaxed atomics are fine, but sometimes they aren't. It makes a lot of sense for programmers to be explicit about what ordering and atomicity their algorithm is built on.
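For instance, a plain event counter (an assumed example, not from the question) only needs atomicity, not ordering:

```cpp
#include <atomic>

std::atomic<long> hits{0};

void count_hit() {
    // Nothing else is published or acquired via this counter, so relaxed is
    // enough: it's still an atomic RMW, but on weakly-ordered ISAs it avoids
    // the extra barriers that seq_cst would require.
    hits.fetch_add(1, std::memory_order_relaxed);
}
```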
Also, you carefully design lock-free multi-threaded code to do things in a sensible order. In C++, you don't have to use `Interlock...`, but instead you make a variable `std::atomic<int> shared_int;` for example. (Or use `std::atomic_ref<int>` to do atomic operations on variables that other code can access non-atomically, like using Interlocked functions.)
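A minimal sketch of both styles (C++20 for `std::atomic_ref`; the names are illustrative):

```cpp
#include <atomic>

std::atomic<int> shared_int{0};  // always accessed atomically

int plain_int = 0;  // accessed non-atomically in single-threaded phases

void during_parallel_phase() {
    shared_int += 1;  // atomic RMW, seq_cst by default

    // atomic_ref: atomic access to an object that isn't std::atomic,
    // much like calling an Interlocked function on an ordinary variable.
    std::atomic_ref<int> ref(plain_int);
    ref.fetch_add(1, std::memory_order_relaxed);
}
```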
Having no explicit indication in the source of which operations are atomic with what ordering semantics would make it harder to read and maintain such code. Correct lock-free algorithms don't just happen by having the compiler turn individual operators into atomic ops.
Promoting every operation to atomic would destroy performance. Most data isn't shared, even in functions that access some shared data structures.
Atomic RMW (like x86 `lock add [rdi], eax`) is much slower than a non-atomic operation, especially since non-atomic code lets the compiler optimize variables into registers. An atomic RMW on x86 is also a full memory barrier, so making every operation atomic would destroy memory-level parallelism every time you use `+=` or `++`.
e.g. one per 18 cycles throughput on Skylake for `lock xadd [mem], reg` if hot in L1d cache, vs. one per 0.25 cycles for `add reg, reg` (https://uops.info). That's not to mention the lost opportunities to optimize away and combine operations, and the reduced ability of out-of-order execution to overlap independent work.