
After asking this question, I've understood that an atomic instruction, such as test-and-set, doesn't involve the kernel. Only when a process needs to be put to sleep (to wait to acquire the lock) or woken up (because it couldn't acquire the lock earlier but now can) does the kernel have to get involved, to perform the scheduling operations.

If so, does it mean that a memory fence, such as std::atomic_thread_fence in C++11, also won't involve the kernel?
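For concreteness, here's a sketch of the kind of code I have in mind (just an illustration, not my real code):

#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;
int shared_data = 0;

void locked_increment() {
    // test-and-set spin loop: only user-space instructions, as I understand it
    while (lock_flag.test_and_set(std::memory_order_acquire)) {
        // spin; a real mutex would eventually ask the kernel to put us to sleep here
    }
    ++shared_data;                                  // critical section
    lock_flag.clear(std::memory_order_release);     // release the lock
}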

Peter Cordes
Yves
  • Can you explain *why* you want to know these implementation details? As a user of a given system (OS/compiler/libraries) you can't change the behavior. And if you need synchronization of memory access, you have to use it. I simply want to know how any kind of answer will change your code or whatever... Thanks – Klaus Feb 12 '20 at 09:43
  • @Klaus Because of optimization. A synchronization primitive based on assembler has a latency of ~100 cycles (the order of magnitude of a main-memory access; it depends on many factors however). If kernel scheduling is involved, it can take milliseconds. There are real-time applications in some contexts where kernel scheduling can take too long, and it's preferable to spin-lock instead of yield()-ing. – Sigi Feb 12 '20 at 10:04
  • @Klaus Well, I'm studying the difference between `mutex` and `atomic` in C++11, especially their performance. If `atomic` and memory fences don't involve the kernel, at least I can be sure that `atomic` + memory fence will cause fewer context switches. – Yves Feb 12 '20 at 10:06
  • @Sigismondo: It's quite clear that switching to kernel mode takes much longer than a few assembler instructions and the resulting actions like cache-line syncs and so on. But the user must use the synchronization anyway, so there is no way to change the design in any case... – Klaus Feb 12 '20 at 10:06
  • OK, but for all such questions you have to tell us which OS/compiler/libs you are using, as all of this is specific to a given platform. On small embedded devices you typically have no assembler instructions for memory syncs at all, so such fences must be forwarded to some kind of underlying OS call. A typical x86 CPU can handle it at the assembler level. So yes, on x86 `std::atomic_thread_fence` will typically not call any OS functions. But you can simply look at the code generated by your compiler yourself; you will see the OS calls if the compiler emits any. – Klaus Feb 12 '20 at 10:10
  • You can choose whether to use pthread_wait() or a spin loop, and it's with spin-looping that the many different memory fences (on an architecture that supports this level of refinement in synchronization, e.g. ARM) become useful for optimization. It's clear, however, that this is extremely useful to whoever is implementing the kernel synchronization primitives themselves - see the refs in my answer. – Sigi Feb 12 '20 at 10:21

2 Answers


std::atomic doesn't involve the kernel¹

On almost all normal CPUs (the kind we program for in real life), memory-barrier instructions are unprivileged and get used directly by the compiler, the same way compilers know how to emit instructions like x86 lock add [rdi], eax for fetch_add (or lock xadd if you use the return value). Or, on other ISAs, fences are literally the same barrier instructions compilers use before/after loads, stores, and RMWs to give the required ordering. https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/
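For example, a trivial counter increment (illustration only) compiles to a single locked instruction, with no library or kernel call:

#include <atomic>

std::atomic<int> counter{0};

int bump() {
    // x86-64 GCC/clang: a single `lock xadd` (the old value is returned);
    // with the return value unused it would be `lock add`.  Pure user space.
    return counter.fetch_add(1, std::memory_order_seq_cst);
}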

On some arbitrary hypothetical hardware and/or compiler, anything is of course possible, even if it would be catastrophically bad for performance.

In asm, a barrier just makes this core wait until some previous (program-order) operations are visible to other cores. It's a purely local operation. (At least, this is how real-world CPUs are designed, so that sequential consistency is recoverable with only local barriers to control local ordering of load and/or store operations. All cores share a coherent view of cache, maintained via a protocol like MESI. Non-coherent shared-memory systems exist, but implementations don't run C++ std::thread across them, and they typically don't run a single-system-image kernel.)
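To illustrate that purely-local ordering, here's the classic publish/consume pattern written with stand-alone fences (a sketch; the variable names are made up):

#include <atomic>

int payload;                        // plain non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                             // 1. write the data
    std::atomic_thread_fence(std::memory_order_release);      // order 1 before 2 on this core
    ready.store(true, std::memory_order_relaxed);             // 2. publish the flag
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) {}         // wait to see 2
    std::atomic_thread_fence(std::memory_order_acquire);      // order 2 before 3 on this core
    int x = payload;                                          // 3. guaranteed to read 42
    (void)x;
}

Each fence only constrains the ordering of this core's own accesses; the cross-core visibility itself comes from cache coherence (MESI), not from the fence.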

Footnote 1: (Even non-lock-free atomics usually use light-weight locking).

Also, ARM before ARMv7 apparently didn't have proper memory barrier instructions. On ARMv6, GCC uses mcr p15, 0, r0, c7, c10, 5 as a barrier.
Before that (g++ -march=armv5 and earlier), GCC doesn't know what to do and calls __sync_synchronize (a libatomic GCC helper function) which hopefully is implemented somehow for whatever machine the code is actually running on. This may involve a system call on a hypothetical ARMv5 multi-core system, but more likely the binary will be running on an ARMv7 or v8 system where the library function can run a dmb ish. Or if it's a single-core system then it could be a no-op, I think. (C++ memory ordering cares about other C++ threads, not about memory order as seen by possible hardware devices / DMA. Normally implementations assume a multi-core system, but this library function might be a case where a single-core-only implementation could be used.)
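If you want to check whether footnote 1 applies to a given type on your target, the library can tell you directly (a quick check, nothing more; you may need to link with -latomic on GCC):

#include <atomic>
#include <cstdio>

struct Big { char bytes[64]; };   // usually too large to be lock-free

int main() {
    std::atomic<int> small_obj;
    std::atomic<Big> big_obj;
    // lock-free atomics compile to plain instructions; non-lock-free ones
    // fall back to the library's light-weight locking from footnote 1.
    std::printf("int lock-free: %d, Big lock-free: %d\n",
                (int)small_obj.is_lock_free(), (int)big_obj.is_lock_free());
}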


On x86 for example, std::atomic_thread_fence(std::memory_order_seq_cst) compiles to mfence. Weaker barriers like std::atomic_thread_fence(std::memory_order_release) only have to block compile-time reordering; x86's runtime hardware memory model is already acq/rel (seq-cst + a store buffer). So there aren't any asm instructions corresponding to the barrier. (One possible implementation for a C++ library would be GNU C asm("" ::: "memory");, but GCC/clang do have barrier builtins.)
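A sketch of what that could look like inside a library on x86, using a plain GNU C compiler barrier (an assumption for illustration; real libstdc++/libc++ use the compiler's builtins instead):

// hypothetical x86-only helper: blocks compile-time reordering, emits zero instructions
static inline void release_fence_x86(void) {
    asm volatile("" ::: "memory");
}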

std::atomic_signal_fence only ever has to block compile-time reordering, even on weakly-ordered ISAs, because all real-world ISAs guarantee that execution within a single thread sees its own operations as happening in program order. (Hardware implements this by having loads snoop the store buffer of the current core.) VLIW and IA-64 EPIC, or other explicit-parallelism ISA mechanisms (like Mill with its delayed-visibility loads), still make it possible for the compiler to generate code that respects any C++ ordering guarantees involving the barrier, even if an async signal (or an interrupt, for kernel code) arrives after any instruction.
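For example, handing data to an async signal handler in the same thread only needs that compile-time barrier (a sketch; the sigaction setup is omitted):

#include <atomic>
#include <csignal>

int data_for_handler;                         // only accessed by this thread and its handler
volatile std::sig_atomic_t data_is_ready = 0;

void prepare(int v) {
    data_for_handler = v;
    // only has to stop the compiler reordering the two stores; emits no asm
    // even on weakly-ordered ISAs, because the handler runs in this same thread
    std::atomic_signal_fence(std::memory_order_release);
    data_is_ready = 1;
}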


You can look at code-gen yourself on the Godbolt compiler explorer:

#include <atomic>
void barrier_sc(void) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

x86: mfence.
POWER: sync.
AArch64: dmb ish (full barrier on "inner shareable" coherence domain).
ARM with gcc -mcpu=cortex-a15 (or -march=armv7): dmb ish
RISC-V: fence iorw,iorw

void barrier_acq_rel(void) {
    std::atomic_thread_fence(std::memory_order_acq_rel);
}

x86: nothing
POWER: lwsync (light-weight sync).
AArch64: still dmb ish
ARM: still dmb ish
RISC-V: still fence iorw,iorw

void barrier_acq(void) {
    std::atomic_thread_fence(std::memory_order_acquire);
}

x86: nothing
POWER: lwsync (light-weight sync).
AArch64: dmb ishld (load barrier, doesn't have to drain the store buffer)
ARM: still dmb ish, even with -mcpu=cortex-a53 (an ARMv8) :/
RISC-V: still fence iorw,iorw

Peter Cordes

In both this question and the referenced one you are mixing up two different things:

  • synchronization primitives at the assembler level, like cmpxchg and fences
  • process/thread synchronization, like futexes

What does "it involves the kernel" mean? I guess you mean "(p)thread synchronization": the thread is put to sleep and will be woken as soon as the given condition is met by another process/thread.

However, test-and-set primitives like cmpxchg and memory fences are functionality provided by the processor's instruction set. The kernel synchronization primitives are ultimately built on top of them to provide system- and process-level synchronization, using shared state in kernel space hidden behind kernel calls.

You can look at the futex source to see evidence of this.

But no, memory fences don't involve the kernel: they are translated into plain assembler instructions, the same as cmpxchg.
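A minimal sketch of that split (illustrative only; a real mutex has more logic): the spin path uses only user-space instructions, and only the commented-out fallback would enter the kernel, e.g. via futex(2) on Linux:

#include <atomic>

std::atomic<int> state{0};   // 0 = unlocked, 1 = locked

void lock() {
    int expected = 0;
    // cmpxchg loop: runs entirely in user space, no kernel involvement
    while (!state.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
        expected = 0;
        // a real mutex would give up after a few spins and call
        // futex(&state, FUTEX_WAIT, 1, ...) so the kernel can put the thread to sleep
    }
}

void unlock() {
    state.store(0, std::memory_order_release);
    // a real mutex would call futex(&state, FUTEX_WAKE, 1, ...) if there are waiters
}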

Sigi