
I have kernel code and userspace code that synchronize on atomic variables. The kernel and userspace code may be running on the same logical core or different logical cores. Let's assume the architecture is x86_64.

Here is an initial implementation to get our feet wet:

Kernel (C)                        Userspace (C++)
---------------------------       -----------------------------------
Store A (smp_store_release)       Store B (std::memory_order_release)
Load B (smp_load_acquire)         Load A (std::memory_order_acquire)

I require that from the perspective of each thread, its own load happens after its own store. So for example, from userspace's perspective, the load of A must happen after the store to B.

Furthermore, I similarly require that each thread observes the other thread's load as happening after that other thread's store. So for example, from the kernel's perspective, it must observe that userspace stores to B before it loads A.

Clearly, the code above is insufficient to meet these two requirements, so for the sake of this question, I rewrite it as so:

Kernel (C)                        Userspace (C++)
---------------------------       -----------------------------------
Store A (smp_store_release)       Store B (std::memory_order_release)
cpuid                             std::atomic_thread_fence(std::memory_order_seq_cst)
Load B (smp_load_acquire)         Load A (std::memory_order_acquire)

According to the Intel manual, cpuid is a serializing operation.

Here are my questions:

  1. If I issue cpuid with the asm compiler-level memory barrier, does this have the same behavior as a sequentially consistent fence?
  2. Now let's say I issue cpuid without the asm compiler-level memory barrier. Furthermore, let's say that the store to A is standard kernel code while the load of B is done by a BPF program. Does cpuid have the same behavior as a sequentially consistent fence in this case? My impression is that it does, because (1) cpuid provides hardware serialization and (2) compiler reordering is impossible since the kernel is compiled separately from the BPF program.
  3. The C++ standard requires that synchronization between threads occur on the same atomic object (i.e. the same address). It seems that issuing an mfence (or another type of fence) is sufficient to achieve hardware serialization, and mfence does not take a memory address as an operand. Thus, does the standard impose this requirement solely to prevent compiler reordering?

1 Answer


Reordering and serialization (ordering) only apply within the current thread, to the order in which its accesses to coherent shared cache become globally visible. See https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/. Barriers alone don't synchronize with another thread. Sufficient barriers can ensure that the global order of operations is some interleaving of each thread's program order (e.g. if all atomic ops use seq_cst), but not that some load in another thread will actually see some store in this thread; that load might have already happened, and you can't rewind time.


On x86, to do a store that can't reorder with any later loads (or with anything else, in either direction), use xchg. With a memory operand it has an implicit lock prefix that makes it a full barrier.

To get that from modern compilers, use B.store(value, std::memory_order_seq_cst), where seq_cst is the default anyway. (Older GCC used to compile that to a mov-store + mfence, which is equivalent: Why does a std::atomic store with sequential consistency use XCHG?)

To document in your source that you don't want the later load to reorder with that store, make it seq_cst as well, like A.load(std::memory_order_seq_cst).

ISO C++ guarantees that seq_cst operations are part of a global total order that's consistent with source order, so seq_cst operations don't allow StoreLoad reordering with each other.

That's actually important on some other ISAs, notably AArch64, where a seq_cst store can reorder with later loads that aren't seq_cst. Only stlr / ldar have the special interaction that prevents StoreLoad reordering. An acquire load using ldapr doesn't have to wait for older stlr stores to commit to L1d cache.

x86 doesn't (yet?) have hardware support for doing anything weaker than a full barrier after or as part of the store but still strong enough for seq_cst, unlike AArch64. But there's no reason to write the source with weaker operations like acquire and rely on asm details to give you the StoreLoad ordering you need: You can still just use B.store(val) ; A.load() with both using the default seq_cst (or make it explicit). That will compile safely and cheaply on x86 and everywhere else. (x86 seq_cst loads are just plain loads because the cost of avoiding StoreLoad reordering is already done in every seq_cst store. https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)
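
For example, something like this minimal sketch is all the source needs to say (A, B, and the stored value 1 are placeholder names, and the comments describe typical x86-64 code-gen, not anything the standard promises about instruction choice):

    #include <atomic>

    std::atomic<int> A{0}, B{0};   // placeholders for your shared variables

    int userspace_side()
    {
        // seq_cst store: current compilers typically emit xchg on x86-64,
        // a full barrier, so the later load can't reorder above it.
        B.store(1, std::memory_order_seq_cst);

        // seq_cst load: just a plain mov on x86-64; the store above
        // already paid the cost of blocking StoreLoad reordering.
        return A.load(std::memory_order_seq_cst);
    }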

A seq_cst store is also a release operation, and a seq_cst load is also an acquire operation, in case you're worried about using seq_cst instead of std::memory_order_release.

If your kernel code doesn't support a seq_cst store operation, put a seq_cst fence / full barrier between its store and load. The same thing would work with std::atomic in C++, it's just more efficient to let the compiler include the barrier as part of the SC store, instead of needing a separate mfence or dummy lock add byte [rsp], 0 or something to implement a separate std::atomic_thread_fence(std::memory_order_seq_cst).
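
In C++ terms, that fence-based alternative would look something like this sketch (again A and B are placeholders); it's the same pattern your kernel side would use with its release store, full barrier, and acquire load:

    #include <atomic>

    std::atomic<int> A{0}, B{0};   // placeholders for your shared variables

    int fence_version()
    {
        B.store(1, std::memory_order_release);

        // Full barrier between the store and the load: typically mfence or a
        // dummy lock'd RMW on x86-64. It blocks StoreLoad reordering, which
        // release/acquire on their own would not.
        std::atomic_thread_fence(std::memory_order_seq_cst);

        return A.load(std::memory_order_acquire);
    }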


I have no idea why you think cpuid would be a good idea here since you only have data stores/loads, not cross-modifying code or anything tricky. Perhaps you're thinking that "serialize" implies affecting other cores? It doesn't. But to answer your questions:

  1. Yes, cpuid is at least as strong a barrier as mfence, stronger in some ways (re: code fetch for cross-modifying code IIRC). When you say "asm compiler-level memory barrier", you're talking about asm("cpuid" ::: "memory", "eax", "ebx", "ecx", "edx"), a GNU C asm statement with a "memory" clobber? (And register clobbers on the regs CPUID writes, unlike the new serialize instruction in Sapphire Rapids IIRC, which probably also avoids being a vmexit the way CPUID always is in VMs.)

  2. Yes, if there's no chance for compile-time reordering between the ops you care about, asm("mfence" ::: ); is sufficient, but not a good idea.

  3. Serializing (ordering) the global visibility of memory operations on the current core or thread doesn't sync with another thread on its own. To sync-with another thread and create a happens-before relationship, you need a load in this thread to see a store done by the other thread. Specifically an acquire (or stronger) load seeing a release (or stronger) store, or equivalent barriers.

    See https://preshing.com/20120913/acquire-and-release-semantics/ re: writing a buffer then data_ready.store(true, std::memory_order_release), so a reader that reads data_ready as true can safely read the buffer and see the earlier non-atomic stores done by the writer. (There's a short sketch of that pattern after this list.)

    Just running a barrier instruction to order your own loads+stores after a data_ready.load() doesn't help anything if the writer is still in the middle of writing the buffer and hasn't stored to data_ready yet.

    Memory fences aka barriers aren't like a pthread_barrier_wait() synchronization operation where every thread reaching it waits until all threads have reached it. Totally different concept.
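
Here's that writer/reader pattern as a minimal sketch (buffer, data_ready, and the value 42 are made-up names for illustration); the reader synchronizes-with the writer only when its acquire load actually observes the release store:

    #include <atomic>

    int buffer;                           // plain (non-atomic) payload
    std::atomic<bool> data_ready{false};

    void writer()
    {
        buffer = 42;                                         // plain store
        data_ready.store(true, std::memory_order_release);   // publish it
    }

    void reader()
    {
        // Synchronizes-with the writer only if this load sees true.
        if (data_ready.load(std::memory_order_acquire)) {
            int v = buffer;   // guaranteed to see 42: happens-before established
            (void)v;
        }
    }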
