
Why do we need to define two types of barriers with the same implementation?

For example, this code from io_uring in Linux:

```c
#if defined(__x86_64) || defined(__i386__)
#define read_barrier()  __asm__ __volatile__("":::"memory")
#define write_barrier() __asm__ __volatile__("":::"memory")
#else
```

2 Answers


The real answer is: because x86's memory model is already strong enough that blocking compile-time reordering is sufficient for load or store ordering; runtime reordering is already blocked by hardware.

Those are just generic compile-time barriers implemented with a piece of empty inline assembly that prevents GCC from reordering memory accesses across it. It's explained pretty well in this other post. What can be achieved with this "trick" is usually also possible using the C volatile qualifier.
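As a concrete illustration (a minimal sketch with made-up names, not io_uring code), this is the kind of compile-time reordering such a barrier prevents:

```c
/* Emits no machine code: the "memory" clobber just tells GCC that the
 * asm may read or write memory, so registers caching globals must be
 * spilled and loads/stores may not move across this statement. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

int data, flag;

void producer(void)
{
    data = 42;
    compiler_barrier();  /* keep the data store before the flag store in
                            the emitted asm; x86 hardware then preserves
                            the StoreStore order at runtime on its own */
    flag = 1;
}
```

(In real code the shared variables would also need volatile or atomics so the compiler can't elide the accesses entirely; this only shows the ordering effect.)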

Note that the Linux kernel does not use those specific macros anywhere in its own code; those are just two macros defined for the io_uring userspace test tools. It definitely uses asm volatile ("" ::: "memory") where needed, but under different names (e.g. smp_rmb(), smp_wmb()).
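On x86 those kernel macros bottom out in the very same trick. A rough sketch, simplified from arch/x86/include/asm/barrier.h and the barrier() definition in the compiler headers (see the exact sources linked in the comments below; details vary by kernel version):

```c
#define barrier()   __asm__ __volatile__("" ::: "memory")

#define __smp_rmb() barrier()   /* LoadLoad:   compile-time only on x86 */
#define __smp_wmb() barrier()   /* StoreStore: compile-time only on x86 */
```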

x86's memory model makes sfence and lfence entirely useless for communication between CPUs; blocking compile-time reordering is sufficient: see [Does the Intel Memory Model make SFENCE and LFENCE redundant?](https://stackoverflow.com/q/32705169)

smp_mb() is a full barrier and does need an actual asm instruction, as well as blocking compile-time reordering.
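A hedged sketch of why: the only runtime reordering x86 does allow is StoreLoad (a later load passing an earlier store still sitting in the store buffer), so the full barrier must contain a real instruction. Recent kernels use a dummy lock-prefixed RMW on the stack rather than mfence because it's cheaper:

```c
/* Simplified from arch/x86/include/asm/barrier.h (64-bit variant);
 * "lock; addl $0,-4(%rsp)" changes no data but is a full barrier. */
#define __smp_mb() \
    __asm__ __volatile__("lock; addl $0,-4(%%rsp)" ::: "memory", "cc")
```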


x86 does have some memory barrier asm instructions for read-only and write-only "real" (runtime) memory barriers. Those are sfence (store fence), lfence (load fence) and mfence (memory fence = full barrier).

mfence serializes both reads and writes (full barrier) while the others only serialize one of the two (reads OR writes, a.k.a. loads OR stores). The Wikipedia page on memory ordering does a decent job of explaining the meaning of those. lfence actually blocks LoadStore reordering, not just LoadLoad, for weakly-ordered movntdqa loads from WC memory. Reordering of other kinds of loads from other memory types is already disallowed, so there's almost never any reason to actually use lfence for memory ordering, as opposed to its other effect of blocking out-of-order exec.
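The one common case where sfence genuinely matters is publishing data written with non-temporal stores. A hypothetical example using the SSE intrinsics (function and variable names are mine):

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_si32, _mm_sfence */

void publish_nt(int *buf, volatile int *ready)
{
    _mm_stream_si32(buf, 42);  /* movnti: weakly-ordered NT store that is
                                  NOT covered by x86's normal ordering   */
    _mm_sfence();              /* order/drain the NT store...            */
    *ready = 1;                /* ...before the flag store is visible    */
}
```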

The kernel uses those actual asm instructions for memory barriers in I/O code: for example mb(), rmb() and wmb(), which expand exactly to mfence, lfence and sfence respectively, among others (example).

sfence and lfence are probably overkill in most cases, for example around MMIO to strongly-ordered UC memory. Writing to WC memory could actually need an sfence. But they're not too slow compared to I/O, and there might be corner cases that would break otherwise, so Linux takes the safe approach.

In addition to this, x86 has different kinds of read/write barriers which may also be faster (such as the one I linked above). A full barrier (what C11 calls sequential consistency) needs either mfence or a dummy locked instruction.
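A quick way to see all of this from C11 (what GCC and Clang emit on x86; the function name is mine):

```c
#include <stdatomic.h>

void fence_costs(void)
{
    atomic_thread_fence(memory_order_acquire); /* x86: zero instructions,
                                                  compiler barrier only  */
    atomic_thread_fence(memory_order_release); /* x86: also free         */
    atomic_thread_fence(memory_order_seq_cst); /* x86: mfence or a dummy
                                                  lock-prefixed RMW      */
}
```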

Marco Bonelli
  • Unless the kernel does any `movnti` stores, it doesn't need `sfence`. And unless it does any `movntdqa` SSE4.1 loads from WC memory, it doesn't need `lfence` for memory order. [Does the Intel Memory Model make SFENCE and LFENCE redundant?](https://stackoverflow.com/q/32705169) (spoiler alert: yes). Blocking compile-time reordering gives you the underlying asm memory model. On x86 that's program order + a store buffer with store-forwarding, which is stronger than C++11 acq_rel. `mfence` is only needed in the rare cases where you need a full barrier. – Peter Cordes Apr 20 '20 at 02:03
  • Linux's real `wmb()` and `rmb()` macros for internal use by the kernel do eventually expand to `asm("":::"memory")` on x86 to block StoreStore or LoadLoad (and LoadStore) reordering, i.e. get release / acquire (not seq_cst) semantics on top of `volatile`. Only an `mb()` full barrier needs any asm instructions. Note that the `CONFIG_X86_PPRO_FENCE` stuff is some weird weakly-ordered x86 model that nothing uses. IDK if hardware as weak as that model ever existed IRL, but modern CPUs are definitely not like that. IIRC that option was removed from Linux recently because it's just noise. – Peter Cordes Apr 20 '20 at 02:03
  • @PeterCordes Sure about that? I don't see that, see [here for example](https://elixir.bootlin.com/linux/v4.9/source/arch/x86/include/asm/barrier.h). I mean, it could surely use `asm("":::"memory")` somewhere, but I didn't spot it defined with that name. – Marco Bonelli Apr 20 '20 at 02:09
  • Yes, I'm absolutely 100% certain that x86's hardware memory model is program-order + store forwarding, and that `sfence` is basically a no-op with no NT stores in flight (AMD gives SFENCE more semantics like MFENCE but Intel doesn't). I'm also certain that last time I traced the chain of Linux's macros, `smp_rmb()` and `smp_wmb()` were pure compiler barriers, and that C11 `atomic_thread_fence(mo_release)` compiles to zero asm instructions. (Linux's `rmb` and `wmb` might be something else; I forget about the smp_ difference. Those are for I/O, as opposed to the SMP ones). – Peter Cordes Apr 20 '20 at 02:13
  • So yes, I named the wrong macros in my 2nd comment, I should have said `smp_rmb`. BTW, https://elixir.bootlin.com/linux/v5.6.5/source/arch/x86/include/asm/barrier.h has that CONFIG_X86_PPRO_FENCE crap removed. And yes, `#define __smp_wmb() barrier()`, where barrier is just that compiler barrier. – Peter Cordes Apr 20 '20 at 02:17
  • @PeterCordes oh yea sure, I found those `smp_{r,w}mb()` [expanding to compile-time barriers](https://elixir.bootlin.com/linux/v4.9/source/include/linux/compiler-gcc.h#L15). Didn't know about that config being removed though it only makes sense that they did. Thanks for the comments as always. – Marco Bonelli Apr 20 '20 at 02:18
  • Note that x86 `lfence` is a LoadStore barrier, not just LoadLoad. Also, I made a very large edit; I felt like your last edit was very misleading and still made it sound like there was some reason to do more than block compile-time reordering. Maybe I should have just written my own answer and left a downvote on yours, let me know if you'd rather I did that. But if you don't mind collaborating on an answer, I think this one is fairly good now. – Peter Cordes Apr 20 '20 at 02:46
  • @PeterCordes I would have been okay either way, but thank you very much. I can see how my original answer could have been very misleading. Your edit makes a lot of sense and clarifies that for the general reader. I wasn't really trying to address the fact that the x86 memory model makes `lfence` and `sfence` useless and allows serialization through compile-time-only ordering, but it makes a lot of sense to explain it. – Marco Bonelli Apr 20 '20 at 02:57

They happen to be the same on x86, but it's possible that on other architectures they'll be different. Thus, to make code portable, even x86 needs separate macros for them.
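To see why call sites still need to say which ordering they want, here is a hedged sketch of the single-producer/single-consumer ring pattern io_uring uses (heavily simplified; the size, types and names are made up):

```c
#define RING_SIZE 256   /* hypothetical; real sizes come from the kernel */

struct entry { int v; };
static struct entry ring[RING_SIZE];
static volatile unsigned head, tail;   /* consumer / producer indices */

void produce(struct entry e)   /* producer: needs StoreStore ordering */
{
    ring[tail & (RING_SIZE - 1)] = e;
    write_barrier();           /* entry visible before the tail bump  */
    tail = tail + 1;
}

struct entry consume(void)     /* consumer: needs LoadLoad ordering   */
{
    while (head == tail)
        ;                      /* spin; volatile forces the reload    */
    read_barrier();            /* see the entry only after the tail   */
    return ring[head++ & (RING_SIZE - 1)];
}
```

On a weakly-ordered ISA, write_barrier() must really block StoreStore reordering and read_barrier() must block LoadLoad; one full barrier on both sides would also be correct, just slower.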

  • Do such architectures exist? – Y. A. Apr 19 '20 at 18:30
  • @Y.A.: Some weakly ordered ISAs have a selection of barriers. https://preshing.com/20120930/weak-vs-strong-memory-models/. e.g. https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html shows the standard mappings from C++11 `std::atomic` memory orders to asm, and PowerPC acquire loads use `ld; cmp; bc; isync` while store-release can use `lwsync; st`. But for acquire or release *fences* like `smp_rmb` and `smp_wmb`, both would use `lwsync` which blocks reordering other than StoreLoad. – Peter Cordes Apr 20 '20 at 02:21
  • @Y.A.: Or apparently ARMv8 can use `DMB ISH LD` (block load reordering) for acquire barriers vs. `dmb ish` (block everything) for a release barrier. Or SPARC (if you aren't using SPARC-TSO strong memory ordering) has `membar #LoadLoad | #LoadStore` or `membar #LoadStore | #StoreStore`. See https://preshing.com/20130922/acquire-and-release-fences/. – Peter Cordes Apr 20 '20 at 02:24
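Putting those comments together, a hedged sketch of how the question's macros might be ported (instruction choices taken from the comments above, not from the actual liburing sources):

```c
#if defined(__x86_64) || defined(__i386__)
/* strong memory model: compile-time barriers are enough */
#define read_barrier()  __asm__ __volatile__(""          ::: "memory")
#define write_barrier() __asm__ __volatile__(""          ::: "memory")
#elif defined(__aarch64__)
/* ARMv8: load-only barrier for acquire, full dmb for release */
#define read_barrier()  __asm__ __volatile__("dmb ishld" ::: "memory")
#define write_barrier() __asm__ __volatile__("dmb ish"   ::: "memory")
#else
#error "port me: pick the right fences for this ISA"
#endif
```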