
The Linux kernel uses `lock; addl $0,0(%%esp)` as a write barrier, while the RE2 library uses `xchgl (%0),%0` as a write barrier. What's the difference, and which is better?

Does x86 also require read-barrier instructions? RE2 defines its read barrier as a no-op on x86, while Linux defines it as either `lfence` or a no-op depending on whether SSE2 is available. When is `lfence` required?
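For reference, a rough sketch of the two write-barrier styles as GCC inline-asm helpers on 32-bit x86 (the helper names are made up; only the asm strings are the ones quoted above):

```c++
// Sketch only: helper names are invented; 32-bit x86, GCC/Clang inline asm.

// Linux-style write barrier: a dummy locked add of 0 to the word at the top of the stack.
static inline void wmb_lock_add() {
    __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory", "cc");
}

// RE2-style write barrier: xchg with a dummy local; the lock prefix is implicit for xchg.
static inline void wmb_xchg() {
    int dummy = 0;
    void *p = &dummy;  // %0 serves as both the address and the value being exchanged
    __asm__ __volatile__("xchgl (%0),%0" : "+r"(p) : : "memory");
}
```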

Hongli

5 Answers


Quoting from the IA32 manuals (Vol 3A, Chapter 8.2: Memory Ordering):

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles [..]

  • Reads are not reordered with other reads
  • Writes are not reordered with older reads
  • Writes to memory are not reordered with other writes, with the exception of
    • writes executed with the CLFLUSH instruction
    • streaming stores (writes) executed with the non-temporal move instructions ([list of instructions here])
    • string operations (see Section 8.2.4.1)
  • Reads may be reordered with older writes to different locations but not with older writes to the same location.
  • Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions
  • Reads cannot pass LFENCE and MFENCE instructions
  • Writes cannot pass SFENCE and MFENCE instructions

Note: The "In a single-processor system" above is slightly misleading. The same rules hold for each (logical) processor individually; the manual then goes on to describe the additional ordering rules between multiple processors. The only bit about it pertaining to the question is that

  • Locked instructions have a total order.

In short, as long as you're writing to write-back memory (which is all the memory you'll ever see unless you're a driver or graphics programmer), x86 is almost sequentially consistent - the only reordering an x86 CPU can perform is letting later (independent) reads execute before earlier writes. The main thing about the write barriers is that they have a lock prefix (implicit or explicit), which forbids all reordering and ensures that the operation is seen in the same order by all processors in a multi-processor system.

Also, in write-back memory, reads are never reordered with other reads, so there's no need for read barriers. Recent x86 processors have a weaker memory-consistency model for streaming stores and write-combined memory (commonly used for mapped graphics memory). That's where the various fence instructions come into play; they're not necessary for any other memory type, but some drivers in the Linux kernel do deal with write-combined memory, so they just defined their read barrier that way.

The list of ordering models per memory type is in Section 11.3.1 in Vol. 3A of the IA-32 manuals. Short version: Write-Through, Write-Back and Write-Protected memory allow speculative reads (following the rules detailed above); Uncacheable and Strong Uncacheable memory have strong ordering guarantees (no processor reordering, reads/writes execute immediately; used for MMIO); and Write-Combined memory has weak ordering (i.e. relaxed ordering rules that need fences).
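To make that one allowed reordering concrete, here is the classic store/load litmus test in C++ (a sketch; the variable and function names are illustrative). Without a full barrier between each thread's store and its subsequent load, both threads can observe 0:

```c++
#include <atomic>

// StoreLoad litmus test: with only relaxed ordering, r1 == r2 == 0 is possible,
// because each CPU may let its load execute before its own store is globally visible.
std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1, std::memory_order_relaxed);
    // a full barrier here (a locked instruction or mfence) would forbid r1 == r2 == 0
    r1 = y.load(std::memory_order_relaxed);
}

void thread2() {
    y.store(1, std::memory_order_relaxed);
    r2 = x.load(std::memory_order_relaxed);
}
```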

Fabian Giesen

The "lock; addl $0,0(%%esp)" is faster in case that we testing the 0 state of lock variable at (%%esp) address. Because we add 0 value to lock variable and the zero flag is set to 1 if the lock value of variable at address (%%esp) is 0.


lfence from Intel datasheet:

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. This serializing operation guarantees that every load instruction that precedes in program order the LFENCE instruction is globally visible before any load instruction that follows the LFENCE instruction is globally visible.

(Editor's note: mfence or a locked operation is the only useful fence (after a store) for sequential consistency. lfence does not block StoreLoad reordering by the store buffer.)


For instance, memory write instructions like `mov` are atomic (they don't need a lock prefix) if they are properly aligned. But such a store normally completes into the CPU's cache and may not yet be globally visible to all other threads at that moment; a memory fence must be performed first to make this thread wait until its previous stores are visible to other threads.


So the main difference between the two instructions is that `xchgl` does not affect the condition flags. Certainly we can test the lock variable's state with a `lock cmpxchg` instruction, but this is still more complex than with a `lock add $0` instruction.
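To illustrate just this answer's point about FLAGS, here is a sketch of capturing ZF from the locked add (GCC inline asm; the function name is invented):

```c++
// Sketch: `lock add $0` updates FLAGS from the result, so ZF reports whether the
// variable was 0; `xchg` would leave FLAGS untouched and give no such information.
static inline bool was_zero_lock_add(int *lockvar) {
    unsigned char zf;
    __asm__ __volatile__("lock; addl $0,%1\n\t"
                         "setz %0"
                         : "=q"(zf), "+m"(*lockvar)
                         : : "memory", "cc");
    return zf;  // true if *lockvar held 0
}
```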

GJ.
  • If I write to shared memory and call `lock; addl $0,0(%%esp)` or `sfence`, do I need to call `lfence` in the other process/thread before reading the memory? Or does the lock/sfence instruction by itself already guarantee that other CPUs see the data? – Hongli Nov 21 '10 at 09:42
  • Yes, the lock prefix guarantees that the result of the instruction is immediately globally visible. – GJ. Nov 21 '10 at 12:12
  • Suppose the CPU supports SSE but not SSE2. I use `sfence` but cannot use `lfence`. Do I need to use `lock; add` as a read barrier, or can I get away with not using a read barrier? – Hongli Nov 21 '10 at 17:41
  • It depends on how and in which ring your instructions are executed. The `lfence` instruction is normally used in the kernel (ring 0). If the CPU doesn't support `lfence`, then application programs and threads must use `sfence` after a store performed with `mov`, because the kernel can interrupt application programs and threads after any CPU instruction, and the changed data and instructions may still be in cache. So you can use `lock add $0,...` in the kernel and `mov $1,...` + `sfence` in application programs and threads. – GJ. Nov 21 '10 at 19:05
  • My instructions are executed in userspace. So if I use `lock; add` as a write barrier, then on the reading side I don't have to use any special read barrier instruction, and a simple compiler barrier will suffice, right? – Hongli Nov 24 '10 at 11:03
  • Yes, but if you have more than one writer, you must use `lock cmpxchg` and, after the instruction executes, test the zero flag to see whether it succeeded, because if some other thread is using the reserved memory region you must wait for it to finish. – GJ. Nov 24 '10 at 12:42
  • You can also use `lock xadd` instead of `lock add`, because the value returned after execution is the previous state of the lock. – GJ. Nov 24 '10 at 12:45
  • `lfence` is never useful for memory ordering unless you're reading from video RAM (or some other WC weakly-ordered region) with MOVNTDQA loads. Serializing out-of-order execution (but not the store buffer) isn't useful to stop StoreLoad reordering (the only kind that x86's strong memory model allows for normal WB (write-back) memory regions). You need either a locked instruction, xchg-mem, or `mfence` (not `lfence`). This is why people use `lock addl $0, (%esp)` as a stand-alone fence with no side-effects (except clobbering FLAGS) when you aren't already doing any other atomic RMW ops. – Peter Cordes Oct 20 '18 at 22:16
  • As @PeterCordes said, this answer misunderstands the point of the operations in the context the OP is asking about. They are _write barriers_, and the atomic operation is only a "no op" side effect, there in order to get the memory-ordering effect. Essentially, it's a cheaper way to do an `mfence`. The `addl` is a no-op because it adds zero, which does nothing. The `xchg` isn't obviously a no-op based on the small snippet provided, but if you look into the RE2 library, perhaps it is to a dummy location, or perhaps the value is already known to contain that value. Otherwise, the two aren't even comparable! – BeeOnRope Oct 21 '18 at 00:14
  • @BeeOnRope: Or maybe RE2 is using `xchg` to do a seq_cst store + barrier, like compilers do, like I described in my answer. – Peter Cordes Oct 21 '18 at 16:41
  • @Peter, yes I considered that, but the particular form `xchgl (%0),%0` doesn't seem like it could do that since the same placeholder is on both sides. It exchanges what is at a given address with the value of the address (I think?), which seems mostly useless except as a dummy operation for the ordering side effects. – BeeOnRope Oct 21 '18 at 16:55
  • @BeeOnRope: oh right, I forgot that detail. Yeah that is weird; probably there's a dummy variable, hopefully thread-private. – Peter Cordes Oct 21 '18 at 17:56

lock addl $0, (%esp) is a substitute for mfence, not lfence.

(lock add is generally faster on modern CPUs, especially Intel Skylake with updated microcode where mfence acts like lfence as well, blocking out-of-order exec even of instructions on registers. That's why GCC recently switched to using a dummy lock add instead of mfence when it needs a full barrier.)

The use-case is when you need to block StoreLoad reordering (the only kind that x86's strong memory model allows), but you don't need an atomic RMW operation on a shared variable. https://preshing.com/20120515/memory-reordering-caught-in-the-act/

e.g. assuming aligned std::atomic<int> a,b, where the default memory_order is seq_cst

movl   $1, a           # a = 1;    Atomic for aligned a
# barrier needed here between seq_cst store and later loads
movl   b, %eax         # tmp = b;  Atomic for aligned b
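In C++ source terms, that corresponds to something like this sketch (variable names chosen to match the asm above):

```c++
#include <atomic>

std::atomic<int> a{0}, b{0};   // assumed aligned; the default memory order is seq_cst

int store_then_load() {
    a.store(1);                // seq_cst store: compiled as xchg, or mov plus a full barrier
    return b.load();           // seq_cst load: a plain mov on x86
}
```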

Your options are:

  • Do a sequential-consistency store with xchg, e.g. mov $1, %eax / xchg %eax, a so you don't need a separate barrier; it's part of the store. I think this is the most efficient option on most modern hardware; C++11 compilers other than gcc use xchg for seq_cst stores. (See Why does a std::atomic store with sequential consistency use XCHG? re: performance and correctness.)

  • Use mfence as a barrier. (gcc used mov + mfence for seq_cst stores, but recently switched to xchg for performance.)

  • Use lock addl $0, (%esp) as a barrier. Any locked instruction is a full barrier, but this one has no effect on register or memory contents except FLAGS. See Does lock xchg have the same behavior as mfence?

    (Or to some other location, but the stack is almost always private and hot in L1d, so it's a good candidate. Later reloads of whatever was using that space couldn't start until after the atomic RMW anyway because it's a full barrier.)

You can only use xchg as a barrier by folding it into a store because it unconditionally writes the memory location with a value that doesn't depend on the old value.

When possible, using xchg for a seq-cst store is probably best, even though it also reads from the shared location. mfence is slower than expected on recent Intel CPUs (Are loads and stores the only instructions that gets reordered?), also blocking out-of-order execution of independent non-memory instructions the same way lfence does.

It might even be worth using lock addl $0, (%esp)/(%rsp) instead of mfence even when mfence is available, but I haven't experimented with the downsides. Using -64(%rsp) or something might make it less likely to lengthen a data dependency on something hot (a local or a return address), but that can make tools like valgrind unhappy.


lfence is never useful for memory ordering unless you're reading from video RAM (or some other WC weakly-ordered region) with MOVNTDQA loads.

Serializing out-of-order execution (but not the store buffer) isn't useful to stop StoreLoad reordering (the only kind that x86's strong memory model allows for normal WB (write-back) memory regions).

The real-world use-cases for lfence are for blocking out-of-order execution of rdtsc for timing very short blocks of code, or for Spectre mitigation by blocking speculation through a conditional or indirect branch.
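As a sketch of the rdtsc use-case (intrinsics from <x86intrin.h>; the fence placement follows the common pattern, not any particular codebase):

```c++
#include <x86intrin.h>
#include <cstdint>

// Time a short block of code. lfence blocks out-of-order execution, so rdtsc
// can't execute before earlier work finishes or after the timed work starts.
uint64_t time_short_block() {
    _mm_lfence();
    uint64_t start = __rdtsc();
    _mm_lfence();

    // ... code under test ...

    _mm_lfence();
    uint64_t stop = __rdtsc();
    _mm_lfence();
    return stop - start;
}
```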

See also When should I use _mm_sfence _mm_lfence and _mm_mfence (my answer and @BeeOnRope's answer) for more about why lfence is not useful, and when to use each of the barrier instructions. (Or in mine, the C++ intrinsics when programming in C++ instead of asm).

Peter Cordes

As an aside to the other answers, the HotSpot devs found that `lock; addl $0,0(%%esp)` with a zero offset may not be optimal: on some processors it can introduce false data dependencies (see the related JDK bug).

Touching a stack location with a different offset can improve performance under some circumstances.
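For illustration only, a variant with a nonzero offset might look like this (the -8 is an arbitrary choice, not the offset HotSpot settled on):

```c++
// Hypothetical variant: locked add of 0 below the stack pointer instead of at 0(%esp)/0(%rsp).
// On x86-64 SysV the -8 offset lands in the red zone; the add of 0 writes back the same value.
static inline void full_barrier() {
    __asm__ __volatile__("lock; addl $0,-8(%%rsp)" : : : "memory", "cc");
}
```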

the8472

The important part of `lock; addl` and `xchgl` is the lock prefix; it's implicit for `xchgl`. There is really no difference between the two. I'd look at how they assemble and choose the one that's shorter (in bytes), since that's usually faster for equivalent operations on x86 (hence tricks like `xorl eax,eax`).

The presence of SSE2 is probably just a proxy for the real condition, which is ultimately a function of `cpuid`. It probably turns out that SSE2 implies the existence of `lfence`, and the availability of SSE2 was checked/cached at boot. `lfence` is required when it's available.
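As a sketch of the kind of runtime dispatch described here and in the question (using GCC's `__builtin_cpu_supports`; this mirrors the question's description, not the kernel's actual code):

```c++
// Sketch: issue lfence when SSE2 is available, otherwise fall back to a
// compiler-only barrier (no instruction), as the question describes Linux doing.
static inline void read_barrier() {
    if (__builtin_cpu_supports("sse2"))
        __asm__ __volatile__("lfence" : : : "memory");
    else
        __asm__ __volatile__("" : : : "memory");   // compiler barrier only
}
```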

Ben Jackson
  • The instruction `lfence` is part of the SSE2 instruction set. It's not a proxy. – Pascal Cuoq Nov 20 '10 at 22:48
  • `lfence` is not required for memory ordering unless you're doing `movntdqa` weakly-ordered loads from WC memory (e.g. from video RAM). `mfence` is an alternative full barrier which you could substitute for `addl $0, (%esp)`, but `lfence` is not strong enough to stop StoreLoad reordering. You definitely never need both. (And BTW, `mfence` is quite slow and has a bigger impact on OoO exec than `xchg` or `lock`ed instruction on Intel CPUs: [Are loads and stores the only instructions that gets reordered?](https://stackoverflow.com/a/50496379)) – Peter Cordes Oct 20 '18 at 22:11