1

A follow-up question to Why does this `std::atomic_thread_fence` work

Since a dummy interlocked operation is better than `_mm_mfence`, and there are quite a few ways to implement it, which interlocked operation should be used, and on what data?

Assume the implementation uses inline assembly that is not aware of the surrounding context, but can tell the compiler which registers it clobbers.

Alex Guteniev
  • 12,039
  • 2
  • 34
  • 79

2 Answers

1

Short answer for now, without going into too much detail about why. See specifically the discussion in comments on that linked question.

`lock orb $0, -1(%rsp)` is probably a good bet to avoid lengthening dependency chains for local vars that get spilled/reloaded. See https://shipilev.net/blog/2014/on-the-fence-with-dependencies for benchmarks. On Windows x64 (no red zone), that space should be unused except by future `call` or `push` instructions.
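For concreteness, a minimal sketch of that as GNU C inline asm (GCC/Clang; the function name is made up, and this is not any compiler's actual `thread_fence` recipe):

    // Full barrier via a dummy locked RMW on the byte just below the stack
    // pointer.  The "memory" clobber blocks compile-time reordering; the
    // lock-prefixed RMW is the run-time barrier.  FLAGS ("cc") is clobbered
    // implicitly by every GNU C inline asm statement anyway.
    static inline void seq_cst_fence() {
        asm volatile("lock orb $0, -1(%%rsp)" ::: "memory");
    }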

Store forwarding to the load side of a locked operation might be a thing (if that space was recently used), so keeping the locked operation narrow is good. But since it's a full barrier, I don't expect there can be any store forwarding from its output to anything else, so unlike the normal case, a narrow (1-byte) `lock orb` doesn't have that downside.

`mfence` is pretty crap compared to a locked operation on a hot line of stack space, even on Haswell; it's probably worse on Skylake, where it even blocks OoO exec. (It's also bad on AMD compared to `lock add`.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Is reserving a stack variable useful at all? If not that much, maybe the `lock not` mentioned [in MSVC PR comment](https://github.com/microsoft/STL/issues/739#issuecomment-618737630) is the way to go? – Alex Guteniev Jun 12 '20 at 04:29
  • Another direction to consider is that _A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction._ - see the answer on https://stackoverflow.com/questions/61336070/how-to-use-intel-tsx-with-c-memory-model, but of course only when TSX is great again – Alex Guteniev Jun 12 '20 at 04:39
  • 1
    @AlexGuteniev: reserving stack space for a variable nothing else uses would avoid lengthening loop-carried dep chains involving other vars. I was assuming you weren't doing that, because that's mostly a workaround for an insufficiently smart compiler. Also, you asked about inline asm. GNU C inline asm statements implicitly clobber `"cc"` so it's not useful to avoid clobbering FLAGS. Interesting point if you consider the compiler emitting it natively. – Peter Cordes Jun 12 '20 at 04:58
  • @AlexGuteniev: Interesting point about an empty TSX transaction. IDK if that would be any faster. And yes, that would only work for some target CPUs. – Peter Cordes Jun 12 '20 at 05:00
  • I did not know that with gcc asm it is still impossible to tell the compiler that it does not clobber `flags`. (I'm looking at potentially porting my program to Linux, or compiling it with clang under Windows, but I don't have much experience with gcc asm, since MSVC doesn't have it.) – Alex Guteniev Jun 12 '20 at 05:07
  • 1
    @AlexGuteniev: I don't think you'd want to actually use inline asm for this unless the performance delta was really significant. Long term you'd just change the compiler's internal x86 / x86-64 recipe for `thread_fence`, which I think is a hard-coded string of asm that it just emits. But in GCC internals, not inline asm, I think you *could* meaningfully omit a `"cc"` clobber. IDK, there's probably no real downside to using your own `atomic_thread_fence` with inline asm; GCC isn't smart enough to optimize away redundant `mfence` anyway for e.g. `seq_cst` store + `thread_fence` back to back – Peter Cordes Jun 12 '20 at 05:24
  • I've added my own answer, mostly to discuss `lock not [esp-1]` on Windows, and a deliberate extra store of a dummy variable to benefit from store forwarding – Alex Guteniev Jun 13 '20 at 08:20
  • 1
    As I see now that your `lock orb $0, -1(%rsp)` is perfect as a context-free, no-variable solution, will you go back to suggesting it to gcc? – Alex Guteniev Jun 13 '20 at 11:13
  • @AlexGuteniev: I think they ruled it out based on pain for code analysis tools; IDK if a small perf improvement would be considered to outweigh that; probably not. – Peter Cordes Jun 13 '20 at 11:16
  • The Boost.Atomic maintainer observed that gcc does not emit `mov` + `mfence` anymore for a seq_cst store ([code comment link](https://github.com/boostorg/atomic/blob/b36797be8d7e9f8084391de279a88dad35484afb/include/boost/atomic/detail/ops_gcc_atomic.hpp#L101-L112)). Maybe they are willing to reconsider the thread fence too, as `mfence` only gets worse. And if that analysis tool is concerned with some variable aliasing, then an extra-variable solution could be an option. – Alex Guteniev Jun 13 '20 at 11:32
1

When going the route of an interlocked operation on a dummy location, there are a few things to consider:

  1. The location being in L1d of this core
  2. The location not being used by other cores
  3. Not creating long dependency chains
  4. Not stalling due to a store-forwarding miss

Without the context, anything is only a guess, so the goal is to make the best guess possible.

A place near the top of the stack is a good guess for 1 and 2.

A deliberately allocated stack variable is likely to take care of 3, and as there are no other stores to it in flight, 4 is not a problem. The best operation looks like `lock not`.
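A minimal sketch of this variant as GNU C inline asm (the function name is hypothetical; the `"+m"` constraint forces the dummy byte into memory, i.e. onto the stack):

    // Dummy local byte: nothing else ever loads or stores it, so the locked
    // RMW cannot lengthen any real dependency chain, and there is no store
    // in flight that could cause a store-forwarding stall.
    static inline void fence_via_dummy_local() {
        unsigned char dummy = 0;
        asm volatile("lock notb %0" : "+m"(dummy) :: "memory");
    }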

Not allocating a stack variable requires the operation to be effectively a no-op, so `lock or [mem], 0` is a good option. The operand should be a byte to avoid problems with 4. For 3, it is always a guess. (The return address could have been used, but assembly without the context does not know where it is; MSVC's `_AddressOfReturnAddress` may be a good idea, though, as sketched below.)
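As a hedged illustration of the `_AddressOfReturnAddress` idea (MSVC intrinsics, function name made up; untested, and note the later `ret` reloads that slot, so the RMW is not entirely dependency-free):

    #include <intrin.h>

    // Locked OR of 0 into the return-address slot: hot in L1d, private to
    // this thread, and the stored value is left unchanged.
    inline void fence_via_return_address() {
        _InterlockedOr8((char volatile*)_AddressOfReturnAddress(), 0);
    }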

I've read about the red zone. Its absence on Windows enables extra options.

`lock not byte ptr [esp-1]` without an extra variable is good on Windows, since the data below the stack pointer is considered volatile and must not be relied on. There are no spilled registers there, so no false data dependency.

An ABI with a 128-byte red zone precludes `lock not byte ptr [esp-1]`, and going more than 128 bytes below the stack pointer is likely far enough to miss in L1d. Still, since the red zone is not as likely to be in use as the usual stack, the answer given by @Peter Cordes looks good.

TSX is questionable primarily due to its possible absence (unsupported on a given CPU, or disabled by an errata fix or a security mitigation). Only RTM will exist in the foreseeable future (Has Hardware Lock Elision gone forever due to Spectre Mitigation?). According to the RTM overview, an empty RTM transaction is still a fence, so it can be used:

A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK prefixed instruction.

Beware of failed transactions or unsupported RTM. The pseudocode seems to be as follows:

    if (rtm_supported && _xbegin() == 0xFFFFFFFF)
      _xend();
    else
      dummy_interlocked_op();
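A slightly more complete sketch of that fallback (GCC/Clang flavor, compiled with `-mrtm`; the function names are placeholders, and `dummy_interlocked_op` stands in for one of the dummy locked operations above):

    #include <immintrin.h>
    #include <cpuid.h>

    // Fallback: locked no-op RMW on a dummy stack byte.
    static inline void dummy_interlocked_op() {
        unsigned char dummy = 0;
        asm volatile("lock notb %0" : "+m"(dummy) :: "memory");
    }

    // One-time RTM detection: CPUID.(EAX=7, ECX=0):EBX bit 11.
    static bool detect_rtm() {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return false;
        return (ebx >> 11) & 1;
    }

    inline void fence_via_rtm_or_fallback() {
        static const bool rtm_supported = detect_rtm();
        if (rtm_supported && _xbegin() == _XBEGIN_STARTED)
            _xend();                 // empty committed transaction: full fence
        else
            dummy_interlocked_op();  // RTM absent or transaction aborted
    }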
Alex Guteniev
  • 12,039
  • 2
  • 34
  • 79
  • 1
    You're forgetting about the cost in instructions / uops. An extra store seems like pure downside. If there is any in-flight store to that byte, it can *already* store-forward to the speculative early load part of `lock not byte ptr [esp-1]`. Modern CPUs can forward from a wide store to a byte reload of any of its bytes. https://blog.stuffedcow.net/2014/01/x86-memory-disambiguation/. But most of the time there won't be. Having the early load just hit in L1d is just as good as store-forwarding; adding another store to be drained is pure downside. – Peter Cordes Jun 13 '20 at 10:41
  • TL:DR: `byte` operand size already makes sure that store-forwarding stalls aren't a problem from whatever might possibly have been in flight beforehand. That's not the case with wider operand-size, but even then reserving your own variable would mean any problem couldn't happen in a tight loop. – Peter Cordes Jun 13 '20 at 10:44
  • Edited. Now looks better? – Alex Guteniev Jun 13 '20 at 10:56
  • 1
    You missed one instance of suggesting a `mov` store; I fixed that for you. Now yes, it's better. – Peter Cordes Jun 13 '20 at 11:01