clwb+sfence, can we remove sfence if writes are cache-line aligned?

Question

As per information on clwb ordering (link),

"CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back. CLWB instruction need not be ordered by another CLWB or CLFLUSHOPT instruction. CLWB is implicitly ordered with older stores executed by the logical processor to the same address."

If the sequence of operations on an Intel x86-64 CPU is as follows, can I remove the "sfence" and still ensure correctness if write(A) and write(B) are cache-line aligned?

I am asking because on Intel, write(A) and write(B) are ordered (TSO), and write(A)->clwb(A) and write(B)->clwb(B) are ordered as per the description of clwb quoted above.

write(A)
clwb(A)
sfence()
write(B)
clwb(B)

I am making the following assumptions:

- The compiler does not reorder these operations.
- The clwb() instruction writes the dirty line back all the way to the persistence domain, so the write(A)->clwb(A) pair ensures that the modified value of A is in the persistence domain.

Please tell me whether removing the sfence can break correctness and, if so, in which scenarios. Thanks.

For normal stores to WB memory that are both within the same cache line: yes persistence order matches x86-TSO global-observability order. [Is clflush or clflushopt atomic when system crash？](https://stackoverflow.com/q/65439089). Otherwise that's not guaranteed. What do you mean by cache-line aligned? Two separate 512-bit ZMM stores to two separate cache lines? — Peter Cordes, May 18 '21 at 03:50
Yes, I have seen the post that you referred to. I meant that the store is cache-line aligned, since clwb() works at cache-line granularity. My question is whether clwb() ensures that the dirty cache line is written back as far as the memory controller's write buffer (I am assuming the memory controller's write queue is part of the persistence domain). — Arun Kp, May 18 '21 at 04:15
So you mean A is fully contained within one cache line, and B within a separate one? If so, say that, not "cache-line aligned". I'm pretty sure without SFENCE, after a crash it would be possible to see the effect of B but not A. `clwb` isn't ordered, so the later one could make its store persistent first. That's what the manual is pointing out with clwb's lack of ordering wrt. normal stores. — Peter Cordes, May 18 '21 at 04:26
Yes, I meant that A and B are contained in separate cache lines; I am sorry for the confusion. You pointed out exactly my doubt. If the effect of B is visible after a crash, that means clwb(B) happened. So according to TSO, write(B) having happened means write(A) happened (maybe it is in the store buffer). According to the manual, "CLWB is implicitly ordered with older stores executed by the logical processor to the same address.", i.e. clwb(A) happens only after write(A), so won't program order ensure that clwb(A) happened, given that write(B) happened and is visible after a crash? Thanks @Peter — Arun Kp, May 18 '21 at 05:07
I am seeing it as a transitive ordering, meaning write(A)->write(B) are ordered and write(B)->clwb(B) are ordered, so how can clwb(B) bypass write(B) [thus violating the ordering constraint in the manual] and happen before clwb(A), thus making the effect of clwb(B) visible after a crash but not that of clwb(A)? — Arun Kp, May 18 '21 at 05:12
Re: store buffer: no, x86-TSO ordering is about commit from the store buffer to L1d, the point of global observability. That's of course totally separate from eventual write-back (via eviction or clwb) to DRAM. — Peter Cordes, May 18 '21 at 05:18
Thanks for the clarification. What is the exact meaning of this line, then: "CLWB is implicitly ordered with older stores executed by the logical processor to the same address." — Arun Kp, May 18 '21 at 05:35
I was writing an answer, since your comments finally gave enough clues to what you were missing in the documentation. — Peter Cordes, May 18 '21 at 05:39

score 3 · Accepted Answer · answered May 18 '21 at 05:39

For normal stores to WB memory that are both within the same cache line: yes, persistence order matches x86-TSO global-observability order; see [Is clflush or clflushopt atomic when system crash？](https://stackoverflow.com/q/65439089). Otherwise that's not guaranteed.

It seems you mean A is fully contained within one cache line, and B within a separate one.

Without SFENCE, after a crash it would be possible to see the effect of B but not A. clwb isn't ordered, so the later one could make its store persistent first. That's what the manual is pointing out with clwb's lack of ordering wrt. normal stores.

"So according to TSO write(B) happened means write(A) happened (may be it is in store buffer)."

No, x86-TSO ordering is about the order of commit from the store buffer to L1d, the point of global observability. That's of course totally separate from eventual write-back (via eviction or clwb) to DRAM. Store uops can execute (write their address+data to the store buffer) in any order, but can't commit until after retirement (i.e. when they're non-speculative). Additionally, that commit is restricted to happen in program order, i.e. the order store-buffer entries were allocated in during issue/rename/allocate.

"meaning write(A)->write(B) are ordered and write(B)->clwb(B) are ordered, so how can clwb(B) bypass write(B) [thus violating the order constraint of manual] and happen before clwb(A), thus causing effect of clwb(B) visible after a crash and not clwb(A)?"

No, the "implicitly ordered with older stores ... to the same address" rule only guarantees that a store + clwb to the same address will write back a version of the line that includes that store's data. Otherwise it could write back a copy of the line while the latest store was still in the store buffer or not even executed. It doesn't mean that the whole write-back has to finish before any later stores!

The order of operations that produces B but not A visible after a crash is the following:

1. Stores A and B execute in some order.
2. A and B commit to L1d cache once this core has MESI exclusive ownership of their respective lines, becoming globally visible to other cores.
3. The clwb instructions execute at some point, requesting that the cache lines be written back to DRAM at some point after the stores commit.
4. Write-back of line A starts at some point after it commits to L1d, and the same goes for line B. They could start in either order, since clwb isn't guaranteed to be ordered wrt. other clwb operations on other lines, although in practice they likely start in program order.
5. clwb-B finishes, and line B becomes persistent.
6. The machine loses power before the in-flight clwb-A makes it to the persistence domain. You didn't request that the clwb operations be ordered wrt. each other, so this is allowed.
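The sequence above can be checked with a toy model. The following is only a sketch under simplifying assumptions, not a description of real hardware: each store has a single "commit" event, each clwb has a single "persist" event, and a crash may happen between any two events. The names `order_is_allowed` and `b_without_a_reachable` are hypothetical, and the effect of the fence is modelled as "line A persists before store B commits".

```c
/* Events in a deliberately simplified model (an assumption, not Intel's spec). */
enum { COMMIT_A, COMMIT_B, PERSIST_A, PERSIST_B, N_EVENTS };

/* Return 1 if the event positions in pos[] respect the model's ordering rules. */
static int order_is_allowed(const int pos[N_EVENTS], int with_sfence)
{
    if (pos[COMMIT_A] >= pos[COMMIT_B])  return 0; /* TSO: stores commit in program order */
    if (pos[COMMIT_A] >= pos[PERSIST_A]) return 0; /* clwb A writes back data that includes store A */
    if (pos[COMMIT_B] >= pos[PERSIST_B]) return 0; /* same rule for store B and clwb B */
    if (with_sfence && pos[PERSIST_A] >= pos[COMMIT_B])
        return 0;                                  /* modelled effect of sfence after clwb A */
    return 1;
}

/* Return 1 if some allowed order persists line B before line A, so that a
 * crash between the two persist events leaves B durable but A lost. */
int b_without_a_reachable(int with_sfence)
{
    int pos[N_EVENTS];

    /* Try every assignment of the four events to four time slots. */
    for (pos[0] = 0; pos[0] < N_EVENTS; pos[0]++)
    for (pos[1] = 0; pos[1] < N_EVENTS; pos[1]++)
    for (pos[2] = 0; pos[2] < N_EVENTS; pos[2]++)
    for (pos[3] = 0; pos[3] < N_EVENTS; pos[3]++) {
        unsigned used = (1u << pos[0]) | (1u << pos[1]) |
                        (1u << pos[2]) | (1u << pos[3]);
        if (used != 0xFu)
            continue;                              /* not a permutation of the slots */
        if (!order_is_allowed(pos, with_sfence))
            continue;
        if (pos[PERSIST_B] < pos[PERSIST_A])
            return 1;                              /* crash here: B persistent, A not */
    }
    return 0;
}
```

In this model, `b_without_a_reachable(0)` finds the order commit A, commit B, persist B, persist A, while `b_without_a_reachable(1)` finds no such order because the modelled fence forces line A to persist before store B commits.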

In terms of asm instruction reordering, the following reordering is allowed:

 store A
 store B
 clwb  B
 clwb  A     ; not ordered wrt. store B or clwb B

Of course order of execution vs. reaching the end of the store buffer vs. actual persistent commit are all separate things at least in theory, but if you want to simplify it to all steps of an instruction happening before any effects of another instruction, this reordering is still compatible with all the rules.

I think the key thing you're missing is that clwb A is a separate operation from store A, it doesn't stay stuck to it. That clwb is allowed to "happen" after other later stores. store B is to a different address, so it doesn't order clwb A.

An SFENCE can prevent this.
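As a rough illustration, the fenced sequence from the question could be written in C roughly as follows. This is a sketch under assumptions, not a vetted persistent-memory routine: `struct record`, `flush_line`, `store_fence`, and `persist_a_then_b` are hypothetical names, `_mm_clwb` is only used when the compiler is invoked with CLWB support enabled (e.g. `-mclwb`), and the fallbacks exist solely so the example builds elsewhere.

```c
#include <stdint.h>
#include <stdatomic.h>
#if defined(__SSE__)
#include <immintrin.h>
#endif

/* Hypothetical layout: A and B each sit in their own 64-byte cache line. */
struct record {
    _Alignas(64) uint64_t a;
    _Alignas(64) uint64_t b;
};

/* Request write-back of the cache line containing p. */
static inline void flush_line(const void *p)
{
#if defined(__CLWB__)
    _mm_clwb((void *)p);
#else
    (void)p; /* assumption: built without CLWB support, so nothing is flushed */
#endif
}

/* Store fence: orders earlier stores and clwb before later ones. */
static inline void store_fence(void)
{
#if defined(__SSE__)
    _mm_sfence(); /* also acts as a compiler barrier */
#else
    atomic_thread_fence(memory_order_release); /* non-x86 stand-in only */
#endif
}

void persist_a_then_b(struct record *r, uint64_t va, uint64_t vb)
{
    r->a = va;          /* write(A) */
    flush_line(&r->a);  /* clwb(A)  */
    store_fence();      /* sfence(): keeps clwb(A) ahead of write(B) and clwb(B) */
    r->b = vb;          /* write(B) */
    flush_line(&r->b);  /* clwb(B)  */
}
```

This mirrors the question's sequence exactly, so nothing orders clwb(B) against later code; a second `store_fence()` after the last flush would presumably be needed if subsequent stores must not become persistent before B.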

Thanks for the clarification. One last small follow-up question: when we say instructions are committed in PO (program order), does PO mean any valid order that the compiler can generate based on the memory consistency model (like the asm example that you mentioned), or is it the order in which the programmer wrote the program, i.e. no reordering? — Arun Kp, May 18 '21 at 06:15
@ArunKp: That means asm instructions, nothing more nothing less. If you're not writing asm by hand, then yes, the order of the asm instructions depends on the language's memory model. So you'd need a compiler barrier like GNU C `asm("":::"memory")` or C++ `atomic_signal_fence(mo_release)` to order stores wrt. each other in the asm. Of course, `_mm_sfence()` is a compiler barrier as well as emitting an asm fence. See [When should I use \_mm\_sfence \_mm\_lfence and \_mm\_mfence](https://stackoverflow.com/a/50780314) — Peter Cordes, May 18 '21 at 06:23
@ArunKp: and see also https://preshing.com/20120625/memory-ordering-at-compile-time/ — Peter Cordes, May 18 '21 at 06:25
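
The two compiler barriers mentioned in the comments above could be used roughly like this in C. This is a minimal sketch: the variables `payload` and `ready` and both function names are hypothetical, and the barriers only constrain the compiler's ordering of the stores in the generated asm; they emit no fence instruction and do nothing about persistence order.

```c
#include <stdatomic.h>

int payload;
int ready;

/* C11 variant: a signal fence is a compiler-only barrier. */
void publish_c11(int v)
{
    payload = v;
    atomic_signal_fence(memory_order_release); /* keep the payload store first in the asm */
    ready = 1;
}

/* GNU C variant: an empty asm statement with a "memory" clobber. */
void publish_gnu(int v)
{
    payload = v;
    __asm__ __volatile__("" ::: "memory");     /* compiler may not move memory accesses across this */
    ready = 1;
}
```

The `__asm__` spelling is used instead of `asm` so the sketch also compiles in strict ISO C modes of GCC and Clang.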
