Optimization of fenced memory stores on x86 CPU

Question

mov 0x0ff, 10
sfence 
mov 0x0ff, 12
sfence

Can it be executed by an x86 CPU as:

 mov 0x0ff, 12
 sfence

?

Both `sfence` instructions are redundant (related: [Does SFENCE prevent the Store Buffer hiding changes from MESI?](https://stackoverflow.com/questions/32681826/does-sfence-prevent-the-store-buffer-hiding-changes-from-mesi)). But even without them, I *think* it would be possible for another thread to observe the `10` sometimes. There's some evidence of merging in the store queue before stores commit to L1D, though. (But I can't find the SO answer or comments about that). — Peter Cordes, Mar 10 '18 at 23:52
Found it: [Unexpectedly poor and weirdly bimodal performance for store loop on Intel Skylake](https://stackoverflow.com/questions/47851120/unexpectedly-poor-and-weirdly-bimodal-performance-for-store-loop-on-intel-skylak) has some evidence that adjacent stores to the same cache line are coalesced / merged in the store buffer and commit as one update. If this happens for stores to the *same* location, then the `10` might never be committed to L1d. I don't know if `sfence` would prevent that or not; I think on paper it doesn't have to, but it might stop the merging on actual CPUs. — Peter Cordes, Mar 11 '18 at 00:12
@PeterCordes `sfence` drains the store buffer (according to Intel), so it should prevent the merging. It's redundant for ordering but I don't think it is for visibility. — Margaret Bloom, Mar 11 '18 at 10:25
@eugene - `sfence` is not a no-op because it fences non-temporal stores, which are not normally ordered with respect to each other or to regular stores. — BeeOnRope, Jul 09 '18 at 18:15
@margaret - I have also read the Intel doc where they list `sfence` as draining the store buffer, but I find it hard to believe (at least in the sense of draining the store buffer synchronously before retirement), since it would seem to imply that `sfence; lfence` is equivalent to `mfence`. Intel is explicit that it isn't, and it executes considerably faster, so as a practical matter I don't think it is equivalent. I feel like that language is left over from an earlier time and can't be relied upon, although I'm admittedly unclear about it. — BeeOnRope, Jul 09 '18 at 18:20
@peter - even with perfect merging at the store buffer level, you could always observe a 10 now and then on most operating systems, due to a context switch that falls after the first store but before the second. Such merging makes observations of 10 much less likely, however, and perhaps nearly impossible in some scenarios such as high-priority threads or cooperative threading. — BeeOnRope, Jul 09 '18 at 18:23
@PeterCordes - keep in mind that both `sfence`s aren't necessarily redundant, since they still serve as `sfence` to code before/after this sequence (i.e., fencing NT stores). You could remove one unconditionally, but not the second unless you knew the surrounding code didn't rely on it. — BeeOnRope, Jul 09 '18 at 19:29
@BeeOnRope thank you for the comment, I feel like an idiot. *Intel Software Development Manual* even says: `SFENCE instructions cannot pass earlier writes` — Eugene, Aug 13 '18 at 09:20

BeeOnRope · Accepted Answer · 2018-07-09T19:40:00.100

Yes, it is possible that some CPU could execute it as you propose.

Even if you put a stronger fence like `mfence` in there, or use locked instructions, there is certainly no guarantee that the first write isn't optimized away.

This is true in general: the ordering and fencing rules basically tell you which executions are disallowed and hence guaranteed never to occur. For the complementary set of executions that are allowed to occur, there is usually no guarantee that any particular one can ever actually be observed.
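
As a rough illustration of that distinction, here is a minimal C++ sketch. It assumes a `std::atomic<int>` named `x` standing in for the location `0x0ff`, and `observe_once` is a hypothetical helper, not anything from the question. The only things the rules guarantee are that a reader never sees a value other than 0, 10, or 12, and that it eventually sees 12. Whether 10 is ever seen is left open. (A C++ compiler is itself allowed to merge the two stores, so this is about what is guaranteed, not about what a particular chip does.)

```cpp
#include <atomic>
#include <cassert>
#include <set>
#include <thread>

// Hypothetical helper: run the two-store sequence once while another thread
// polls, and return the set of distinct values that thread observed.
std::set<int> observe_once() {
    std::atomic<int> x{0};            // stands in for the location 0x0ff
    std::atomic<bool> done{false};
    std::set<int> seen;

    std::thread reader([&] {
        // Poll until the writer reports completion, then read one last time.
        while (!done.load(std::memory_order_acquire))
            seen.insert(x.load(std::memory_order_relaxed));
        seen.insert(x.load(std::memory_order_relaxed));
    });

    x.store(10, std::memory_order_release);   // first store
    x.store(12, std::memory_order_release);   // second store
    done.store(true, std::memory_order_release);

    reader.join();
    // Guaranteed: the final read is ordered after the second store.
    assert(seen.count(12) == 1);
    // Not guaranteed: seen.count(10) may be 0 or 1 on any given run.
    return seen;
}
```

Calling `observe_once()` repeatedly and counting how often the returned set contains 10 gives an empirical feel for how rare the intermediate value is on a given machine, but no number of runs without a 10 proves it can never appear.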

That said, I'm pretty sure that on current x86 chips you'll always be able to observe the occasional 10 value (even if the fences are omitted entirely), despite any store buffer merging since you could occasionally get an interrupt between the two stores, allowing you to read 10.
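
To make the interrupt argument concrete, here is a deliberately simplified toy model in C++. Everything in it is an assumption made for illustration: `ToyCore`, `run`, the merging rule, and the idea that an interrupt simply drains all pending stores. Real store buffers are far more complex, and whether they merge same-address stores is only suggested by the evidence linked in the comments.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model only: a core with a store buffer that merges a new store into
// the youngest pending entry when both target the same address.
struct ToyCore {
    struct Entry { std::uintptr_t addr; int value; };
    std::deque<Entry> store_buffer;   // pending stores, not yet visible
    std::vector<int> committed;       // values that became globally visible

    void store(std::uintptr_t addr, int value) {
        if (!store_buffer.empty() && store_buffer.back().addr == addr) {
            store_buffer.back().value = value;   // merge: older value is lost
            return;
        }
        store_buffer.push_back({addr, value});
    }

    // Commit every pending store in program order.
    void drain() {
        for (const Entry& e : store_buffer) committed.push_back(e.value);
        store_buffer.clear();
    }
};

// Returns the values that other cores could have observed, in commit order.
std::vector<int> run(bool interrupt_between_stores) {
    ToyCore core;
    core.store(0x0ff, 10);
    if (interrupt_between_stores) core.drain();  // assumed effect of an interrupt
    core.store(0x0ff, 12);
    core.drain();
    assert(core.store_buffer.empty());
    return core.committed;
}
```

In this model `run(false)` commits only 12, while `run(true)` commits 10 and then 12, which is the sense in which an occasional interrupt would let a reader catch the 10 even if merging were perfect.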

Still, it's not guaranteed: one could certainly imagine a dynamically optimizing x86 architecture like Denver or Transmeta condensing the above sequence by removing both fences and the first store, making 12 the only observable value.

edited Jul 09 '18 at 19:40

answered Jul 09 '18 at 18:12

BeeOnRope

Thanks! Why does `mfence` cause a context switch? Is it because it takes relatively many CPU cycles, so a context switch is very likely to occur? However, `mfence` is a very good moment for a context switch :). AFAIK there is no way for the OS to tell the CPU: if `mfence` occurs, please let me know by an interrupt. – Gilgamesz Jul 09 '18 at 18:53
"could condense the above sequence removing both fences and the first store" I cannot imagine how a CPU can remove **both** fences. If I, as a developer, put a fence, I expect that it won't be removed. And there is a difference between. – Gilgamesz Jul 09 '18 at 18:55
@Gilgamesz - `mfence` doesn't cause a context switch and I don't think I implied that. I am just saying that even with a "heavier" fence, there is no guarantee of ever observing the 10 on all past, present and future x86 chips. Of course, if `mfence` takes a lot of cycles, perhaps a context switch is more likely near an `mfence` instruction, but nothing in my answer relies on that and I'm not totally following your discussion of that effect. – BeeOnRope Jul 09 '18 at 19:34
Or a dynamic recompiler like QEMU running x86 code on ARM or PowerPC, or on SPARC, which has a similar memory model. Or after running a binary-to-binary optimizer. – Peter Cordes Jul 09 '18 at 19:35
@Gilgamesz - the CPU could remove the fences in the same way that a compiler reorganizes and eliminates various bits of code, as long as it preserves the guaranteed semantics of the code. Most CPUs don't do much of this type of "optimization", since they don't really have a compilation phase (but even then you can see hints of this stuff, e.g., with mov elimination, unrolling in the loop buffer, etc). However, there have been binary-recompilation x86 CPUs like nVidia's Denver and the Transmeta CPUs that did compile from x86 machine code into another CPU-specific instruction set. – BeeOnRope Jul 09 '18 at 19:36
Such CPUs definitely make a variety of "compiler-like" optimizations. Note that removing the second fence needs analysis of the surrounding code: the `sfence`s are redundant for the stores you show, but if there are NT stores before/after this segment, they may have an effect, so they can _often, but not always_ be removed. See also [this comment](https://stackoverflow.com/questions/49214151/optimization-of-fenced-memory-stores-on-x86-cpu#comment89484372_49214151). – BeeOnRope Jul 09 '18 at 19:38
Even with fences, there's no guarantee that on any single execution of the sequence, the core loses the cache line to a reader between committing the `10` and the `12`. I guess with enough cores spamming RFOs at this core, you could make it very likely. But still, interesting point that an interrupt could defeat store merging. – Peter Cordes Jul 09 '18 at 19:39
OK, on x86 `sfence` is pointless unless there are no NT stores before; after all, a store is a full barrier on x86. – Gilgamesz Jul 09 '18 at 19:45
@PeterCordes - of course there is never a guarantee you will _always_ see a 10, but I didn't think we were discussing that. I was discussing whether there is a guarantee that another thread will _ever_ (alternately, will ever after enough attempts) see a 10. The answer to both questions is "no guarantee", but the first one is very obvious I think. The second version, however, on lots of today's hardware, might never result in a 10 if there are no interrupts (e.g., if merging actually exists and happens deterministically enough, and isn't defeated by the `sfence`). – BeeOnRope Jul 09 '18 at 19:47
@Gilgamesz - actually the opposite: `sfence` is useless unless there _are_ NT stores before or after. Also, store is not a full fence at all (this would be very expensive, like `mfence` which takes ~30 cycles): it has _release_ semantics on x86. – BeeOnRope Jul 09 '18 at 19:48
Oh, yes. Obviously on x86 a store has release semantics and a load has acquire semantics. By full barrier I meant that no store/load can be reordered with a following store. "actually the opposite: sfence is useless unless there are NT stores before or after" Yes. I come from East Europe and double negation is not logical here :). So, `sfence` is pointless if there are no NT stores before/after ;), yes? – Gilgamesz Jul 09 '18 at 19:59
Correct about NT stores (your last sentence). Even in English double-negation is "problematic" so I try to avoid it, but often forget :). Full barrier generally means that all loads/stores cannot be reordered, in either direction, across it - but the effect of a release store is much weaker: it only prevents earlier loads or stores from reordering across it, not _later_ ones. That's why we don't call it a full barrier. Release semantics are enough for many algorithms, but not enough for many others. – BeeOnRope Jul 09 '18 at 20:02
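
The release versus full-barrier distinction in the last comment can be sketched with the classic store-then-load litmus test. This is a minimal C++ sketch under stated assumptions: `store_then_load` is a hypothetical helper, and the memory orders are passed in so the same code can be run both ways. With release stores and acquire loads, each thread's later load may be satisfied before its own earlier store is visible, so both threads can read 0. With `std::memory_order_seq_cst` on every access, the C++ memory model forbids that outcome (on x86 compilers typically get this by emitting `xchg`, or `mov` plus `mfence`, for the store).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical litmus helper: each thread stores 1 to its own flag and then
// loads the other thread's flag. Returns the two loaded values.
std::pair<int, int> store_then_load(std::memory_order store_order,
                                    std::memory_order load_order) {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    std::thread t1([&] {
        x.store(1, store_order);
        r1 = y.load(load_order);   // a later load: not held back by a release store
    });
    std::thread t2([&] {
        y.store(1, store_order);
        r2 = x.load(load_order);
    });
    t1.join();
    t2.join();

    assert(r1 != -1 && r2 != -1);  // both threads ran to completion
    return {r1, r2};
}
```

With `(std::memory_order_release, std::memory_order_acquire)` the result `{0, 0}` is permitted, though it may take many runs to show up. With `(std::memory_order_seq_cst, std::memory_order_seq_cst)` at least one of the two values is always 1.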
