I want to know how the CPU uses `mfence` to enforce sequential consistency. Can anyone explain?
2 Answers
For sequential consistency for aligned loads and stores, it is sufficient on x86 to follow every store with an `mfence` instruction. It's not necessary, however: a more aggressive approach only needs to ensure that an `mfence` instruction appears between every possible pair of store and subsequent load instructions. For example, a series of store instructions not interrupted by a load wouldn't need any `mfence` except after the final store.
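As a rough illustration of that "barrier between a store and a subsequent load" rule, here is a minimal C++11 sketch of the classic store-buffering litmus test (the names `flag0`, `flag1`, `thread0`, and `thread1` are invented for this example). On x86 the `seq_cst` fence typically compiles to `mfence` (or an equivalent locked instruction); with it in place, both threads observing 0 is forbidden, whereas without it the store buffer can let each load run ahead of the other thread's store:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> flag0{0}, flag1{0};
int r0, r1;

void thread0() {
    flag0.store(1, std::memory_order_relaxed);            // store
    std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence-like barrier between store and load
    r0 = flag1.load(std::memory_order_relaxed);           // subsequent load
}

void thread1() {
    flag1.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = flag0.load(std::memory_order_relaxed);
}

int main() {
    std::thread t0(thread0), t1(thread1);
    t0.join();
    t1.join();
    std::printf("r0=%d r1=%d\n", r0, r1);  // with the fences, r0 == 0 && r1 == 0 cannot happen
}
```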
If you want to do a compound operation (like incrementing a value) atomically, you need more than `mfence` - you need a locked instruction such as `lock inc`. This also implies the same barrier as `mfence`, so no additional barrier is needed in this case.
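A short sketch of the compound-operation case using C++11 atomics (the `counter` variable is hypothetical): `fetch_add` compiles to a locked read-modify-write instruction on x86 (e.g. `lock add` or `lock xadd`), which is both atomic and a full barrier, so no separate `mfence` is emitted:

```cpp
#include <atomic>

std::atomic<int> counter{0};

void increment() {
    // One locked instruction provides both the atomic increment and the full barrier;
    // wrapping a plain `counter = counter + 1` in mfence would NOT make it atomic.
    counter.fetch_add(1, std::memory_order_seq_cst);
}
```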
In practice, `mfence` may not be the ideal choice to enforce sequential consistency even for plain stores, because its performance seems to be worse than that of a locked operation, so for example `lock xchg` can be used in its place.
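A rough sketch of the two code-gen strategies for a sequentially consistent store (the variable name `ready` is made up; which form a given compiler emits, and which is faster, depends on the compiler and microarchitecture, as the comments below discuss):

```cpp
#include <atomic>

std::atomic<int> ready{0};

void publish_store_plus_fence() {
    // gcc has historically compiled this as: mov [ready], 1  followed by  mfence
    ready.store(1, std::memory_order_seq_cst);
}

void publish_with_xchg() {
    // xchg with a memory operand is implicitly locked, so it is both the store
    // and the full barrier in one instruction; the returned old value is discarded.
    ready.exchange(1, std::memory_order_seq_cst);
}
```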

- Unfortunately(?) gcc compiles C++11 `atomic_var = 1` to `mov dword [atomic_var], 1` / `mfence`, not `mov eax, 1` / `xchg [atomic_var], eax`. IDK if compiler devs have tested this, or what benchmarks / microbenchmarks they used to decide on this code-gen strategy. With `-mno-sse`, it will indeed use `xchg`, I think. – Peter Cordes May 23 '18 at 01:49
- @PeterCordes - yeah I remembered that. My comment perhaps applied to people who were doing this stuff before C++ atomics, like Java, which definitely used the "dummy locked op" approach when the operation didn't inherently need an atomic op anyway. I'm also not sure why gcc compiles it that way. It could be entirely false that `mfence` is actually slower today: those results are mostly based on back-to-back tests, and it could be that in more fence-sparse code it is faster (e.g., it is fewer uops). I edited the answer to say "may not be ideal" rather than "isn't usually used". – BeeOnRope May 23 '18 at 01:53
Basically it does something just short of flushing the current instruction queue of any operations that read or write memory, and stopping any new instructions that read or write memory from being processed until the flush completes.
In practice, various parts of the instruction-processing pipeline (decoding, scheduling, address calculation, page management, etc.) can still proceed so long as memory is not being modified, and reads and writes to registers are still allowed, so it isn't as bad as a full flush.
As to how they make it happen in the silicon... dunno.

- Not just flushing the *instruction* queue, also the store buffer (but this can happen OoO). See [Is a memory barrier an instruction that the CPU executes, or is it just a marker?](https://stackoverflow.com/q/42714599), and [Does an x86 CPU reorder instructions?](https://stackoverflow.com/q/50307693), and [Does a memory barrier acts both as a marker and as an instruction?](https://stackoverflow.com/q/50338253). `mfence` isn't serializing on the instruction stream, only on memory operations, and it makes sure the stores have committed to L1d, not just executed locally in the instruction stream. – Peter Cordes May 22 '18 at 23:31