
As far as I have understood, mfence is a hardware memory barrier while asm volatile ("" : : : "memory") is a compiler barrier. But can asm volatile ("" : : : "memory") be used in place of mfence?
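(For concreteness, here is a minimal sketch of the two constructs being compared; the wrapper function names are just for illustration and not from any library.)

```c
/* Compiler barrier only: emits no machine instruction, but stops GCC from
   reordering or caching memory accesses across this point. */
static inline void compiler_barrier(void)
{
    asm volatile ("" : : : "memory");
}

/* Hardware barrier: MFENCE orders the CPU's loads and stores, and the
   "memory" clobber additionally acts as a compiler barrier. */
static inline void full_fence(void)
{
    asm volatile ("mfence" : : : "memory");
}
```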

The reason I got confused is this link

– Neal
  • What CPU are you compiling for? x86/x64? – Peter Ritchie Aug 29 '12 at 17:36
  • I am using both x86 and x64. Should the answer vary for x86 and x64 machines? – Neal Aug 30 '12 at 04:57
  • Well, a memory barrier is only needed on architectures that have weak memory ordering. x86 and x64 don't have weak memory ordering. on x86/x64 all stores have a release fence and all loads have an acquire fence. so, you should only really need asm volatile ("" : : : "memory") – Peter Ritchie Aug 30 '12 at 15:03
  • Regarding "on x86/x64 all stores have a release fence and all loads have an acquire fence": can you point me to some relevant docs on this? You can also make that an answer and I will accept it, as that will answer my question :) – Neal Aug 30 '12 at 19:00

3 Answers

Well, a memory barrier is only needed on architectures that have weak memory ordering. x86 and x64 don't have weak memory ordering. on x86/x64 all stores have a release fence and all loads have an acquire fence. so, you should only really need asm volatile ("" : : : "memory")

For a good overview of both Intel and AMD, as well as references to the relevant manufacturer specs, see http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

Generally things like "volatile" are used on a per-field basis where loads and stores to that field are natively atomic. Where loads and stores to a field are already atomic (i.e. the "operation" in question is a load or a store to a single field and thus the entire operation is atomic) the volatile field modifier or memory barriers are not needed on x86/x64. Portable code notwithstanding.
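A hedged sketch of that single-field case (the variable and function names are invented): on x86/x64 an aligned, word-sized store or load is itself atomic, so nothing extra is needed just to write or read this one value.

```c
volatile int ready;                      /* aligned, word-sized field */

void publish(void)   { ready = 1; }      /* plain store: atomic on x86/x64 */
int  check_ready(void) { return ready; } /* plain load: atomic on x86/x64 */
```

Portable code would reach for C11 atomics (e.g. atomic_int) instead, which is what the "portable code notwithstanding" caveat hints at.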

When it comes to "operations" that are not atomic--e.g. loads or stores to a field that is larger than a native word, or loads or stores to multiple fields within an "operation"--a means by which the operation can be viewed as atomic is required regardless of CPU architecture. Generally this is done by means of a synchronization primitive like a mutex. Mutexes (the ones I've used) include memory barriers to avoid issues like processor reordering, so you don't have to add extra memory barrier instructions. I generally consider not using synchronization primitives a premature optimization; but, as the saying goes, premature optimization applies about 97% of the time :)
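A sketch of that mutex approach using POSIX threads (the struct and field names are made up for illustration):

```c
#include <pthread.h>

struct counters {
    pthread_mutex_t lock;   /* must be initialized, e.g. with pthread_mutex_init() */
    long total;             /* two fields that must stay consistent with */
    long count;             /* each other (a multi-field invariant)      */
};

void add_sample(struct counters *c, long value)
{
    pthread_mutex_lock(&c->lock);   /* lock/unlock already contain the
                                       necessary memory barriers */
    c->total += value;
    c->count += 1;
    pthread_mutex_unlock(&c->lock);
}
```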

Where you don't use a synchronization primitive and you're dealing with a multi-field invariant, memory barriers that ensure the processor does not reorder stores and loads to different memory locations are important.

Now, regarding not issuing an "mfence" instruction in asm volatile but using "memory" in the clobber list: from what I've been able to read,

If your assembler instructions access memory in an unpredictable fashion, add `memory' to the list of clobbered registers. This will cause GCC to not keep memory values cached in registers across the assembler instruction and not optimize stores or loads to that memory.

When they say "GCC" and don't mention anything about the CPU, this means it applies only to the compiler. The lack of "mfence" means there is no CPU memory barrier. You can verify this by disassembling the resulting binary: if no "mfence" instruction is issued (depending on the target platform), then it's clear the CPU is not being told to issue a memory fence.
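For example, compiling a small file like the following with gcc -O2 -S (the contents are just an illustrative sketch) shows no fence instruction in the first function, while "mfence" appears in the second:

```c
int a, b;

void with_compiler_barrier(void)
{
    a = 1;
    asm volatile ("" : : : "memory");       /* nothing emitted here in the .s output */
    b = 1;
}

void with_mfence(void)
{
    a = 1;
    asm volatile ("mfence" : : : "memory"); /* "mfence" appears in the .s output */
    b = 1;
}
```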

Depending on the platform you're on and what you're trying to do, there may be something "better" or more clear... portability notwithstanding.

– Peter Ritchie
  • +1 This is 99.9% correct, with the exception that stores to _different_ locations are unordered in a multi-core system (if you need this, you need MFENCE). However, this is usually a "Yeah WTF... who cares?" thing. Instructions on the same core are always realized in the order they execute anyway, and loads/stores on different cores to the same location have the guarantees as you've described. – Damon Aug 30 '12 at 21:15
  • @Peter thank you for posting this link. I had referred to it earlier as well, and my doubt originated from the Peterson lock problem. The author mentions "Loads may be reordered with older stores to different locations", which may break the implementation of Peterson's algorithm, and an mfence would be required to correctly implement it. But would "asm volatile" be sufficient too? It is just a compiler barrier, and as mentioned on Wikipedia too (http://en.wikipedia.org/wiki/Memory_ordering#Compiler_memory_barrier), these barriers prevent compiler reordering, not CPU reordering. – Neal Aug 31 '12 at 10:07
  • @neal I had assumed you were talking about a single field that you wanted to synchronize across threads/cpus... I've added some detail to my answer w.r.t. non-atomic "operations". – Peter Ritchie Aug 31 '12 at 14:43
  • Will `MFENCE` prevent out-of-order execution on the CPU? Often I see quoted use of a serializing call like `cpuid` before `rdtsc`, when your CPU doesn't support `rdtscp`, to prevent reordering around the call to `rdtsc`. Would using `MFENCE` have the same effect as `cpuid`? – Steve Lorimer Sep 28 '12 at 05:54
  • `MFENCE` is full memory barrier (a combination of SFENCE and LFENCE), it "Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction." – Peter Ritchie Oct 01 '12 at 16:05
  • This answer is wrong. Sure, x86 has "relatively strong" ordering, so in _some cases_ hardware memory barriers aren't needed, but it certainly has reordering: later loads can be reordered with earlier stores, and a CPU can see its stores out of order with respect to the stores of other CPUs. Those are enough to break many concurrent algorithms, so the "strong memory ordering" isn't "strong enough" usually, and `asm volatile ("" : : : "memory")` isn't going to cut it except in some cases. That's why x86 has `mfence` and why `lock` instructions are special (beyond atomicity). – BeeOnRope Jun 10 '18 at 01:51
  • Please detail where it's wrong @BeeOnRope. Other than what GCC documents what `volatile("" : : : "memory")` does (in which the answer isn't wrong, or the documentation is wrong), I don't see where we disagree. – Peter Ritchie Jun 11 '18 at 19:53
  • The very first paragraph is all wrong: _Well, a memory barrier is only needed on architectures that have weak memory ordering. x86 and x64 don't have weak memory ordering. on x86/x64 all stores have a release fence and all loads have an acquire fence. so, you should only really need asm volatile ("" : : : "memory")_ This implies that x86 doesn't need (hardware) memory barriers, only compiler barriers like `volatile("" : : : "memory")`, right? That's just wrong. x86 memory ordering is relatively strong, but it definitely re-orders _all the time_ and hardware barriers are needed _all the time_. – BeeOnRope Jun 11 '18 at 19:58
  • The other answers are right, you can read those. Note, there are plenty of things that are correct in your answer (yes, x86 stores and loads do have release and acquire semantics, respectively) - but the conclusion you make doesn't follow! – BeeOnRope Jun 11 '18 at 19:59
  • @BeeOnRope discussions on these topics always seem to go round and round... :/ I didn't imply that *x86 doesn't need (hardware) memory barriers*. I also did not imply that there was no re-ordering. In later paragraphs, I detail the need for things that cause memory barriers (i.e. application-level invariants), *due to re-ordering*. x86 does re-order, but it re-orders accesses to *different* addresses. You don't need a memory barrier to ensure any load instruction of a memory location sees all previously encountered store instructions. Unless you have a reference that details different information... – Peter Ritchie Jun 11 '18 at 20:33
  • @BeeOnRope See https://web.archive.org/web/20081203074632/http://www.intel.com/products/processor/manuals/318147.pdf for the reference of the first link – Peter Ritchie Jun 11 '18 at 20:33
  • You didn't just imply it, you explicitly stated it in your first paragraph: _on x86/x64 all stores have a release fence and all loads have an acquire fence. so, **you should only really need asm volatile ("" : : : "memory")**_. Am I reading that wrong? x86 most definitely needs hardware memory barriers for many non-trivial concurrent algorithms, just like other architectures, although sure the set of possible re-orderings is different so in some cases you don't need a barrier on x86 that you might need on PowerPC or ARM or whatever, but that's not what you are saying, is it? – BeeOnRope Jun 11 '18 at 20:36
  • I'm aware of that old paper, and Vol3 of the current SDM covers it reasonably well (but IMO [something like this](https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf) is even better) - but we aren't really arguing about the specifics of the x86 memory model, are we? I think we are (at least I am) just arguing that your first and most important claim in the first paragraph in this accepted answer is just straight up wrong. I don't even need to read the rest in detail since the overall principle that "x86 is strong, you can get away with compiler barriers" is already false... – BeeOnRope Jun 11 '18 at 20:41
  • @SteveLorimer: it turns out that unfortunately yes, `mfence` *does* block out-of-order execution on Skylake at least, as well as memory reordering. Only the memory-fence effect is required, the OoO exec barrier is an implementation detail. (Use `lfence` if you want that.) See [Are loads and stores the only instructions that gets reordered?](https://stackoverflow.com/a/50496379) for details. – Peter Cordes Sep 15 '18 at 08:20
  • asm volatile ("" ::: "memory") is just a compiler barrier.
  • asm volatile ("mfence" ::: "memory") is both a compiler barrier and a CPU barrier (MFENCE).
  • __sync_synchronize() is also a compiler barrier and a full memory barrier.

So asm volatile ("" ::: "memory") will not, by itself, prevent the CPU from reordering instructions on independent data. As pointed out, x86-64 has a strong memory model, but StoreLoad reordering is still possible. If a full memory barrier is needed for your algorithm to work, then you need __sync_synchronize().
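For instance, a Dekker/Peterson-style store-then-load handshake (a sketch with invented variable names) needs that full barrier on x86-64, and __sync_synchronize() provides it (GCC typically emits an mfence or an equivalent locked instruction for it):

```c
int flag0, flag1;   /* one flag per thread */

int thread0_try_enter(void)
{
    flag0 = 1;
    __sync_synchronize();   /* full barrier: the store to flag0 must become
                               globally visible before flag1 is loaded */
    return flag1 == 0;      /* without the barrier, the CPU could perform this
                               load before the store (StoreLoad reordering) */
}
```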

– RubenLaguna

There are two kinds of reordering: compiler reordering and CPU reordering.

x86/x64 has a relatively strong memory model, but on x86/x64 StoreLoad reordering (later loads passing earlier stores) CAN happen; see http://en.wikipedia.org/wiki/Memory_ordering

  • asm volatile ("" ::: "memory") is just a compiler barrier.
  • asm volatile ("mfence" ::: "memory") is both a compiler barrier and CPU barrier.

That means that with only a compiler barrier you can prevent compiler reordering, but you cannot prevent CPU reordering. In other words, there is no reordering when the source code is compiled, but reordering can still happen at run time.

So, which one to use depends on your needs.
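As a rough sketch of that difference (invented names; only the asm statements come from the list above), here is the store-then-load pattern again: the compiler barrier alone only fixes the generated instruction order, while the mfence variant also keeps the CPU from performing the load early:

```c
int want0, want1;

int try_enter(void)
{
    want0 = 1;
    /* asm volatile ("" ::: "memory");        compile-time ordering only */
    asm volatile ("mfence" ::: "memory");  /* run-time (CPU) ordering as well */
    return want1 == 0;
}
```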

– Derek Zhang