21

I read the "Intel Optimization Guide for Intel Architecture".

However, I still have no idea when I should use

_mm_sfence()
_mm_lfence()
_mm_mfence()

Could anyone explain when these should be used when writing multi-threaded code?

Peter Cordes
prgbenz
  • @BeeOnRope: I updated / retagged this question to ask what I think the real question was: about these intrinsics in multi-threaded code (the original tags included [tag:parallel-processing].) There are lots of Q&As about the machine instructions, but this one is different because C++'s mem model is weak. You want a way to do an acquire-load or release-store *without* making the compiler emit a useless `lfence` or `sfence`, just stopping compile-time reordering. (http://preshing.com/20120625/memory-ordering-at-compile-time/). Of course in 2018, just use C11 stdatomic / C++11 std::atomic. – Peter Cordes Jun 09 '18 at 16:02
  • @PeterCordes So you think this question is about compiler barriers in a way? That is, a good answer might be along the lines of `lfence` and `sfence` instructions are generally useless at the x86 assembly level, but you might want to insert a compiler barrier to prevent compiler reorderings? BTW, I don't know of finer-grained-than-full compiler-barriers for most compilers, but MSVC does have `_[Read|Write]Barrier`. I guess you could invent some types of barriers with inline asm and clever use of constraints. – BeeOnRope Jun 09 '18 at 16:46
  • `std::atomic_signal_fence(std::memory_order_release)` with gcc does seem to order even non-atomic variables, but that may be an implementation detail. I haven't looked under the hood. – Peter Cordes Jun 09 '18 at 16:49
  • @PeterCordes - it is supposed to order non-atomic variables, isn't it? Just like most of the `mo_` orders on atomic variables also order somehow the surrounding non-atomic accesses. For fences, ordering of non-atomic variables is the _main_ purpose, I think. Maybe I didn't understand what you meant... – BeeOnRope Jun 09 '18 at 16:56

4 Answers

8

If you're using NT stores, you might want _mm_sfence or maybe even _mm_mfence. The use-cases for _mm_lfence are much more obscure.

If not, just use C++11 std::atomic and let the compiler worry about the asm details of controlling memory ordering.


x86 has a strongly-ordered memory model, but C++ has a very weak memory model (same for C). For acquire/release semantics, you only need to prevent compile-time reordering. See Jeff Preshing's Memory Ordering At Compile Time article.

_mm_lfence and _mm_sfence do have the necessary compiler-barrier effect, but they will also cause the compiler to emit a useless lfence or sfence asm instruction that makes your code run slower.

There are better options for controlling compile-time reordering when you aren't doing any of the obscure stuff that would make you want sfence.

For example, GNU C/C++ asm("" ::: "memory") is a compiler barrier (all values have to be in memory matching the abstract machine because of the "memory" clobber), but no asm instructions are emitted.
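As a minimal sketch (the function and variable names here are mine, and plain non-atomic flags like this are formally a data race in ISO C++; it's shown only to illustrate the barrier itself):

// Publish data with only a compile-time barrier: no fence instruction
// is emitted; x86's strong hardware ordering handles run-time order.
void publish(int *data, int *ready) {
    *data = 42;
    asm("" ::: "memory");   // compiler barrier, zero asm instructions
    *ready = 1;
}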

If you're using C++11 std::atomic, you can simply do shared_var.store(tmp, std::memory_order_release). That's guaranteed to become globally visible after any earlier C assignments, even to non-atomic variables.
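As a sketch (shared_var and payload are illustrative names):

#include <atomic>

int payload;                  // non-atomic data being published
std::atomic<int> shared_var;

void writer(int tmp) {
    payload = tmp;            // ordinary assignment
    // On x86 this compiles to a plain mov store; the memory_order_release
    // only has to stop compile-time reordering.
    shared_var.store(tmp, std::memory_order_release);
}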

_mm_mfence is potentially useful if you're rolling your own version of C11 / C++11 std::atomic, because an actual mfence instruction is one way to get sequential consistency, i.e. to stop later loads from reading a value until after preceding stores become globally visible. See Jeff Preshing's Memory Reordering Caught in the Act.
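A sketch of the experiment from that article (names are illustrative; thread1 and thread2 run concurrently on different cores):

#include <emmintrin.h>   // _mm_mfence

volatile int X, Y;
int r1, r2;

void thread1() {
    X = 1;
    _mm_mfence();   // StoreLoad barrier: the load of Y below can't happen
    r1 = Y;         // until the store to X is globally visible
}

void thread2() {
    Y = 1;
    _mm_mfence();
    r2 = X;
}

With the fences in place, the outcome r1 == 0 && r2 == 0 becomes impossible; without them, x86's store buffer allows it.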

But note that mfence seems to be slower on current hardware than using a locked atomic-RMW operation: e.g., xchg [mem], eax is also a full barrier, but runs faster, and does a store. On Skylake, the way mfence is implemented prevents out-of-order execution of even non-memory instructions following it. See the bottom of this answer.

In C++ without inline asm, though, your options for memory barriers are more limited (How many memory barriers instructions does an x86 CPU have?). mfence isn't terrible, and it is what gcc and clang currently use to do sequential-consistency stores.
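For example, this is the code-gen in question (a sketch; x is an illustrative name):

#include <atomic>

std::atomic<int> x;

void sc_store(int v) {
    // gcc and clang currently compile this to:  mov [x], v  +  mfence
    // An xchg [x], v would give the same full-barrier ordering and, as
    // noted above, often runs faster.
    x.store(v, std::memory_order_seq_cst);
}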

Seriously, just use C++11 std::atomic or C11 stdatomic if possible, though; it's easier to use and you get quite good code-gen for a lot of things. Or in the Linux kernel, there are already wrapper functions around inline asm for the necessary barriers. Sometimes that's just a compiler barrier, sometimes it's also an asm instruction to get stronger run-time ordering than the default (e.g. for a full barrier).


No barriers will make your stores appear to other threads any faster. All they can do is delay later operations in the current thread until earlier things happen. The CPU already tries to commit pending non-speculative stores to L1d cache as quickly as possible.


_mm_sfence is by far the most likely barrier to actually use manually in C++

The main use-case for _mm_sfence() is after some _mm_stream stores, before setting a flag that other threads will check.

See Enhanced REP MOVSB for memcpy for more about NT stores vs. regular stores, and x86 memory bandwidth. For writing very large buffers (larger than L3 cache size) that definitely won't be re-read any time soon, it can be a good idea to use NT stores.

NT stores are weakly-ordered, unlike normal stores, so you need sfence if you care about publishing the data to another thread. If not (you'll eventually read them from this thread), then you don't. Or if you make a system call before telling another thread the data is ready, that's also serializing.

sfence (or some other barrier) is necessary to give you release/acquire synchronization when using NT stores. C++11 std::atomic implementations leave it up to you to fence your NT stores, so that atomic release-stores can be efficient.

#include <atomic>
#include <immintrin.h>

struct bigbuf {
    int buf[100000];
    std::atomic<unsigned> buf_ready;
};

void producer(bigbuf *p) {
  __m128i *buf = (__m128i*) (p->buf);

  for(...) {
     ...
     _mm_stream_si128(buf,   vec1);
     _mm_stream_si128(buf+1, vec2);
     _mm_stream_si128(buf+2, vec3);
     ...
  }

  _mm_sfence();    // All weakly-ordered memory shenanigans stay above this line
  // So we can safely use normal std::atomic release/acquire sync for buf
  p->buf_ready.store(1, std::memory_order_release);
}

Then a consumer can safely do if(p->buf_ready.load(std::memory_order_acquire)) { foo = p->buf[0]; ... } without any data-race Undefined Behaviour. The reader side does not need _mm_lfence; the weakly-ordered nature of NT stores is confined entirely to the core doing the writing. Once it becomes globally visible, it's fully coherent and ordered according to the normal rules.
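Spelled out as a function (a sketch using the struct above; foo is illustrative):

void consumer(bigbuf *p) {
    if (p->buf_ready.load(std::memory_order_acquire)) {
        int foo = p->buf[0];   // safe: the acquire load synchronizes-with
                               // the release store in producer()
        // ... read the rest of the buffer; no _mm_lfence needed ...
    }
}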

Other use-cases include ordering clflushopt to control the order of data being committed to memory-mapped non-volatile storage (e.g. NVDIMMs using Optane memory, or DIMMs with battery-backed DRAM, both of which exist now).


_mm_lfence is almost never useful as an actual load fence. Loads can only be weakly ordered when loading from WC (Write-Combining) memory regions, like video ram. Even movntdqa (_mm_stream_load_si128) is still strongly ordered on normal (WB = write-back) memory, and doesn't do anything to reduce cache pollution. (prefetchnta might, but it's hard to tune and can make things worse.)

TL:DR: if you aren't writing graphics drivers or something else that maps video RAM directly, you don't need _mm_lfence to order your loads.

lfence does have the interesting microarchitectural effect of preventing execution of later instructions until it retires, e.g. to stop _rdtsc() from reading the cycle-counter while earlier work is still pending in a microbenchmark. (This always applies on Intel CPUs, but on AMD only with an MSR setting: Is LFENCE serializing on AMD processors? Otherwise lfence runs 4 per clock on Bulldozer-family, so it's clearly not serializing there.)
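A sketch of that microbenchmark pattern (assuming a GNU-compatible compiler for <x86intrin.h>; MSVC has __rdtsc in <intrin.h> instead):

#include <x86intrin.h>   // __rdtsc, _mm_lfence

unsigned long long time_region(void) {
    _mm_lfence();                           // wait for earlier work to retire
    unsigned long long start = __rdtsc();
    _mm_lfence();                           // keep the timed work from starting early
    // ... code under test ...
    _mm_lfence();                           // wait for the timed work to finish
    unsigned long long stop = __rdtsc();
    return stop - start;
}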

Since you're using intrinsics from C/C++, the compiler is generating code for you. You don't have direct control over the asm, but you might possibly use _mm_lfence for things like Spectre mitigation, if you can get the compiler to put it in the right place in the asm output: right after a conditional branch, before a double array access (like foo[bar[i]]). If you're using kernel patches for Spectre, I think the kernel will defend your process from other processes, so you'd only have to worry about this in a program that uses a JIT sandbox and is worried about being attacked from within its own sandbox.
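A hypothetical sketch of that Spectre-v1 pattern (foo, bar, and the bounds check are mine, and nothing guarantees the compiler keeps the lfence exactly where you want it in the emitted asm):

#include <immintrin.h>

int foo[256];
int bar[1024];

int lookup(unsigned i, unsigned len) {
    if (i < len) {
        _mm_lfence();         // stop speculative execution of the loads below
        return foo[bar[i]];   // the double array access from the text
    }
    return 0;
}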

Peter Cordes
  • It is possible that `sfence; lfence`, if `sfence` flushes the store buffer, could make stores appear faster to other threads, by effectively pausing other subsequent load activity that might compete for L1 bandwidth and other resources like LFBs. Even subsequent _store_ activity could compete in this way, although that seems less likely (it depends on the details of RFO prefetching). This is fairly obscure though and seems unlikely to matter much in practice. You could also use `pause`, although it's a lot slower on Skylake+. – BeeOnRope Jun 10 '18 at 21:22
6

Here is my understanding, hopefully accurate and simple enough to make sense:

The IA64 (Itanium) architecture allows memory reads and writes to be executed in any order, so the order of memory changes from the point of view of another processor is not predictable unless you use fences to enforce that writes complete in a reasonable order.

From here on, I am talking about x86; x86 is strongly ordered.

On x86, Intel does not guarantee that a store done on another processor will always be immediately visible on this processor. It is possible that this processor speculatively executed the load (read) just early enough to miss the other processor's store (write). x86 only guarantees that writes become visible to other processors in program order; it does not guarantee that other processors will immediately see any update, no matter what you do.

Locked read/modify/write instructions are fully sequentially consistent. Because of this, you already handle missing the other processor's memory operations in general: a locked xchg or cmpxchg syncs it all up, acquiring the relevant cache line for ownership immediately and updating it atomically. If another CPU is racing with your locked operation, either you win the race and the other CPU misses the cache and gets the line back after your locked operation, or they win the race and you miss the cache and get the updated value from them.
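As a sketch of why lock-based code needs no extra fences (lock_word is an illustrative name):

#include <atomic>

std::atomic<int> lock_word;   // 0 = free, 1 = taken

bool try_lock() {
    int expected = 0;
    // Compiles to lock cmpxchg on x86: acquires the cache line for
    // ownership and updates it atomically, acting as a full barrier.
    return lock_word.compare_exchange_strong(expected, 1);
}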

lfence stalls instruction issue until all instructions before the lfence are completed. mfence specifically waits for all preceding memory reads to be brought fully into the destination register, and waits for all preceding writes to become globally visible, but does not stall all further instructions as lfence would. sfence does the same for only stores, flushes write combiner, and ensures that all stores preceding the sfence are globally visible before allowing any stores following the sfence to begin execution.

Fences of any kind are rarely needed on x86; they are not necessary unless you are using write-combining memory or non-temporal instructions, something you rarely do if you are not a kernel-mode (driver) developer. Normally, x86 guarantees that all stores are visible in program order, but it does not make that guarantee for WC (write-combining) memory or for "non-temporal" instructions that do explicit weakly-ordered stores, such as movnti.

So, to summarize, stores are always visible in program order unless you have used special weakly-ordered stores or are accessing WC memory. Algorithms using locked instructions like xchg, xadd, or cmpxchg will work without fences, because locked instructions are sequentially consistent.

doug65536
  • You normally don't need `lfence` ever. You only need `sfence` [after weakly-ordered `movnt` streaming stores](https://stackoverflow.com/a/44866652/224132). You need `mfence` (or a `lock`ed operation) to get sequential consistency instead of just release/acquire. (See [Memory Reordering Caught in the Act](http://preshing.com/20120515/memory-reordering-caught-in-the-act/) for an example.) – Peter Cordes Jul 02 '17 at 01:39
  • You normally need `lfence` because of the C++ compiler. – Marek Vitek Jul 09 '17 at 20:59
  • `lfence` doesn't discard speculatively executed stores. `lfence` is just an instruction stream serializer: it waits until all previous instructions (of any type, not just memory access) have retired before proceeding, and no later instructions will execute while it is waiting. It is not useful for ordering memory accesses in normal user-mode programs. Its main use there is as an OoO barrier for profiling small regions of code more consistently. `sfence` is similarly not useful except in conjunction with so-called "non-temporal" stores, like `movntq`. – BeeOnRope Jun 08 '18 at 22:33
  • @BeeOnRope Thanks for bringing this answer to my attention, it is from quite a long time ago and I have updated it with hopefully more accurate information. – doug65536 Jun 09 '18 at 14:37
  • You introduced an inaccuracy: `sfence` doesn't stop later stores from *executing* (writing their address+data to the store buffer aka memory-order-buffer), only from making their data globally visible. Also, `lfence` only stops *execution*; it doesn't necessarily stop issue/rename from the front-end into the out-of-order part of the core. (In Intel terminology, sending a uop from the scheduler to an execution unit is called "dispatch", but the other computer-architecture convention is to call that "issue".) – Peter Cordes Jun 09 '18 at 15:48
  • Also, this doesn't fully answer the question, which is asking about the C/C++ intrinsics in C and C++, not just the machine instructions. Stopping compile-time reordering is necessary, but doing so without making the compiler emit `lfence` or `sfence` is not achievable with any of these Intel intrinsics. – Peter Cordes Jun 09 '18 at 15:53
  • @PeterCordes I think `lfence` also does stop issue (Intel terms: i.e., sending ops _to_ the scheduler). Once the uops are in the scheduler, it's too hard to separate them before/after, so it seems (from patents, etc) that `lfence` just stops issue until it retires. So I think renaming stops, but everything before that can keep running and queuing in the IDQ. – BeeOnRope Jun 09 '18 at 17:10
  • @BeeOnRope: That would make sense. I was thinking about whether it's testable. Maybe with a latency bottleneck after a bunch of NOPs, and see if more NOPs reduce throughput. If uops from after an `lfence` are all sitting in the scheduler waiting to be allowed to start, then more uops won't matter unless we create a front-end bottleneck bigger than the dep chain. – Peter Cordes Jun 09 '18 at 17:20
  • @PeterCordes - yup, I think your test would work. You could probably also read the front-end behavior directly via `rdpmc` interleaved with `lfence` and some other instructions to see if instructions can issue across an lfence. – BeeOnRope Jun 09 '18 at 18:53
  • @BeeOnRope What's the difference between a store "executing" and a store being "globally visible"? I think you are splitting hairs. Does the programmer need to know the microarchitectural details of how the pipeline does it? No, they don't. – doug65536 Jun 09 '18 at 20:17
  • @doug65536 - I think maybe you meant to reply to [Peter's comment](https://stackoverflow.com/questions/4537753/when-should-i-use-mm-sfence-mm-lfence-and-mm-mfence/12850294?noredirect=1#comment88558885_12850294), as I didn't mention that? Anyways, the difference is potentially relevant because between execution and global visibility there is a period where the store is visible to the local core but not to other cores (this leads to the "stores are seen in a consistent order by CPUs _other than those who performed the stores_" rule in x86). This difference between execution and ... – BeeOnRope Jun 09 '18 at 20:20
  • @BeeOnRope Oh you're right, it was Peter Cordes that [said that](https://stackoverflow.com/questions/4537753/when-should-i-use-mm-sfence-mm-lfence-and-mm-mfence/12850294?noredirect=1#comment88558885_12850294). Sorry. – doug65536 Jun 09 '18 at 20:21
  • ... global visibility also means that (earlier) store and (later) load reordering on x86 must be allowed for reasonable performance, and it is. So this concept (basically, store buffering and forwarding) pretty much explains _all_ the interesting reordering behavior on x86! All that said, I'm not sure if Peter's distinction is actually programmer visible: perhaps stores actually don't execute until the `sfence` retires, or perhaps different implementations have been used in different CPUs, etc. – BeeOnRope Jun 09 '18 at 20:22
3

The intrinsics you mention all simply insert an sfence, lfence, or mfence instruction at the point where they are called. So the question then becomes: "What are the purposes of those fence instructions?"

The short answer is that lfence is completely useless and sfence almost completely useless for memory-ordering purposes in user-mode x86 programs. On the other hand, mfence serves as a full memory barrier, so you might use it in places where you need a barrier if there isn't already some nearby lock-prefixed instruction providing what you need.

The longer-but-still short answer is...

lfence

lfence is documented to order loads prior to the lfence with respect to loads after, but this guarantee is already provided for normal loads without any fence at all: that is, Intel already guarantees that "loads aren't reordered with other loads". As a practical matter, this leaves the purpose of lfence in user-mode code as an out-of-order execution barrier, useful perhaps for carefully timing certain operations.

sfence

sfence is documented to order stores before and after it, in the same way that lfence does for loads, but, just as with loads, the store order is already guaranteed in most cases by Intel. The primary interesting case where it isn't is the so-called non-temporal stores such as movntdq, movnti, maskmovq, and a few other instructions. These instructions don't play by the normal memory-ordering rules, so you can put an sfence between these stores and any other stores where you want to enforce the relative order. mfence works for this purpose too, but sfence is faster.
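A sketch of that use (names are illustrative; portable code should publish through std::atomic rather than a volatile flag):

#include <emmintrin.h>   // _mm_stream_si32 (movnti), _mm_sfence

int data;
volatile int flag;

void publish_nt(int v) {
    _mm_stream_si32(&data, v);   // non-temporal store: weakly ordered
    _mm_sfence();                // make it globally visible before ...
    flag = 1;                    // ... this ordinary store
}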

mfence

Unlike the other two, mfence actually does something: it serves as a full memory barrier, ensuring that all of the previous loads and stores will have completed¹ before any of the subsequent loads or stores begin execution. This answer is too short to explain the concept of a memory barrier fully, but an example would be Dekker's algorithm, where each thread wanting to enter a critical section stores to a location and then checks to see if the other thread has stored something to its location. For example, on thread 1:

mov   DWORD [thread_1_wants_to_enter], 1  ; store our flag
mov   eax,  [thread_2_wants_to_enter]     ; check the other thread's flag
test  eax, eax
jnz   retry
; critical section

Here, on x86, you need a memory barrier between the store (the first mov) and the load (the second mov); otherwise each thread could see zero when it reads the other's flag, because the x86 memory model allows loads to be reordered with earlier stores. So you could insert an mfence barrier as follows to restore sequential consistency and the correct behavior of the algorithm:

mov   DWORD [thread_1_wants_to_enter], 1  ; store our flag
mfence
mov   eax,  [thread_2_wants_to_enter]     ; check the other thread's flag
test  eax, eax
jnz   retry
; critical section

In practice, you don't see mfence as much as you might expect, because x86 lock-prefixed instructions have the same full-barrier effect, and these are often/always (?) cheaper than an mfence.
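For comparison, here is a sketch of the same flag/check written with C++11 std::atomic (illustrative; the full Dekker turn-taking logic is elided). The default seq_cst ordering makes the compiler emit the barrier, typically an mfence or a lock-prefixed instruction, for you:

#include <atomic>

std::atomic<int> thread_1_wants_to_enter, thread_2_wants_to_enter;

void thread1_enter() {
    thread_1_wants_to_enter.store(1);          // seq_cst store: barrier emitted
    while (thread_2_wants_to_enter.load()) {   // seq_cst load
        // retry / back off
    }
    // critical section
}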


¹ E.g., loads will have been satisfied and stores will have become globally visible (although it could be implemented differently, as long as the visible effect w.r.t. ordering is "as if" that occurred).

BeeOnRope
  • Maybe worth mentioning that the memory-ordering use-case for `lfence` is after loads from video memory, especially with `movntdqa`, or anything else that's mapped WC. So you could say "if you haven't mapped video RAM into your user-space program, you don't need `lfence`". I'm sure people will wonder when it ever is useful; I know I would, so a tiny hint / summary is useful. User-space can map video RAM with the kernel's help... – Peter Cordes Jun 09 '18 at 08:12
  • I'm deliberately trying to keep this a fairly short and direct answer, even if it is perhaps at the cost of not being exhaustively accurate when it comes to every possible `lfence` use. That is, I don't want to make a @PeterCordes-style answer which necessarily covers every possibility and often spends more prose on that than the 99% case (not that this is a problem, I also write such answers - but I don't want it here). Are there user-mode applications that map WC video ram into their address space? Probably, but a very tiny fraction. Are there some of those who need ... – BeeOnRope Jun 09 '18 at 16:36
  • ... load-load ordering (but not other types of ordering) with respect to loads from video RAM and who aren't already using some type of synchronization which provides it? This seems like a small slice of the earlier small slice. Out of that minuscule group, for how many is `lfence` interesting in the sense that it provides any type of improvement over `mfence`? I don't know, but I think it's very small. Out of curiosity have you ever seen `lfence` in a real program dealing with WC reads from video RAM? BTW, if I was going to add another `lfence` use it would be meltdown/spectre mitigation. – BeeOnRope Jun 09 '18 at 16:38
  • I wrote up an answer. And yeah good point about the use-cases for `lfence` being *tiny*. I think you'd want it after reading a "ready" flag before reading WC memory. I've never looked at code for doing stuff with WC memory at all, other than Intel's copy-back-to-WB example. https://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers. The reason I like to mention it is that when *I* was learning stuff, I always figured LFENCE must exist for a reason, and I wanted to know what it was. You can say "ok I'm not using WC memory so I don't care about lfence". – Peter Cordes Jun 10 '18 at 03:30
  • @PeterCordes - looks good. I have also wondered about the purpose of `lfence`. I don't think it is actually explained by "mapping WC memory into user space". It seems to me that these instructions were introduced at a time of "great hope" for non-temporal instructions on WB memory, and perhaps when the memory model wasn't really nailed down and Intel architects still possibly wanted to allow load-load reordering in some circumstances (even outside of NT loads) in WB mode, or perhaps were considering another higher-performance weaker mode, like WB+ that allowed more reorderings. – BeeOnRope Jun 10 '18 at 20:20
  • That kind of didn't pan out: they stuck with a strong model, perhaps just by default since by not defining it very well in the first MP systems, people were probably already relying on existing behaviors (although it took them several iterations to really settle on a model and even today it's hard to read the document). So then I think `lfence` was just kind of orphaned - the WC video RAM case seems unlikely to me since `mfence` serves the same purpose, and such scenarios existed long before `lfence` (indeed, were more common back in DOS and non-protected OSes). This is pure speculation... – BeeOnRope Jun 10 '18 at 20:23
  • That actually makes a lot of sense. That might be why they documented the execution-barrier behaviour so it was useful for *something*, with the memory-barrier use-case being non-existent. (Although AMD *doesn't* make it an execution barrier. IDK how else to explain the 4-per-clock throughput on Bulldozer/Ryzen, and AMD's [insn set manual](https://support.amd.com/TechDocs/24594.pdf) only mentions memory, not all instructions, in the `lfence` entry. So I guess we should be careful writing answers that claim it's always an exec barrier. Ugh.) – Peter Cordes Jun 10 '18 at 22:10
1

Caveat: I'm no expert in this. I'm still trying to learn this myself. But since no one has replied in the past two days, it seems experts on memory fence instructions are not plentiful. So here's my understanding ...

Intel is a weakly-ordered memory system. That means your program may execute

array[idx+1] = something
idx++

but the change to idx may be globally visible (e.g. to threads/processes running on other processors) before the change to array. Placing sfence between the two statements will ensure the order in which the writes are sent to the FSB.

Meanwhile, another processor runs

newestthing = array[idx]

may have cached the memory for array and has a stale copy, but gets the updated idx due to a cache miss. The solution is to use lfence just beforehand to ensure the loads are synchronized.

This article or this article may give better info.

Mark Borgerding
  • No, x86 stores are strongly-ordered by default. Compile-time reordering could produce the reordering you describe (if you fail to use `std::atomic` with `memory_order_release` or stronger), but the stores from the x86 instructions `mov [array + rcx], eax` / `mov [idx], rcx` would become globally visible to other threads in that order. Only `MOVNT` streaming stores are weakly-ordered (so you need `sfence` after them before storing to a `buffer_ready` flag). You normally never need `lfence`, unless you're using weakly-ordered loads from video memory or something. – Peter Cordes Jul 02 '17 at 01:33
  • See also [my answer on a more recent sfence question](https://stackoverflow.com/a/44866652/224132). Also, Jeff Preshing's excellent articles, like this [weak vs. strong memory model](http://preshing.com/20120930/weak-vs-strong-memory-models/) post. (It was written 2 years after you posted this. I'm not intending to be rude about an old answer, but it is almost totally wrong, xD) – Peter Cordes Jul 02 '17 at 01:36
  • @PeterCordes You are completely wrong. 1) you don't need std::atomic to achieve correct order. In fact std::atomic didn't exist until C++11, so it didn't exist at the time the post was made. 2) lfence is used to ensure correct instruction order on consumer side. So if you want to read your buffer after seeing `buffer_ready` flag, then you use lfence to make sure that read from buffer doesn't happen earlier. – Marek Vitek Jul 09 '17 at 20:55
  • @MarekVitek: you only need a compiler barrier, not an `lfence` instruction, as a read memory barrier for x86. Pre-C++11 you could use `asm volatile("" ::: "memory")` (no instructions, but a memory clobber) for GNU C as an equivalent for `atomic_thread_fence(memory_order_acquire)` or for `mo_release`. See http://preshing.com/20120625/memory-ordering-at-compile-time/ for examples of using exactly this. The Linux kernel defines its own macros instead of using `stdatomic`, and that's exactly what it does. `__smp_rmb()` ends up defined as `barrier()`, which is just an empty asm statement. – Peter Cordes Jul 09 '17 at 22:03
  • All of this is because x86 has a strong memory model, but C++ has a weak memory model. Preventing compile-time reordering is all you need to do. Inserting `lfence` or `sfence` may not hurt performance much, but they're not necessary if you haven't used weakly-ordered MOVNT loads or stores. – Peter Cordes Jul 09 '17 at 22:06
This is very naïve. See 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations and 8.2.3.5 Intra-Processor Forwarding Is Allowed - [Intel Programmers Manual](https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf) Even on Intel processors with strong memory order you are not guaranteed to see data in the order you expect without fences. And I am not mentioning other platforms where rules are more relaxed. Unless you have total control over what assembly your compiler will generate, you can't say for sure that your code is resistant to this reordering. – Marek Vitek Jul 10 '17 at 12:24
  • @MarekVitek: SFENCE and LFENCE don't help you avoid those kinds of reordering, only MFENCE does that. See [Does SFENCE prevent the Store Buffer hiding changes from MESI?](https://stackoverflow.com/q/32681826), and [Why is (or isn't?) SFENCE + LFENCE equivalent to MFENCE?](https://stackoverflow.com/a/50322404). To get a release-store in C++, you only need to tell your compiler that's what you want. `_mm_sfence()` has that effect, but it also forces it to emit a useless `sfence` asm instruction. There are other options that don't have that side effect, like `asm("" ::: "memory");`. – Peter Cordes Jun 08 '18 at 13:22
  • @PeterCordes in the code example Mark presented, LFENCE and SFENCE are just fine. Maybe even a memory barrier will do the trick, unless the compiler emits a weakly-ordered instruction. But yes, there are cases when you will need a full fence and thus use MFENCE. – Marek Vitek Jun 08 '18 at 19:03
  • @MarekVitek - you are wrong and Peter is right here. Intel has a relatively strong model, and stores aren't re-ordered with other stores and loads aren't re-ordered with other loads (except perhaps in the SLF scenario which doesn't apply here). So if you write the array element, and then update the index, any other CPU that sees the index update is guaranteed to see the write to the array element also. Of course, you need to prevent compiler re-ordering, still! `lfence` and `sfence` are largely useless as fences in x86 - they have only very obscure uses not related to above. – BeeOnRope Jun 08 '18 at 22:31
  • @BeeOnRope LFENCE is used for serializing load-from-memory instructions, but doesn't prevent reordering relative to writes. So it is usually used in consumer threads. SFENCE is used for serializing store-to-memory instructions, but doesn't prevent reordering relative to reads. So it is usually used in producer threads, where you write your data and then set a flag saying that the data is ready. MFENCE is used for serializing both loads and stores to memory. So it is used in threads that do both and read from memory shared between threads. – Marek Vitek Jun 18 '18 at 20:32
  • And yes, on x86 stores are not reordered with other stores. So you don't usually need SFENCE. Similarly it goes for loads. There are some exceptions to this in the form of instructions that don't follow these rules, like MOVNTI, MOVNTQ, ..., and that is the place where LFENCE and SFENCE come in handy. So as long as you or your compiler don't use these, then you are OK. And there are situations where you start combining reads and writes in a single thread. You should then be aware that stores can be reordered with reads that come later. So your code might say write to X and then read from Y. – Marek Vitek Jun 18 '18 at 20:32
  • But in fact it can be executed the other way around. If you need to make sure these are in proper order, then you use MFENCE as it makes sure that no store crosses any other store or load. And no load crosses any other load or store. I strongly recommend reading section 8.2.3 of the Intel Software Developer's Manual. And 8.2.2 where it explicitly says `Writes to memory are not reordered with other writes, with the following exceptions: — streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD);` – Marek Vitek Jun 18 '18 at 20:32
  • Given the above, you can assert that LFENCE and SFENCE have only very limited use, because stores are not reordered relative to other stores. But you might still need a memory barrier to tell the compiler not to reorder instructions during compilation. Same goes for loads. While on the other hand MFENCE is more useful as it protects from stores being reordered with loads. And last but not least this holds for x86 but not for other architectures that are being targeted by C++ programmers, where LFENCE and SFENCE instructions are much more useful. – Marek Vitek Jun 18 '18 at 20:33
  • @MarekVitek - you said a lot of things there and it seems basically correct, so I think you have come around to our position: for normal loads and stores `sfence` and `lfence` don't help, and so this answer is wrong. Only for special types of stores ("non temporal") that compilers don't emit unless you tell them to (or if they do use them, e.g., as part of an inlined `memcpy` implementation, they fence it) is `sfence` useful and `lfence` is generally not useful for ordering at all. Of course you need compiler barriers, but those aren't called `sfence` or `lfence`! – BeeOnRope Jun 18 '18 at 20:38
  • I don't think you can talk about non-x86 platforms in the same breath as you mention `sfence` and `lfence`: saying those are useful on non-x86 platforms is largely nonsense since those are _x86 specific barriers_! Yes, other platforms also have barriers but they aren't called `lfence` or `sfence` (that I've seen), so if you use those words, you are talking about x86. The way barriers work on other platforms is much more complex than x86, and deserves its own treatment. Note that @PeterCordes already mentioned the caveat with non-temporal stores. – BeeOnRope Jun 18 '18 at 20:41