
I want to store data in a large array with _mm256_stream_si256() called in a loop. As I understand it, a memory fence is then needed to make these changes visible to other threads. The description of _mm_sfence() says

Perform a serializing operation on all store-to-memory instructions that were issued prior to this instruction. Guarantees that every store instruction that precedes, in program order, is globally visible before any store instruction which follows the fence in program order.

But will my recent stores of the current thread be visible to subsequent load instructions too (in the other threads)? Or do I have to call _mm_mfence()? (The latter seems to be slow)

UPDATE: I saw this question earlier: [when should I use _mm_sfence _mm_lfence and _mm_mfence](https://stackoverflow.com/questions/4537753/when-should-i-use-mm-sfence-mm-lfence-and-mm-mfence). The answers there focus rather on when to use fences in general. My question is more specific, and the answers there are unlikely to address it (and currently don't).

UPDATE2: following the comments/answers, let's define "subsequent loads" as the loads in a thread that subsequently takes the lock which the current thread currently holds.
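
For concreteness, here is a minimal sketch of the pattern in question (the function and variable names are made up):

#include <immintrin.h>
#include <cstddef>

void fill(__m256i *arr, std::size_t n, __m256i value) {
    for (std::size_t i = 0; i < n; i++) {
        _mm256_stream_si256(&arr[i], value);  // non-temporal store, bypasses cache
    }
    _mm_sfence();  // is this enough for later loads in other threads?
}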

Serge Rogatch
  • Possible duplicate of [when should I use _mm_sfence _mm_lfence and _mm_mfence](https://stackoverflow.com/questions/4537753/when-should-i-use-mm-sfence-mm-lfence-and-mm-mfence) – Richard Critten Jul 01 '17 at 18:25
  • Accessing recently stored data breaks the whole purpose of `_mm256_stream_si256`, which is to write into memory bypassing cache when you know that you won't access recently stored data. – user7860670 Jul 01 '17 at 18:29
  • @VTT, usually it's not accessed immediately. But this may occasionally happen, and I want the program to be correct in that case. – Serge Rogatch Jul 01 '17 at 18:30
  • Well, if it happens only occasionally then there is no point to bother with performance impact of fencing since you will most likely have to deal with cache miss as well and both of these won't happen often enough to impact the performance considerably. – user7860670 Jul 01 '17 at 18:34
  • @VTT, fence is called systematically, that's why I want it to be fast. – Serge Rogatch Jul 01 '17 at 18:38
  • That's about neither the C nor the C++ language, but some architecture/compiler-specific extension or machine code. – too honest for this site Jul 01 '17 at 18:44
  • re: your edit: did you mean to ask " will my recent stores be **globally** visible **before** subsequent load instructions too"? i.e. whether `sfence` stops [StoreLoad reordering](http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/) in the order of your thread's stores and loads becoming globally visible? (It doesn't, only `mfence` prevents StoreLoad reordering). The two sentences where you bolded things are talking about completely different things. I'm still not sure if you really were just wondering about visibility in the thread that did the store. – Peter Cordes Jul 02 '17 at 06:39
  • @PeterCordes , such a subtle topic :) . I meant to ask "will my recent stores from the current thread be visible to subsequent loads in the other threads?" (after `sfence`) – Serge Rogatch Jul 02 '17 at 07:15
  • What do you mean by "subsequent", then? Are you talking about a flag variable creating [a "synchronizes-with" relationship](http://preshing.com/20130823/the-synchronizes-with-relation/) with something in the other thread? It sounds like you're confused about something fundamental, or else I'm just not understanding your questions, but I'm not sure what. `sfence` doesn't make stores instantly visible, it just limits *reordering* of the current thread's stores. With or without it, all stores will eventually become globally visible! – Peter Cordes Jul 02 '17 at 07:56
  • Err, at least you *were* confused. You're all sorted out now, right? – Peter Cordes Jul 02 '17 at 08:19
  • @PeterCordes, by "subsequent" I mean happening later in time. "`sfence` doesn't make stores instantly visible, it just limits reordering of the current thread's stores" - this is close. There will eventually be some synchronization, at least a spin lock with `memory_order_acquire` on enter and `memory_order_release` on exit. So perhaps I can just omit `sfence`. However, I wrote the current question because I saw a suggestion to use `_mm_sfence()` after `_mm256_stream_si256()` e.g. here https://stackoverflow.com/a/37092/1915854 – Serge Rogatch Jul 02 '17 at 10:30
  • Ok, then yes you need `sfence` after NT stores to make sure they don't violate the release semantics of a later store that you're using for synchronization. (Since unlocking a spinlock is usually just a release-store, not a `lock xadd` or something). The part of my answer that explains how normal synchronization stuff doesn't try to deal with weakly-ordered stores applies to normal spinlock library functions as well as lock-free `std::atomic` stuff. – Peter Cordes Jul 02 '17 at 10:44
  • *by "subsequent" I mean happening later in time.* There's no way to make this happen unless you limit when those loads can be executed, by using something that synchronizes the producer thread with the consumer. As worded, you're asking for `sfence` to make NT stores globally visible the instant it executes, so that loads on other cores that execute 1 clock cycle after `sfence` will see the stores. A sane definition of "subsequent" would be "in the next thread that takes the lock this thread currently holds". – Peter Cordes Jul 02 '17 at 10:50
  • @PeterCordes, it seems clear to me now, thanks! – Serge Rogatch Jul 02 '17 at 10:52
  • finally :) I did a quick update of my answer to insert some of that, especially to make a point of mentioning normal locking instead of a lock-free producer-consumer model. – Peter Cordes Jul 02 '17 at 11:03
  • It's better not to even talk about _time_ when discussing memory models and concurrency, since there just isn't any "global clock" that can be used to determine some type of global ordering. Not only does such a clock not exist in an engineering or architectural sense, the concept of some global time isn't even well-founded in a [deep physical sense](https://en.wikipedia.org/wiki/Relativity_of_simultaneity). That's exactly why, Peter, machine memory models and language memory models are almost always expressed in terms of reorderings, happens-before, or other _relative_ concepts. – BeeOnRope Jul 03 '17 at 00:09

2 Answers

11

But will my recent stores be visible to subsequent load instructions too?

This sentence makes little sense. Loads are the only way any thread can see the contents of memory. Not sure why you say "too", since there's nothing else. (Other than DMA reads by non-CPU system devices.)

The definition of a store becoming globally visible is that loads in any other thread will get the data from it. It means that the store has left the CPU's private store-buffer and is part of the coherency domain that includes the data caches of all CPUs. (https://en.wikipedia.org/wiki/Cache_coherence).

CPUs always try to commit stores from their store buffer to the globally visible cache/memory state as quickly as possible. All you can do with barriers is make this thread wait until that happens before doing later operations. That can certainly be necessary in multithreaded programs with streaming stores, and it looks like that's what you're actually asking about. But I think it's important to understand that NT stores do reliably become visible to other threads very quickly even with no synchronization.

A mutex unlock on x86 is sometimes a lock add, in which case that's already a full fence for NT stores. But if you can't rule out a mutex implementation using a simple mov store, then you need at least sfence at some point after the NT stores, before the unlock.
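
For instance, with a hypothetical spinlock whose unlock is just a plain release-store (a common implementation; the names here are made up), a minimal sketch:

#include <immintrin.h>
#include <atomic>

std::atomic<bool> lock_flag;  // hypothetical spinlock flag; true = held

void store_and_unlock(__m256i *dst, __m256i data) {
    _mm256_stream_si256(dst, data);  // weakly-ordered NT store
    _mm_sfence();                    // order the NT store before the unlock
    lock_flag.store(false, std::memory_order_release);  // unlock: plain mov on x86
}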


Normal x86 stores have release memory-ordering semantics (C++11 std::memory_order_release). MOVNT streaming stores have relaxed ordering, but mutex / spinlock functions, and compiler support for C++11 std::atomic, basically ignore them. For multi-threaded code, you have to fence them yourself to avoid breaking the synchronization behaviour of mutex / locking library functions, because those only synchronize normal x86 strongly-ordered loads and stores.

Loads in the thread that executed the stores will still always see the most recently stored value, even from movnt stores. You never need fences in a single-threaded program. The cardinal rule of out-of-order execution and memory reordering is that it never breaks the illusion of running in program order within a single thread. The same goes for compile-time reordering: since concurrent read/write access to shared data is C++ Undefined Behaviour, compilers only have to preserve single-threaded behaviour unless you use fences to limit compile-time reordering.
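
For example, a sketch (hypothetical names; no fence is needed because only this thread reads the data back):

#include <immintrin.h>

void same_thread_reload(__m256i *arr) {
    __m256i v = _mm256_set1_epi32(42);
    _mm256_stream_si256(&arr[0], v);         // NT store
    __m256i r = _mm256_load_si256(&arr[0]);  // same thread: always sees the 42s
    (void)r;                                 // no fence was needed
}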


MOVNT + SFENCE is useful in cases like producer-consumer multi-threading, or with normal locking where the unlock of a spinlock is just a release-store.

A producer thread writes a big buffer with streaming stores, then stores "true" (or the address of the buffer, or whatever) into a shared flag variable. (Jeff Preshing calls this a payload + guard variable).

A consumer thread is spinning on that synchronization variable, and starts reading the buffer after seeing it become true.

The producer must use sfence after writing the buffer, but before writing the flag, to make sure all the stores into the buffer are globally visible before the flag. (But remember, NT stores are still always locally visible right away to the current thread.)

(With a locking library function, the flag being stored to is the lock. Other threads trying to acquire the lock are using acquire-loads.)

#include <immintrin.h>
#include <atomic>
#include <cstddef>

std::atomic<bool> buffer_ready;

void producer(__m256i *buf, std::size_t len, __m256i data) {
    for (std::size_t i = 0; i < len; i++) {
        _mm256_stream_si256(&buf[i], data);  // weakly-ordered NT store
    }
    _mm_sfence();  // order all NT stores before the release-store below

    buffer_ready.store(true, std::memory_order_release);
}

The asm would be something like

 vmovntdq  [buf], ymm0        ; NT stores into the buffer
 ...
 sfence                       ; order earlier NT stores before later stores
 mov  byte [buffer_ready], 1  ; release-store of the flag

Without sfence, some of the movnt stores could be delayed until after the flag store, violating the release semantics of the normal non-NT store.

If you know what hardware you're running on, and you know the buffer is always large, you might get away with skipping the sfence if you know the consumer always reads the buffer from front to back (in the same order it was written). In that case it's probably not possible for stores to the end of the buffer to still be in flight in a store buffer of the core running the producer thread by the time the consumer thread gets to the end of the buffer.
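
For completeness, a matching consumer sketch (hypothetical names; it assumes the buffer_ready flag and the producer shown above):

void consumer(const __m256i *buf, std::size_t len) {
    while (!buffer_ready.load(std::memory_order_acquire)) {
        _mm_pause();  // reduce power / SMT contention while spin-waiting
    }
    // The acquire-load above synchronizes-with the producer's release-store,
    // so all the NT stores into buf are guaranteed visible from here on.
    for (std::size_t i = 0; i < len; i++) {
        __m256i v = _mm256_load_si256(&buf[i]);  // normal cached load
        (void)v;  // ... consume the data ...
    }
}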


(in comments) by "subsequent" I mean happening later in time.

There's no way to make this happen unless you limit when those loads can be executed, by using something that synchronizes the producer thread with the consumer. As worded, you're asking for sfence to make NT stores globally visible the instant it executes, so that loads on other cores that execute 1 clock cycle after sfence will see the stores. A sane definition of "subsequent" would be "in the next thread that takes the lock this thread currently holds".


Fences stronger than sfence work, too:

Any atomic read-modify-write operation on x86 needs a lock prefix, which is a full memory barrier (like mfence).

So if you, for example, increment an atomic counter after your streaming stores, you don't also need sfence. Unfortunately, in C++ std::atomic and _mm_sfence() don't know about each other, and compilers are allowed to optimize atomics following the as-if rule. So it's hard to be sure that a locked RMW instruction will be in exactly the place you need it in the resulting asm.

(Basically, if a certain ordering is possible in the C++ abstract machine, the compiler can emit asm that makes it always happen that way. e.g. fold two successive increments into one +=2 so that no thread can ever observe the counter being an odd number.)

Still, the default mo_seq_cst prevents a lot of compile-time reordering, and there's not much downside to using it for a read-modify-write operation when you're only targeting x86. sfence is quite cheap, though, so it's probably not worth the effort trying to avoid it between some streaming stores and a locked operation.
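
A hypothetical sketch of that idea (illustrative names; note the caveat above about where the compiler actually places the locked RMW):

#include <immintrin.h>
#include <atomic>

std::atomic<int> chunks_done{0};

void publish_chunk(__m256i *dst, __m256i data) {
    _mm256_stream_si256(dst, data);  // weakly-ordered NT store
    // A seq_cst fetch_add compiles to lock xadd on x86, a full barrier,
    // so in principle no separate sfence is needed; adding _mm_sfence()
    // anyway is cheap insurance against unexpected compiler placement.
    chunks_done.fetch_add(1, std::memory_order_seq_cst);
}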

Related: pthreads v. SSE weak memory ordering. The asker of that question thought that unlocking a lock would always do a locked operation, thus making sfence redundant.


C++ compilers don't try to insert sfence for you after streaming stores, even when there are std::atomic operations with ordering stronger than relaxed. It would be too hard for compilers to reliably get this right without being very conservative (e.g. sfence at the end of every function with an NT store, in case the caller uses atomics).

The Intel intrinsics predate C11 stdatomic and C++11 std::atomic. The implementation of std::atomic pretends that weakly-ordered stores don't exist, so you have to fence them yourself with intrinsics.

This seems like a good design choice, since you only want to use movnt stores in special cases, because of their cache-evicting behaviour. You don't want the compiler ever inserting sfence where it wasn't needed, or using movnti for std::memory_order_relaxed.

Peter Cordes
-1

But will my recent stores of the current thread be visible to subsequent load instructions too (in the other threads)? Or do I have to call _mm_mfence()? (The latter seems to be slow)

The answer is NO. You are not guaranteed to see previous stores from one thread without making any synchronization attempt in the other thread. Why is that?

  1. Your compiler could reorder instructions
  2. Your processor can reorder instructions (on some platforms)

In C++, the compiler is required to emit sequentially consistent code, but only for single-threaded execution. So consider the following code:

int x = 5;
int y = 7;
int z = x;

In this program the compiler can choose to move x = 5 after y = 7, but no later, as that would be inconsistent.
If you then consider the following code in the other thread:

int a = y;
int b = x;

The same instruction reordering can happen here, as a and b are independent of each other. What will the result of running those threads be?

a    b
7    5
7    ?    (whatever was stored in x before the assignment of 5)
...

And we can get this result even if we put a memory barrier between x = 5 and y = 7, because without also putting a barrier between a = y and b = x, you never know in which order they will be read.

This is just a rough presentation of what you can read in Jeff Preshing's blog post [Memory Ordering at Compile Time](http://preshing.com/20120625/memory-ordering-at-compile-time/).
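
A minimal sketch (assuming shared globals and C++11 atomics, which the example above leaves out) of how fencing both sides makes the outcome well-defined:

#include <atomic>

std::atomic<int> y{0};
int x = 0;

void writer() {
    x = 5;
    y.store(7, std::memory_order_release);  // orders the x = 5 store before y = 7
}

void reader() {
    if (y.load(std::memory_order_acquire) == 7) {  // pairs with the release-store
        int b = x;  // guaranteed to read 5; no data race once y == 7 was seen
        (void)b;
    }
}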

Marek Vitek
  • *In this program compiler can chose to put x = 5 after y = 7 but no later as it will be inconsistent.* No, as long as the compiler's asm output loads the old value of `x` before the `x=5` store, it can delay the `x=5` store as long as it wants (e.g. sink it out of a loop and keep the value of `x` live in a register (or as an immediate operand like `mov dword [x],5` if it's really a compile-time constant) , only storing the final value of `x` before returning). – Peter Cordes Jul 08 '17 at 03:10
  • *required to emit sequentially consistent code (for single-threaded execution)* is not a good way of describing things. The values in memory when a function returns have to match what the source code says. (after inlining and inter-procedural optimizations like optimizing away `static` variables whose address doesn't escape the compilation unit). The asm that achieves that result doesn't have to bear any resemblance to the order the C++ source does things in. – Peter Cordes Jul 08 '17 at 03:17
  • e.g. loop inversion optimization could write an array in row-major order even if the source says column-major. The compiler has to prove this is safe (e.g. any non-inline function calls that could have a pointer to the memory in question have to see the right values, as well as not changing the results of the function itself), but loop inversion is how some compilers "defeated" some of the benchmarks in SPECint or SPECfp (I forget which), making them trivial and meaningless. – Peter Cordes Jul 08 '17 at 03:21
  • Also note that `x = 5;` is a C++ assignment. Whether or not it compiles to an asm store instruction *anywhere* in your function depends on the surrounding code. Local variables with automatic storage can often stay in registers, or be optimized away entirely. – Peter Cordes Jul 08 '17 at 03:25
  • You are wrong, the compiler can't put `int x = 5;` after `int z = x;`. It wouldn't be consistent. And regarding the rest of your comment - sequential consistency [Leslie Lamport, 1979]: the result of any execution is the same as if 1. the operations of all threads are executed in some sequential order and 2. the operations of each thread appear in this sequence in the order specified by their program. So for a single thread you can reorder as long as you maintain consistency with the original code. More detailed information can be found in §1.10 of the C++ standard. – Marek Vitek Jul 08 '17 at 19:42
  • I'm talking about where the compiler stores to memory in its assembly output. Of course it still has to set `z = 5` if `x = 5` appeared first in the source. But it doesn't have to touch the memory for `x` until later (if `x` has a memory location at all), **because nothing else is allowed to observe the memory locations while this thread is running**, because `z` and `x` are `int`, not `std::atomic`. Reading `x` and `z` from another thread while they're being written is Undefined Behaviour in C++, which is what allows the compiler to store to memory in whatever order it wants. – Peter Cordes Jul 09 '17 at 03:42
  • I can prove it with a simple example: `void foo() {` `x = 5; z = x; x = 7; }` compiles to only two stores. The `x=5` store never appears in the asm output, because the compiler delays it until it can collapse with the `x=7` store. See gcc and clang asm output for x86-64 here: https://godbolt.org/g/H3aTKr The store to `z` of course stores `5`, because that's the current value of `x` at that point in program order. The compiler doesn't reorder the source itself, it re-orders the asm that implements a function that behaves the same as the source would, for a single thread. – Peter Cordes Jul 09 '17 at 03:46
  • @PeterCordes If you don't like my post, it's fine. But please stop making a fool of yourself. With each post it is getting worse. Visibility of data in another thread has nothing to do with being declared as std::atomic. You have just proved nothing other than the fact that a literal assignment can be optimized out. If you missed it, the code I presented was to demonstrate the need for memory fences in both threads, not only in one as the OP assumed. In fact it wasn't code at all, since the way I wrote it, it wouldn't even be visible in the other thread, as all variables are local automatic. Enjoy your day. – Marek Vitek Jul 09 '17 at 20:36