
A frame is shared between user-space code and the kernel.

User-space code:

read frame          // read the frame's contents
_mm_mfence()        // don't "release" the frame before we have read everything
frame.status = 0    // "release" the frame

Kernel code:

poll for frame.status   // read the frame's status
_mm_lfence()

The kernel can poll it asynchronously, in another thread, so there is no syscall between the user-space code and the kernel.
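
For reference, here is roughly how I picture the same handshake if it were written with C++11 atomics instead of raw fences. This is only a sketch to show my intent: the Frame layout is made up, and the real kernel side would use the kernel's own primitives rather than std::atomic. My question below is about the raw-fence version above.

#include <atomic>
#include <cstdint>

struct Frame {
    std::atomic<int> status;     // 1 = owned by user space, 0 = released back to the kernel
    std::uint8_t payload[2048];  // frame content
};

void user_space_release(Frame& frame) {
    // ... read frame.payload here ...

    // Release store: the payload reads above cannot be reordered after this
    // store, so the kernel cannot observe status == 0 while we are still
    // reading, and compilers do not optimize away stores to atomics.
    frame.status.store(0, std::memory_order_release);
}

void kernel_side_poll(Frame& frame) {
    // Acquire load: once 0 is observed, the later reuse of the frame cannot
    // be reordered before this load. Polling can happen in another thread,
    // with no syscall involved.
    while (frame.status.load(std::memory_order_acquire) != 0) {
        // spin
    }
    // the frame may now be reused
}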


Is it correctly synchronized?

I doubt it, because of the following situation:

The compiler has a weak memory model, and we have to assume it can make changes as wild as you can imagine, as long as the optimized/changed program stays consistent within a single thread.

So, to my eye, we need a second barrier, because it is possible that the compiler optimizes out the store frame.status = 0.

Yes, it would be a very wild optimization, but if the compiler were able to prove that no one in the context (within the thread) reads that field, it could optimize it out.

I believe that this is theoretically possible, isn't it?

So, to prevent that, we can add a second barrier:

User-space code:

read frame          // read the frame's contents
_mm_mfence()        // don't "release" the frame before we have read everything
frame.status = 0    // "release" the frame
_mm_mfence()        // second barrier, intended to stop the compiler from optimizing the store out

OK, now the compiler restrains itself from that optimization.

What do you think?


EDIT

The question is raised by the issue that _mm_mfence does not prevent the store from being optimized out.

@PeterCordes, to make sure I understand: _mm_mfence does not prevent the store from being optimized out (it is just an x86 memory barrier, not a compiler barrier)? However, atomic_thread_fence(any_order) prevents reorderings (depending on any_order, obviously), but does it also prevent stores from being optimized out?

For example:

   // x is an int pointer
   *x = 5;
   *(x + 4) = 6;
   std::atomic_thread_fence(std::memory_order_release);

Does it prevent the stores to x from being optimized out? It seems that it must; otherwise every store to x would have to be volatile. However, I have seen a lot of lock-free code, and it does not mark fields as volatile.

Gilgamesz
  • What is the memory type from which `frame` is allocated? Is it WC? Are you using any instructions with WC semantics? Note that the `mfence` intrinsic is [more than just a compiler barrier](https://stackoverflow.com/questions/51251186/what-is-the-relationship-between-the-mm-sfence-intrinsic-and-a-sfence-instructi) and can be [expensive](https://stackoverflow.com/questions/50494658/are-loads-and-stores-the-only-instructions-that-gets-reordered/50496379#50496379). `lfence` can be even more expensive. So ideally you don't need to use them. – Hadi Brais Jul 22 '18 at 19:32
  • You never need `_mm_lfence` unless you're using NT loads from WC memory. Use a Linux `smp_rmb` (read memory barrier) for synchronization between CPUs. On x86, it's only a compiler barrier, no asm instructions. I don't see why you need `mfence` either; loads can't reorder with *later* stores. mfence makes sense for `store; mfence; load` situations to make a thread wait for the store to become globally visible. – Peter Cordes Jul 22 '18 at 19:33
  • Barriers are for ordering. Use `volatile` or `atomic` to prevent optimizing out loads/stores. Or in Linux, `read_once`. https://lwn.net/Articles/624126/. – Peter Cordes Jul 22 '18 at 19:35
  • @PeterCordes, I see that you assume the x86 memory model. OK, I tagged my question that way. But why don't you consider the compiler's reordering? – Gilgamesz Jul 22 '18 at 20:07
  • "loads can't reorder with later stores". Yes! But on `x86`. `read frame` can be reordered with `frame.status = 0` by a **compiler**. This is why I put a memory barrier there. And I put an `mfence` because `sfence` or `lfence` does not solve the problem. – Gilgamesz Jul 22 '18 at 20:21
  • "So, is my intuition correct that we have to use `std::atomic`, `volatile`, `ACCESS_ONCE` to make it correct?" - Yes, to prevent the compiler from optimizing out the assignment to that memory cell, you may use one of these methods. Note that this is **unrelated** to `mfence`. (But atomics may combine the store to memory with a CPU barrier.) Also note that the compiler won't optimize out a store to a *global variable*; it can only defer it. So you only need to worry in the case where `frame` is declared locally to the function, so the compiler may "prove" that the store isn't needed. – Tsyvarev Jul 22 '18 at 20:29
  • @Tsyvarev, thanks for the explanation. How can you be sure that the compiler cannot prove something? Do you have a proof that it can't? Peter Cordes, you are needed here as well :) – Gilgamesz Jul 22 '18 at 20:51
  • Of course I assumed x86. You used a non-portable `_mm_lfence()`. Anyway, what you actually need is a release barrier, the Linux kernel equivalent of `atomic_thread_fence(memory_order_release)`, to keep the loads ahead of the `frame.status = 0` store. Or does Linux provide a release-store function? I'm not that familiar with the details of the Linux kernel barriers / functions. IDK why you say I "don't consider compiler's reordering". `smp_rmb()` portably blocks that + runtime reordering. – Peter Cordes Jul 22 '18 at 22:20
  • ok, @PeterCordes, thanks for your response. So `atomic_thread_fence(memory_order_release)` will be only a compiler barrier (it is a no-op on `x86`). However, on an arch with a weak memory model it must be `mfence` (not `sfence` or `lfence`), yes? – Gilgamesz Jul 23 '18 at 05:27
  • Do you know whether it is necessary to make the write `frame.status = 0` volatile/atomic explicitly? Or can we just assume: OK, the compiler cannot optimize it out? – Gilgamesz Jul 23 '18 at 05:28
  • @Gilgamesz: no, on PowerPC it can be a `lwsync`. Light Weight Sync is a barrier for all reordering other than StoreLoad. On SPARC (non-TSO) a release barrier is #LoadStore and #StoreStore barriers. (In your case I think you only need #LoadStore, so if you wrote a SPARC version you could omit a #StoreStore barrier. x86's sfence is only StoreStore, so it wouldn't be sufficient if x86 didn't already block everything except StoreLoad all the time, but weakly-ordered ISAs have different fences than x86.) – Peter Cordes Jul 23 '18 at 11:59
  • I'd probably make the access volatile with `ACCESS_ONCE()`, because there's no reason not to. Better safe than sorry if you build with link-time optimization. @Tsyvarev's suggestion to assume you can figure out what the optimizer will do sounds like a bad idea with LTO. Or even with inlining within a translation unit, depending on what your functions do. – Peter Cordes Jul 23 '18 at 12:02
  • @PeterCordes: Yes, while a global variable "resists" compiler optimization, the linker may still drop it during LTO. I agree that using volatile or `ACCESS_ONCE` seems the best choice. One of the problems with lock-free programming is that it is easy to say "Use A, it will work". But very often it is hard to say confidently whether "B will work" or "B won't work". – Tsyvarev Jul 23 '18 at 13:21
  • @PeterCordes, thanks as always for the great explanation. I get it! :) – Gilgamesz Jul 23 '18 at 17:23
  • @PeterCordes, do you know how to store through a **raw** pointer with volatile semantics? So, I have an `int* p = any_address` and I would like to store an `int` as if the store were volatile. (`volatile int* p;` is obviously a volatile pointer, not a volatile pointee) – Gilgamesz Jul 24 '18 at 18:12
  • `volatile int*p` is a pointer-to-volatile, which does what you want. `int *volatile p` is a volatile pointer to regular `int`. – Peter Cordes Jul 24 '18 at 18:13
  • Yes, it was a bit quirky for me: I read this afternoon that `volatile int* p` is a `volatile pointer`, not a pointer to volatile, whereas `const int* p` is a pointer to a constant. But that could be nonsense. Thanks! :) – Gilgamesz Jul 24 '18 at 18:20
  • @PeterCordes, could you look at my edited post? – Gilgamesz Jul 29 '18 at 08:43
  • `_mm_mfence` is *also* a compiler barrier. (See [When should I use _mm_sfence _mm_lfence and _mm_mfence](https://stackoverflow.com/a/50780314), and also BeeOnRope's answer there). `atomic_thread_fence` with release, acq_rel, or seq_cst stops earlier stores from merging with later stores. But `mo_acquire` doesn't have to. – Peter Cordes Jul 29 '18 at 14:28
  • Probably I don't grasp what exactly you mean by merging. I see that `mo_release` stops earlier stores from reordering **after** the barrier. But can they **theoretically** be optimized out? I suppose not (that is what you mean by "merging", I think). – Gilgamesz Jul 29 '18 at 14:58

1 Answer


_mm_mfence is also a compiler barrier. (See When should I use _mm_sfence _mm_lfence and _mm_mfence, and also BeeOnRope's answer there).

atomic_thread_fence with release, acq_rel, or seq_cst stops earlier stores from merging with later stores. But mo_acquire doesn't have to.

Writes to non-atomic global variables can only be optimized out by merging with other writes to the same non-atomic variables, not by optimizing them away entirely. So the real question is what reorderings can happen that let two non-atomic assignments come together.

There has to be an assignment to an atomic variable in there somewhere for there to be anything that another thread could synchronize with. Some compilers might give atomic_thread_fence stronger behaviour wrt. non-atomic variables, but in C++11 there's no way for another thread to legally observe anything about the ordering of *x and x[4] in

#include <atomic>
std::atomic<int> shared_flag {0};
int x[8];

void writer() {
    *x = 0;
    x[4] = 0;
    std::atomic_thread_fence(std::memory_order_release);
    x[4] = 1;
    std::atomic_thread_fence(std::memory_order_release);
    shared_flag.store(1, std::memory_order_relaxed);
}

The store to shared_flag has to appear after the stores to x[0] and x[4], but it's only an implementation detail what order the stores to x[0] and x[4] happen in, and whether there are 2 stores to x[4].

For example, on the Godbolt compiler explorer gcc7 and earlier merge the stores to x[4], but gcc8 doesn't, and neither do clang or ICC. The old gcc behaviour does not violate the ISO C++ standard, but I think they strengthened gcc's thread_fence because it wasn't strong enough to prevent bugs in other cases.

For example,

void writer_gcc_bug() {
    *x = 0;
    std::atomic_thread_fence(std::memory_order_release);
    shared_flag.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    *x = 2;  // gcc7 and earlier merge this, which is arguably a bug
}

gcc only does shared_flag = 1; *x = 2; in that order. You could argue that there's no way for another thread to safely observe *x after seeing shared_flag == 1, because this thread writes it again right away with no synchronization. (i.e. data race UB in any potential observer makes this reordering arguably legal).
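
To make "any potential observer" concrete, here is a hypothetical reader (my illustration, not part of the original example), using the shared_flag and x declared above; its non-atomic load of *x races with the writer's unconditional *x = 2:

void hypothetical_reader() {
    if (shared_flag.load(std::memory_order_acquire)) {
        // writer_gcc_bug() stores *x = 2 right after setting shared_flag,
        // with no further synchronization, so this non-atomic load races
        // with that store: data-race UB no matter how the stores were ordered.
        int observed = *x;
        (void)observed;
    }
}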

But gcc developers don't think that's enough reason (and it may violate the guarantees of the builtin __atomic functions that the <atomic> header uses to implement the API). And there may be other cases where the aggressive reordering really is a bug that even a standards-conforming program could observe, which would violate the standard.

Apparently this changed on 2017-09 with the fix for gcc bug 80640.

Alexander Monakov wrote:

I think the bug is that on x86 __atomic_thread_fence(x) is expanded into nothing for x!=__ATOMIC_SEQ_CST, it should place a compiler barrier similar to expansion of __atomic_signal_fence.

(__atomic_signal_fence includes something as strong as asm("" ::: "memory" ).)

Yup that would definitely be a bug. So it's not that gcc was being really clever and doing allowed reorderings, it was just mostly failing at thread_fence, and any correctness that did happen was due to other factors, like non-inline function boundaries! (And that it doesn't optimize atomics, only non-atomics.)
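
For illustration only (my sketch, not the gcc patch itself): the kind of compiler barrier the fix adds is equivalent to writing an empty GNU asm statement with a "memory" clobber by hand, which forces both stores to *x to be emitted:

void writer_manual_barrier() {
    *x = 0;
    asm volatile("" ::: "memory");  // compiler barrier: the store to *x must be emitted before this point
    shared_flag.store(1, std::memory_order_relaxed);
    asm volatile("" ::: "memory");
    *x = 2;                         // cannot be merged with the earlier *x = 0 across the barrier
}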

Peter Cordes
  • "Writes to non-atomic global variables can only be optimized out by merging with other writes to the same non-atomic variables, not by optimizing them away entirely." Last week we said stores to global variables can be optimized away (elided) entirely (even at link time). So we concluded that a non-atomic store to a global variable should be `volatile` (if there is a concurrent observer). So I understand that we were wrong, and now we know that writes to global variables **cannot** be optimized out. They can be merged, and `atomic_thread_fence(mo_rel)` prevents that, yes? – Gilgamesz Jul 30 '18 at 05:34
  • *In last week we said stores to glob. variables can be optimized away (elided) entirely (even during linking time).* Yes, but only by folding store/reload/store or whatever; if a block of code assigns a global, then at least one store to the global will happen somewhere in the function. Unless whole-program link-time-optimization can prove that a global is never read anywhere, then I guess all stores to it could be optimized away (like you can easily see for a `static` variable in a single compilation unit.) This answer doesn't contradict comments on the question. – Peter Cordes Jul 30 '18 at 05:52
  • OK, so a compiler can only merge stores to global variables, but the linker is allowed to optimize out a `store` (because the linker can know whether there is another translation unit that reads the global variable or not). And this is why the answer does not contradict the comments. Yes? – Gilgamesz Jul 30 '18 at 08:06
  • OK, but it does not answer my primary question (from the edit of my post). I see that `atomic_thread_fence(mo_release);` stops those stores from merging, but I don't see whether a linker is allowed to optimize them out. So, again: `[plain_stores_to_non_global_non_atomic_variables]`, [`atomic_thread_fence(mo_release);`]. Is it possible that `plain_stores_to_non_global_non_atomic_variables` could be elided by a linker? If so, does it mean that we have to be paranoid and every store should be `volatile` (I mean stores to variables that are shared between threads, perhaps between different processes)? – Gilgamesz Jul 30 '18 at 08:16
  • It's not the linker itself that changes code at link time. If you build with link-time optimization, then your whole program is re-compiled at link-time, with cross-file inlining becoming possible. The machine code from the first round of compilation is ignored; only the GIMPLE or LLVM-IR in the object files is used. – Peter Cordes Jul 30 '18 at 15:16
  • @Gilgamesz: No, you don't have to be paranoid about non-atomic operations being optimized away, if they're part of a correct program that doesn't have any data races on non-atomic variables. i.e. write in one thread / read in another is done with something that establishes a happens-before relationship, e.g. a mutex or release/acquire synchronization between threads. In that case, any optimization that broke your program would not be allowed (except by a buggy compiler like maybe gcc7 and earlier with `thread_fence` in rare cases). – Peter Cordes Jul 30 '18 at 15:21
  • "(...) happens-before relationship, e.g. a mutex or release/acquire synchronization between threads (...)". Yes, that is the obvious and simplest situation. What about this situation: [plain_stores_to_non_global_non_atomic_variables], [atomic_thread_fence(mo_release);]? – Gilgamesz Jul 30 '18 at 15:45
  • @Gilgamesz: If nobody can legally see the difference, it doesn't matter whether they're deferred until later or not. As an implementation detail, gcc8 and other compilers make `thread_fence` a full barrier, like `asm("" ::: "memory")`. But if we just look at the ISO standard, then any optimization that doesn't break your program is legal. Optimizations that *do* break a valid program with no UB are not legal. – Peter Cordes Jul 30 '18 at 15:48
  • Yes, you're right. We assume only legal optimizations. Last week you said that it is safer to make a non-atomic store volatile to prevent it from being optimized away. Now the situation is the same, except that the non-atomic store is **before** thread_fence(mo_release). (And back then you did not argue that such an optimization could break the standard.)