
The C++ documentation for std::atomic_thread_fence says:

Establishes memory synchronization ordering of non-atomic and relaxed atomic accesses, as instructed by order, without an associated atomic operation. Note however, that at least one atomic operation is required to set up the synchronization, as described below.

The question is: why is at least one atomic operation required?

In my understanding, atomic_thread_fence acts like a load/store-queue flush, i.e. the same as smp_rmb/smp_wmb in the Linux kernel.

So it seems the following code should be OK:

int i = 0, j = 0;

// cpu0:
i = 1;
atomic_thread_fence(memory_order_release);
j = 2;

// cpu1:
int k = j;
atomic_thread_fence(memory_order_acquire);
if (k == 2) {
    assert(i == 1);
}

However, under the C++ memory model it is not OK. So the question is: what is the difference between atomic_thread_fence and smp_rmb/smp_wmb?

As a comparison, this kernel code is totally OK:

int i = 0, j = 0;

// cpu0:
i = 1;
smp_wmb();
j = 2;

// cpu1:
int k = j;
smp_rmb();
if (k == 2) {
    assert(i == 1);
}
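
For reference, a minimal sketch of how the same pattern can be written so that the C++ memory model does give the guarantee: keep the fences, but make `j` atomic and access it with relaxed operations (the `cpu0`/`cpu1`/`main` wrappers are only there to make the sketch self-contained):

#include <atomic>
#include <cassert>
#include <thread>

int i = 0;
std::atomic<int> j{0};   // j is now atomic; i stays a plain int

void cpu0() {
    i = 1;                                                 // non-atomic store
    std::atomic_thread_fence(std::memory_order_release);   // release fence
    j.store(2, std::memory_order_relaxed);                 // the atomic store the fence pairs with
}

void cpu1() {
    int k = j.load(std::memory_order_relaxed);             // the atomic load the fence pairs with
    std::atomic_thread_fence(std::memory_order_acquire);   // acquire fence
    if (k == 2) {
        assert(i == 1);   // guaranteed: the fences synchronize through the store/load of j
    }
}

int main() {
    std::thread t0(cpu0), t1(cpu1);
    t0.join();
    t1.join();
}

The relaxed store and relaxed load of `j` are the "associated atomic operation" the quoted documentation talks about: the release fence synchronizes with the acquire fence precisely because cpu1's load of `j` reads the value written by cpu0's store. This is essentially what the first comment below suggests.
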
  • In the code example you assume int reads and writes are atomic when updating and reading `j`. The C++ standard says they do not have to be, and the correct code should be at least something like: `std::atomic<int> j; ... j.store(2, std::memory_order_relaxed); ... int k = j.load(std::memory_order_relaxed);` – dewaffled Nov 10 '22 at 12:18
  • I think they want you to keep in mind that `atomic_thread_fence` alone is not a synchronization mechanism. You still need an atomic operation for that. – Jakob Stark Nov 10 '22 at 12:24
  • Fences aren't a blanket exemption from data race rules. The fence would protect you from the data race on `i`, but you still have a data race on `j` and so your code has undefined behavior. And there is no way to remove all the data races without having an atomic variable (or mutex) somewhere. – Nate Eldredge Nov 10 '22 at 15:03
  • The Linux kernel's use of atomics predates C11/C++11, so it does not follow the same rules. Instead, it relies on different guarantees that are only provided by gcc, and are not really well documented. (Or rather, it was written based on observing how gcc generated code in practice, and by hoping that those principles would continue to hold.) As a result the kernel cannot safely be compiled with any compiler other than gcc. – Nate Eldredge Nov 10 '22 at 15:08
  • @Nate Eldredge What do you mean by 'your code has undefined behavior'? Yes, you can say `j` is not written atomically, but it shouldn't need to be. The logic is: if k == 2, then assert(i == 1); if k is anything else, then i is not assured to be 1 -- just like smp_rmb/wmb. – user2256177 Nov 11 '22 at 01:21
  • @Jakob Stark Yes, that is what it says, but why? – user2256177 Nov 11 '22 at 01:25
  • For the ISO C++ standard to guarantee anything about what happens in your program, yes, `j` *does* need to be written and read atomically. Otherwise, if the two threads do run at nearly the same time, your program has data race undefined behaviour, and literally nothing is guaranteed about anything that happens in your program before or after that point. For example, if run under `clang -fsanitize=thread`, it might crash with an exception, as allowed by ISO C++. – Peter Cordes Nov 11 '22 at 01:33
  • If you're talking about some specific compiler like GCC, then you need to say so; only then can you talk about that compiler's memory-model rules, and the fact that `atomic_thread_fence(memory_order_release);` fully blocks compile-time reordering (I think?) in both directions even for non-atomic variables, like `asm("" ::: "memory")` does, even when there are no `__atomic_store()` operations involved. If that's true, then you also need to ask whether that's merely an implementation detail, or something that's guaranteed / supported by the GCC or libstdc++ docs for future versions. – Peter Cordes Nov 11 '22 at 01:36
  • See also https://lwn.net/Articles/793253/ - *Who's afraid of a big bad optimizing compiler?* re: why it's not always safe to *just* use `smp_rmb()` without using READ_ONCE or WRITE_ONCE accessor macros in Linux kernel code. With complex surrounding code, it might manage to break this by inventing loads. Probably not, though, but it's still a very good idea to make that racy access `volatile` (via those Linux macros). – Peter Cordes Nov 11 '22 at 01:41
  • @Peter Cordes Can you explain more about 'Otherwise, if the two threads do run at nearly the same time, your program has data race undefined behaviour'? For this specific code, where does the UB arise? `int k = j`? k can be anything, and that is expected. `if (k == 2)`? k is an int, and comparing an int against a value is not UB. `assert(i == 1)`? If that is UB, it means that even if k == 2, i does not have to be 1; I can't understand this, because that would mean atomic_thread_fence(memory_order_acquire) is a no-op. – user2256177 Nov 11 '22 at 01:52
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/249498/discussion-between-user2256177-and-peter-cordes). – user2256177 Nov 11 '22 at 01:57
  • Because ISO C++ clearly and explicitly says so. See quotes from and links to the standard in another answer: [Is it a data race?](https://stackoverflow.com/a/71897193). (This rule means that C++ *can* run on machines with hardware race detection, among other things. And that C++ can do optimizations that assume other threads aren't modifying variables, like optimizing `while(!stop_now){...}` into `if(!stop_now) while(1){...}` (see the sketch after these comments), which is useful for non-shared variables. They chose not to have memory barriers override this because there isn't much point; just use `atomic` with `relaxed`.) – Peter Cordes Nov 11 '22 at 02:01
  • C++ isn't portable assembly language; the formal ISO C++ memory model is based on establishing happens-before relationship, not on local reordering within threads. With enough compiler guarantees, it can kind of work like portable assembly language, like the ones you get from `asm("" ::: "memory")` barriers in GNU C++, if you're careful. But ISO C++ isn't specified that way. – Peter Cordes Nov 11 '22 at 02:08
  • @user2256177: The point is that "undefined behavior" means the bad effects of a data race are **not** limited to simply reading the wrong value. A compiler is allowed to optimize in ways that would break this code in any imaginable or unimaginable way. For example, *because* unsynchronized concurrent access is defined as a data race and UB, the compiler can assume it never happens. So if `cpu1` executes in a loop, the compiler can decide that re-checking the value of `j` is redundant: cpu1 doesn't write to it, and if any other thread did so, it would be a data race, which is illegal. – Nate Eldredge Nov 11 '22 at 14:56
  • Therefore, concludes the compiler, nobody is writing to `j` at all, and so there is no point in re-reading it, since its value can't have changed. If the compiler can tell that the value of `j` was not 2 on entry to the loop, it might remove the `if (k == 2)` branch altogether as dead code. – Nate Eldredge Nov 11 '22 at 14:56
  • For another example, look at the third code snippet [here](https://stackoverflow.com/questions/71866535/which-types-on-a-64-bit-computer-are-naturally-atomic-in-gnu-c-and-gnu-c-m/71867102#71867102), with a so-called "invented load". In the presence of a data race, you can get "impossible" behavior, where a local variable (which has never been passed outside the function) appears to spontaneously change its value. – Nate Eldredge Nov 11 '22 at 15:02
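
A hypothetical sketch of the kind of transformation described in the comments above (the `stop_now` flag and `worker` function are made-up names, not from this question): with a plain non-atomic flag, the compiler is allowed to assume that no other thread modifies it and to hoist the load out of the loop.

bool stop_now = false;   // plain, non-atomic flag (hypothetical)

void worker() {
    // No atomic operations or fences touch stop_now in this loop, so the
    // compiler may assume no other thread writes it and is allowed to
    // transform the loop into:  if (!stop_now) while (true) { /* ... */ }
    while (!stop_now) {
        // ... do work ...
    }
}

The same reasoning would let a compiler drop a re-check of `j` in the question's code if it were placed in a loop: an unsynchronized concurrent write would be a data race, so the compiler may assume it never happens.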

0 Answers