
This tutorial says the following:

every load on x86/64 already implies acquire semantics and every store implies release semantics.

Now say I have the following code (I wrote my questions in the comments):

/* Global Variables */

int flag = 0;
int number1;
int number2;

//------------------------------------

/* Thread A */

number1 = 12345;
number2 = 678910;
flag = 1; /* This is a "store", so can I not use a release barrier here? */

//------------------------------------

/* Thread B */

while (flag == 0) {} /* This is a "load", so can I not use an acquire barrier here? */
printf("%d", number1);
printf("%d", number2);
Hadi Brais
James
  • You got decent help before, it is not clear why you persist in trying to do this wrong. – Hans Passant May 22 '18 at 17:51
  • @Hans Passant I don't want to do it like this, I am just trying to learn about this stuff (memory ordering, memory barriers, etc.). – James May 22 '18 at 17:53
  • You don't need a release *barrier*; `flag.store(1, std::memory_order_release)` (or the C11 stdatomic equivalent) would be sufficient (if you used `atomic_int flag`). Related: http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect. With flag as a plain `int`, not `atomic_int`, no amount of barriers can make this legal C11, because you're reading it while it's being written by another thread. Legacy pre-C11 code would have put some kind of memory barrier inside the loop as a hack to force the compiler to reload the value from memory even for a plain `int`. – Peter Cordes May 22 '18 at 23:16
  • You still have compiler-related issues that must be considered: 1- Making sure that the compiler allocates the variables from memory and never caches them in registers. 2- Ensuring that particular accesses to the variables are atomic. 3- Ensuring that the compiler does not reorder the accesses. All of these issues are language- and compiler-related; they are not directly caused by x86-64. I may come back after some time and write an answer in case no one else answered. – Hadi Brais May 23 '18 at 01:37
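For concreteness, a minimal runnable sketch of the stdatomic approach mentioned in the last two comments might look like the following. The thread wrappers, `main`, and the build line are assumptions added for illustration, not part of the question (build e.g. with `gcc -std=c11 -pthread`):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int flag = 0;   /* only the flag needs to be atomic */
int number1;
int number2;

void *thread_a(void *arg) {
    (void)arg;
    number1 = 12345;
    number2 = 678910;
    /* Release store: the writes above may not be reordered past this point
       and become visible to any thread that acquires the same flag. */
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

void *thread_b(void *arg) {
    (void)arg;
    /* Acquire load: once this reads 1, the writes made before the release
       store are guaranteed to be visible here. */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0) { }
    printf("%d\n", number1);
    printf("%d\n", number2);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

On x86-64 the release store and the acquire load each compile down to an ordinary mov, which is exactly what the quoted tutorial is pointing at; the difference is that here the compiler is also told not to break the ordering.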

2 Answers

3

The tutorial is talking about loads and stores at the assembly/machine level, where you can rely on the x86 ISA semantics which include acquire-on-load and release-on-store.

Your code, however, is C, which provides no such guarantees at all, and the compiler is free to transform it into something entirely different from what you'd expect in terms of loads and stores. This isn't theoretical; it happens in practice. So the short answer is: no, it's not possible to do that portably and legally in C - although it might work if you get lucky.
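To make that concrete, here is one transformation the optimizer is allowed to perform on the plain-int spin loop (a hypothetical illustration of legal compiler behavior, not the output of any particular compiler):

/* What was written (flag is a plain, non-atomic, non-volatile int): */
while (flag == 0) { }

/* What the compiler may legally produce instead: since a concurrent write to
   flag would be a data race (undefined behavior), it may assume flag never
   changes, hoist the load out of the loop, and never re-read it. */
if (flag == 0) {
    for (;;) { }
}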

BeeOnRope
  • Let's say that I used `volatile` on the global variables (this way the compiler will keep the global variables in memory instead of optimizing them into registers), and let's say that I also used compiler barriers to prevent compiler reordering. Will my code now work the way I expected? – James May 23 '18 at 05:09
  • Of course, it may again work by luck (with increasing probability), but `volatile` isn't linked with multi-threaded access in the C11 memory model, so you'll still be in undefined behavior land. It's worth noting that prior to the C11 memory model, the standard was just silent on this issue, so the only way to do things was with `volatile` (sometimes, often you don't need it since an opaque function call has the desired effect on the compiler) and non-portable memory barriers (in addition to non-portable atomic operations, of course). – BeeOnRope May 23 '18 at 05:36
  • You might be better off asking this question in assembly, or removing the `C` tag and explaining that this is just `C`-like pseudo-code that compiles down, line-by-line to what you'd expect at the assembly level if you don't know or don't want to write the example in assembly. @James – BeeOnRope May 23 '18 at 05:37
  • I have read that C11 knows about threads, so you can write a multithreaded program in C11, compile it for whatever CPU architecture you want, and the program's behavior will remain the same. But let's say that I am using a pre-C11 C standard and I want to write a multithreaded program; in that case I have to think in terms of the memory model of the CPU architecture that the compiler is targeting, correct? Now let's say that I compiled my code for x86 using C99, wouldn't my code now work the way I expected (since x86 guarantees acquire-on-load and release-on-store)? – James May 23 '18 at 12:14
  • @James: No, because you're still writing it in C so C's weak memory model applies for compile-time reordering. (Or non-existent / de-facto memory model if you compile as C99.) [GCC's reordering of read/write instructions](https://stackoverflow.com/q/22106843) / [Is there any compiler barrier which is equal to asm("" ::: "memory") in C++11?](https://stackoverflow.com/q/40579342) – Peter Cordes May 23 '18 at 13:05
  • @Peter Cordes What if I used compiler barriers, would my code now work as I expected (I'm still talking about compiling using C99 under x86). – James May 23 '18 at 15:23
  • @James: yes, barriers (or compiler built-in functions like `__sync_compare_and_swap`) are how pre-C11 lock-free code was done, back in the bad old days before compilers understood atomic ops and let you tell them exactly what you wanted using C11 stdatomic. – Peter Cordes May 23 '18 at 15:29
1

Let's assume that the code is written in C99 and the target architecture is x86. The strong memory model of x86 takes effect only at the machine code level. C99 doesn't have a memory model. I'll explain what can go wrong and discuss whether there is a C99-compliant way of handling the issues.

First, we have to make sure that none of the variables get optimized away and that all accesses to flag, number1, and number2 actually go to memory rather than being cached in CPU registers (1). This can be achieved in C99 by qualifying all three variables with volatile.
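As a sketch, this first step amounts to nothing more than declaring the globals from the question as (assuming C99):

volatile int flag = 0;   /* every read and write must now really happen */
volatile int number1;
volatile int number2;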

Second, we have to ensure that the store to flag in the first thread has release semantics. These semantics include two guarantees: that the store to flag does not get reordered with previous memory accesses, and that the store becomes visible to the second thread. The volatile keyword tells the compiler that accesses to the variable may have observable side effects. This prevents the compiler from reordering accesses to volatile variables with respect to other operations that it also considers to have observable side effects. That is, by making all three variables volatile, the compiler will maintain the order of the three stores in the first thread. That said, if there are other non-volatile memory accesses above or below the store to flag, such accesses can still be reordered. So the standard volatile provides only partial release semantics.
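For example, a write to a hypothetical non-volatile variable (call it scratch; it is not part of the original code) is still fair game for the compiler to move across the store to flag:

volatile int flag, number1, number2;
int scratch;                   /* hypothetical helper, deliberately NOT volatile */

void thread_a_body(void)       /* hypothetical wrapper for thread A's code */
{
    number1 = 12345;           /* volatile: ordered relative to other volatiles */
    number2 = 678910;
    scratch = 42;              /* non-volatile: may legally be moved below flag = 1 */
    flag = 1;                  /* volatile store, but only a partial "release" */
}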

Third... actually for your particular piece of code, atomicity is not required. That's because the store to flag only changes one bit, which is inherently atomic. So for this particular code, you don't have to worry about atomicity. But in general, if the store to flag may change more than one bit and if the condition checked in the second thread may behave differently depending on whether it sees all or some of the bit changes, then you'd certainly need to ensure that accesses to `flag` are atomic. Unfortunately, C99 has no notion of atomicity.

To get full release semantics and atomicity, you can either use C11 atomics (as discussed in the article you cited) or resort to compiler-specific techniques (also discussed in the article you cited). Of course, you can still just look at the generated machine code and check whether the x86 memory model itself provides the guarantees needed for correctness. But this is not feasible on large code bases. In addition, the next time the code is compiled, the generated machine code may change. Finally, since you're merely a human, you may make a mistake.
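As an illustration of the compiler-specific route, a rough pre-C11 sketch might combine volatile with a GCC-style compiler barrier and lean on the x86 hardware ordering for the rest. This is an assumption-laden, non-portable example (it presumes GCC or Clang targeting x86-64 and plain pthreads; build e.g. with `gcc -std=gnu99 -O2 -pthread`), not the only way to do it:

#include <pthread.h>
#include <stdio.h>

/* Compiler barrier only: stops compile-time reordering, emits no instruction. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

volatile int flag = 0;
volatile int number1;
volatile int number2;

void *thread_a(void *arg) {
    (void)arg;
    number1 = 12345;
    number2 = 678910;
    COMPILER_BARRIER();    /* keep any non-volatile stores above the flag store */
    flag = 1;              /* on x86, an ordinary store already acts as a release */
    return NULL;
}

void *thread_b(void *arg) {
    (void)arg;
    while (flag == 0) { }  /* volatile forces a reload on every iteration */
    COMPILER_BARRIER();    /* keep later non-volatile loads below the flag load */
    printf("%d\n", number1);
    printf("%d\n", number2);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

Only the run-time ordering comes from the x86 memory model here, which is exactly why this style does not carry over to weaker architectures, and why, per the discussion above, it is still technically a data race as far as the C standard is concerned.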


(1) In the cited article, the variable A is declared as a shared global variable. Most probably the compiler will allocate it in memory. But is this strictly standard-compliant? What prevents the compiler from allocating it in a register for the whole lifetime of the program? Not sure about that.

Hadi Brais
  • *"if there are other non-volatile memory accesses that are above or below the store to flag, then such accesses can still be reordered"* But if there are other non-volatile memory accesses that are above the store to `flag`, they will not be reordered to below the store to `flag`, since a store implies release semantics in x86, correct? – James May 23 '18 at 18:13
  • @James Yes, at the x86 machine code level. But this is C code, so no. – Hadi Brais May 23 '18 at 18:15
  • You're right, I re-read your answer, I didn't pay attention the first time I read it that you were talking about the compiler doing the reordering. – James May 23 '18 at 18:43
  • With regard to your paragraph about atomicity, are you saying that a store to an `int` (which is 32-bit in size) under x86 is not atomic, and that if another thread is reading it, it can see only some of the bits changed? Because I thought that only a `long long` (which is 64-bit in size) is not atomic under x86 (because it requires two store instructions). – James May 23 '18 at 18:50
  • @James But `int` may not be 32-bit in size; this is compiler-specific. Its alignment is compiler-specific as well. If the compiler being used implements `int` as 32-bit and if it is contained within a cache line (not crossing a cache-line boundary), then x86-64 guarantees atomicity. The problem is really at the language level, not x86. – Hadi Brais May 23 '18 at 18:54
  • You say *Note that `volatile` does not tell the compiler to ensure that the store becomes immediately globally visible to other threads*. On x86 it does. C isn't really designed for machines that require explicit coherency, because there are no mechanisms to specify *which* earlier non-atomic stores need to be made globally visible. Anything stronger than relaxed or consume would always require a re-sync of everything with the global state. (I guess unless multithread-aware whole-program optimization proved that some stuff didn't need to be synced.) – Peter Cordes Jul 12 '18 at 10:29
  • Or was the emphasis on "immediate", as in not a barrier that makes this thread wait for earlier stores to become visible? – Peter Cordes Jul 12 '18 at 10:30
  • @PeterCordes Yeah. Specifically, `volatile` does not make the compiler emit `sfence` after each write to the volatile variable. – Hadi Brais Jul 12 '18 at 10:33
  • @HadiBrais: huh, so what? x86 doesn't need `sfence` for release semantics, unless you used NT stores. (And then you'd need it *before* the store to the `volatile`). With all 3 of the variables `volatile`, this would work on x86. (Because the compiler can't reorder accesses to `volatile` objects, and the HW provides sufficient runtime ordering for acq/rel). – Peter Cordes Jul 12 '18 at 10:42
  • *What prevents the compiler from allocating it in a register for the whole lifetime of the program?* Good question. If the compiler can't see the code for *all* functions (e.g. printf), it can't prove that no mutexes or atomic acq/rel operations took place, requiring other threads to see the value this thread assigned to a global. So I think the presence of any library (or other non-inline) function calls is what prevents whole-program optimization of a global into a register. – Peter Cordes Jul 12 '18 at 10:48
  • @PeterCordes Yes, I've edited the answer just to say that `sfence` is not required for release semantics. Regarding register allocation, makes sense. – Hadi Brais Jul 12 '18 at 11:10
  • The "immediately visible" wording is still confusing. If you're not talking about an explicit-coherency system (unlike normal CPUs), then it sounds like [If I don't use fences, how long could it take a core to see another core's writes?](https://stackoverflow.com/q/51292687) where you imply that fences speed up global visibility instead of actually making the current core/thread wait. The storing thread doesn't do any later loads, so even seq-cst (which would make the current thread wait) wouldn't make a difference. – Peter Cordes Jul 12 '18 at 11:28
  • @PeterCordes I removed the whole sentence. – Hadi Brais Jul 12 '18 at 11:35