    volatile bool b;

    // Thread 1: only reads b
    void f1() {
        while (1) {
            if (b) { /* do something */ }
            else   { /* do something else */ }
        }
    }

    // Thread 2: only sets b to true if a certain local condition is met
    void f2() {
        while (1) {
            // some local condition evaluated each iteration -> local_cond
            if (!b && local_cond) b = true;
            // some other work
        }
    }

    // Thread 3: only sets b to false when a message arrives on a socket it is listening to
    void f3() {
        while (1) {
            // select() on the socket
            if (expected_message_arrived) b = false;
            // do some other work
        }
    }

If thread2 updates b first at time t and later thread3 updates b at time t+5:

Will thread 1 see the latest value "in time" whenever it reads b?

For example: reads from t+delta to t+5+delta should return true, and reads after t+5+delta should return false.

Here delta is the time it takes a store to b to become visible in memory after thread 2 or thread 3 updates it.

Youli Luo
    `volatile` isn't for threading. – Eljay Jul 23 '22 at 02:52
  • @Eljay - I am not doing any critical section here. This question is more about understanding of volatile variable when used from multiple threads or processes (linux/x86_64 c++) – Youli Luo Jul 23 '22 at 02:56
  • @Eljay: Note the [linux-kernel] tag. That project *does* successfully roll its own atomics using `volatile` (and inline asm for barriers for ordering.) But the Linux kernel is written in C, not C++, so the tags don't make sense here. – Peter Cordes Jul 23 '22 at 02:57
  • 1
    @YouliLuo: Nobody said anything about critical sections. C/C++ `volatile` isn't for lockless code either; that's what C++ `std::atomic` is for. [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) - basically never, use `std::atomic` with `std::memory_order_release` or `relaxed` if that's what you want. – Peter Cordes Jul 23 '22 at 02:59
  • @PeterCordes this is in pre C++11 environment – Youli Luo Jul 23 '22 at 03:00
  • @PeterCordes trying to make the volatile work in this scenario which is of a narrow scope under old C++ – Youli Luo Jul 23 '22 at 03:02
  • 1
    Stores on one core aren't visible instantly to other cores; after they execute, there's some latency before it actually commits to L1d cache. If `t+5` is 5 clocks or nanoseconds later, then inter-thread latency is significant on your timescale. But if it's like 5 seconds, then sure, volatile visibility is close enough to instantaneous. See the "latest value" section of [Is a memory barrier required to read a value that is atomically modified?](https://stackoverflow.com/q/71718224) – Peter Cordes Jul 23 '22 at 03:08
  • 1
    And see the coherency guarantees for atomics in the C++ standard (https://eel.is/c++draft/intro.races#19); `volatile` store works like atomic<> with `relaxed` when compiling for normal ISAs, including x86-64. (You sometimes get acquire/release but there's no guarantee of avoiding compile-time reordering. So it's not safe to be doing `!b && a == 100` if another thread could be modifying `a` and then writing b to signal it was done. Or if the comment in the code is right that `a` is only modified by that thread, then the if condition might never fire.) – Peter Cordes Jul 23 '22 at 03:11
  • @PeterCordes yes only thread2 can. change a, lets say it is doing some computation which is not written in function above to get value of a, so a==100 is definitely possible and like you said with sufficient latency between updates done by threads 2 and 3, thread2 can see the change of thread3, do you know what is typical latency of one thread to see change done by another thread to a one byte variable, both running on same core and also different core? – Youli Luo Jul 23 '22 at 03:18
  • @YouliLuo - *this is in pre C++11 environment* -- [boost atomic](https://www.boost.org/doc/libs/1_79_0/libs/atomic/doc/html/index.html). This works for pre C++11 environments. – PaulMcKenzie Jul 23 '22 at 03:18
  • @PaulMcKenzie this is a low latency environment..we avoid boost – Youli Luo Jul 23 '22 at 03:19
  • 1
    @YouliLuo Well, boost comes with source code. You can see how they achieve this. Might as well get the answer you're looking for by seeing what boost does. – PaulMcKenzie Jul 23 '22 at 03:20
  • 2
    Short answer is "no". The behaviour you're seeking requires some element of synchronisation between threads that access variables in common or atomicity of operations, and `volatile` supports neither. All `volatile` does is tell the compiler that a variable may be modified in some way that is not visible to the compiler - and typically that affects ability to optimise/reorder code and instructions. It doesn't ensure things like (for example) threads that read a variable receiving meaningful values if they pre-empt a thread that is changing value of that variable. – Peter Jul 23 '22 at 03:21
  • 1
    And in general with boost, if the boost library has an implementation of something you are trying to implement yourself, nothing stops you from looking at how boost implements it, and then simply mimicking it or gaining better knowledge of how to achieve the goal. – PaulMcKenzie Jul 23 '22 at 03:24
  • 1
    @PaulMcKenzie he isn't asking about boost, or if the code is correct he is asking if the scenario works. He probably already knows about atomics. – Jesse Taube Jul 23 '22 at 03:26
  • I know he isn't talking about boost. What I am saying is to see how others have already achieved what he is attempting. – PaulMcKenzie Jul 23 '22 at 03:28
  • 1
    If you insist on avoiding standardized language features that are over a decade old, you can roll your own with GNU C `__atomic_load_n(&b, __ATOMIC_ACQUIRE)`. https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html. Or you can depend on hacks with `volatile` that happen to work for some compilers, but give you no control over memory ordering wrt. other operations, i.e. basically just `__ATOMIC_RELAXED`. Of course all of these will compile to the same x86-64 asm in *this* case, giving the same actual behaviour. – Peter Cordes Jul 23 '22 at 03:29
  • *so a==100 is definitely possible* - Sure, but if you enter the loop with `a` == any other value, the if is never true. So it's a really weird loop, IDK why that condition is part of it; you might as well just not run the loop at all if `a` isn't `100`. That's what I was getting at in my earlier comment. And all your loops are infinite; they don't leave the loop after seeing the condition they're looking for and updating `b`. – Peter Cordes Jul 23 '22 at 03:32
  • 2
    If `a` is only used in f2 why is it in the struct. – Jesse Taube Jul 23 '22 at 03:33
  • @JesseTaube lets say a needs persistence (in /dev/shm) and b is something shared between threads.. – Youli Luo Jul 23 '22 at 03:36
  • @PeterCordes. the infinite loop is just for illustration that a thread is running a loop and doing something through out the day until asked to shutdown...in each iteration it gets value for a from a local computation – Youli Luo Jul 23 '22 at 03:38
  • 1
    It's super confusing to figure out what your example is supposed to be showing if it doesn't include at least a comment *inside* the loop showing that's where `a` gets updated. Otherwise the obvious assumption is that it's a spin-wait loop that really is empty. Except buggy or badly-constructed example because it doesn't exit. So it's not illustrating what you intended. Also, `a` being non-shared data memory-mapped in /dev/shm doesn't explain why it would be in a struct with a `volatile bool` that *is* shared. – Peter Cordes Jul 23 '22 at 03:59
  • 1
    More importantly, why is `a` part of this example *at all*? Surely a real loop that checks `b` occasionally would also have lots of other variables to do its work, but you're not showing any of them either. If `a` isn't special in any way in terms of threading, don't show it. – Peter Cordes Jul 23 '22 at 04:01
  • @PeterCordes updated...so it looks like threads 2 and 3 should do something like atomic_load variable b as ACQUIRE - update b - atomic store variable b as RELEASE and b has to be volatile. so thread1 knows it is changed outside the thread – Youli Luo Jul 23 '22 at 04:11
  • @PeterCordes Due to quirks in the ARM design the delay before a write can become visible on another core can be forever. Thread1: `flag1 = 1; while(flag2 == 0) { }` Thread2: `flag2 = 1; while(flag1 == 0) { }` can stay stuck forever barring interrupts causing the write to be flushed. `volatile` only instructs the compiler to generate code to read/write a variable. It does not prevent the hardware from optimizing, reordering and buffering further. – Goswin von Brederlow Jul 23 '22 at 06:57
  • @GoswinvonBrederlow: Last time you made that claim, you were at least justifying the claim that a store might never leave the store buffer by saying one thread was continuously writing the same value, so it would keep coalescing, IIRC. Now your claim just seems implausible. AFAIK, all real CPUs try to commit stores from the store buffer if there are any outstanding, to make room for future bursts of stores. I'm not going to believe that's possible on any real CPU unless you can demonstrate it with an experiment. But we shouldn't debate this again here since this is an [x86-64] question. – Peter Cordes Jul 23 '22 at 07:24

3 Answers


The effect of the volatile keyword is principally two things (I avoid scientifically strict formulations here):

1) Accesses to it can't be cached or combined. (To be clear: this means caching in registers or another compiler-provided location, not the hardware cache in the CPU.) For example, the following code:

x = 1;
x = 2;

for a volatile x will never be combined into a single x = 2, whatever optimization level is in effect; but if x is not volatile, even low optimization levels will likely collapse it into a single write. The same goes for reads: each read operation will access the variable's value without any attempt to cache it.

2) All volatile operations are emitted at the machine-instruction level in the same order relative to one another (to underline: only relative to other volatile operations) as they appear in the source code.

But this is not true for the ordering between non-volatile and volatile accesses. For the following code:

int *x;
volatile int *vy;
void foo()
{
  *x = 1;
  *vy = 101;
  *x = 2;
  *vy = 102;
}

gcc (9.4) with -O2 and clang (10.0) with -O produce something similar to:

        movq    x(%rip), %rax
        movq    vy(%rip), %rcx
        movl    $101, (%rcx)
        movl    $2, (%rax)
        movl    $102, (%rcx)
        retq

so one store to x is already gone, despite its presence between two volatile accesses. If you need the first x = 1 to happen before the first write to vy, you must insert an explicit barrier (since C11, atomic_signal_fence is the platform-independent means for this).
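A sketch of that fix, using the C++ spelling std::atomic_signal_fence (a pure compiler barrier that emits no machine instructions; the globals are simplified from the pointer version above):

```cpp
#include <atomic>

int x;
volatile int vy;

void foo() {
    x = 1;
    // Compiler barrier: the store x = 1 may no longer be moved past,
    // or eliminated across, this point at compile time.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    vy = 101;
    x = 2;
    std::atomic_signal_fence(std::memory_order_seq_cst);
    vy = 102;
}
```

Note this constrains only the compiler; it does nothing about CPU-level reordering.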


That was the general rule, but without regard to multithreading issues. What happens with multithreading?

Well, imagine, as you describe, that thread 2 writes true to b: this is a write of the value 1 to a single-byte location. But it is an ordinary write, without any memory-ordering requirements. What volatile provides is that the compiler won't optimize it away. But what about the processor?

If this were a modern abstract processor, or one with relaxed rules like ARM, I'd say nothing prevents it from postponing the real write for an indefinite time. (To clarify, by "write" I mean exposing the operation to the RAM-and-all-caches conglomerate.) It is entirely up to the processor. Well, processors are designed to flush their backlog of pending writes as fast as possible, but what affects the real delay you can't know: for example, the core could "decide" to fill the instruction cache with a few more lines first, or flush other queued writes... there are lots of variants. The only thing we know is that it makes a "best effort" to flush all queued operations, to avoid getting buried under previous results. That is quite natural, and nothing more.

With x86, there is an additional factor. Nearly every memory write (and, I guess, this one as well) is a "releasing" write on x86, so all previous reads and writes must complete before it becomes visible. But the key fact is that it is only the operations *before* this write that are so constrained. So when you write true to the volatile b, you can be sure all previous operations have already become visible to other participants... but this write itself could still be postponed for a while... how long? Nanoseconds? Microseconds? Any later write to memory will push this write to b out along with it... do you have writes in each loop iteration of thread 2?

The same affects thread 3. You can't be sure this b = false will be published to other CPUs when you need it. The delay is unpredictable: unless this is realtime-aware hardware, a write can in principle stay pending for an indefinite time, and the ISA rules and barriers guarantee ordering, not exact timings. And x86 is definitely not such a realtime system.


Well, all this means you also need an explicit barrier after the write, one which constrains not only the compiler but the CPU as well: a barrier ordering the preceding write against the following reads and writes. Among C/C++ facilities, a full barrier satisfies this - so you have to add std::atomic_thread_fence(std::memory_order_seq_cst) or use an atomic variable (instead of the plain volatile one) with the same memory order for the write.
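As a sketch (the names are illustrative, not from the question), the std::atomic version of such a flag write, which on x86-64 compiles to a store plus a full barrier (or a single xchg):

```cpp
#include <atomic>

std::atomic<bool> flag{false};

// Writer (thread 2/3 role): the seq_cst store acts as a full barrier,
// so it cannot be reordered with surrounding reads or writes.
void publish(bool v) {
    flag.store(v, std::memory_order_seq_cst);
}

// Reader (thread 1 role): sees the store as soon as it commits to cache.
bool observe() {
    return flag.load(std::memory_order_seq_cst);
}
```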

And all this still won't give you the exact timings you described ("t" and "t+5"), because the visible "timestamps" of the same operation can differ between CPUs! (Well, this resembles Einstein's relativity a bit.) All you can say in this situation is that something is written into memory, and typically (not always) the inter-CPU order is what you expected (but an ordering violation will punish you).


But I can't catch the general idea of what you want to implement with this flag b. What do you want from it, what state should it reflect? Go back to the higher-level task and reformulate it. Is this (I'm just guessing on coffee grounds) a green light to do something, which can be cancelled by an external order? If so, an internal permission ("we are ready") from thread 2 must not overwrite that cancellation. This can be done using different approaches, such as:

1) Separate flags and a mutex/spinlock around setting them. Easy but a bit costly (or even substantially costly, I don't know your environment).

2) An atomically modified analogue. For example, you can use a bitmask variable which is modified using compare-and-swap. Assign bit 0 to "ready" and bit 1 to "cancelled". For C, atomic_compare_exchange_strong is what you'll need here on x86 (and on most other ISAs). And volatile is not needed anymore if you stay with memory_order_seq_cst.
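A minimal sketch of option 2 in C++ terms (compare_exchange_strong is the C++ counterpart of C's atomic_compare_exchange_strong; the names and bit assignments are illustrative):

```cpp
#include <atomic>
#include <cstdint>

constexpr std::uint8_t READY     = 1u << 0;  // set by the "permission" thread
constexpr std::uint8_t CANCELLED = 1u << 1;  // set by the external order

std::atomic<std::uint8_t> state{0};

// Set READY atomically, but only while CANCELLED is clear, so a late
// "we are ready" can never overwrite a cancellation.
bool try_set_ready() {
    std::uint8_t old = state.load();
    while (!(old & CANCELLED)) {
        if (state.compare_exchange_strong(old, old | READY))
            return true;            // READY set while CANCELLED was clear
        // CAS failure reloaded `old`; the loop re-checks CANCELLED
    }
    return false;                   // cancelled first: give up
}

void cancel() {
    state.fetch_or(CANCELLED);      // sticky: survives any concurrent READY
}
```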

Netch
  • @PeterCordes Right you are, I edited it a few times and got a wrong example after rechecking optimization levels. I have added just now example with `clang` and `-O2`, it really avoids the first write. – Netch Jul 25 '22 at 07:13
  • 1
    Ok, that makes more sense, yeah https://godbolt.org/z/TvnavhhTY confirms that GCC and clang -O2 do dead-store elimination on that source. (As well as eliminating one read each of the global vars, reusing the same pointers in registers because they can be sure that an `int` store doesn't modify an `int *` object.) (Correct, alias analysis isn't the important thing; that's why I deleted my previous comments.) – Peter Cordes Jul 25 '22 at 07:15
  • @PeterCordes I guess alias analysis isn't the source here, just my mistake from hurrying. Both gcc and clang drop first write with `-O2`. The code in initial version was from clang with `-Og` and I was confused when `lea` was dropped from each operation but not the first write. – Netch Jul 25 '22 at 07:16
  • When you say "can't be cached", it would be good to say "can't be cached *in registers*". There's a not-uncommon misconception that volatile bypasses (coherent) CPU caches, not just software caching. Probably from people misunderstanding what was meant from ambiguous explanations like this. The thing that's actually called a cache does *not* have to be bypassed or worked around. – Peter Cordes Jul 25 '22 at 07:18
  • @PeterCordes Thanks, adding note what caching is exactly meant. – Netch Jul 25 '22 at 07:23
  • 2
    *I was told that x86 CPUs flush any write in no more than 10 ns.* - That sounds like *best* case for draining the store buffer, if they all hit in L1d. As you say, x86 is strongly ordered, so a store can't commit until previous stores commit. If you just executed a bunch of scattered stores whose RFOs (Read For Ownership) will miss, another store after that will be waiting for at least DRAM access latency before it can commit. Longer if there are more cache-miss stores than LFBs to track incoming lines, or other demand for memory bandwidth, so the RFOs can't all run in parallel. – Peter Cordes Jul 25 '22 at 07:26
  • 2
    Contention from multiple other cores for ownership of a line could of course delay it further. I guess that 100 ns is probably enough for most stores to become visible (as a loose upper bound in cases without high contention). With contention, probably possible to get a delay like 1 microsecond if you really try to create a bad case with scattered stores. See [MESI Protocol & std::atomic - Does it ensure all writes are immediately visible to other threads?](https://stackoverflow.com/q/60292095). Perhaps 10 ns to retire a store? A cache-miss load could stall it. – Peter Cordes Jul 25 '22 at 07:30
  • @Netch Thanks..I am on x86_64/Linux...b is a bool that needs to be visible to another process which is latency sensitive..so no mutex/spinlocks...also on old C++ (pre-C++11)..so want to work with volatile and add memory barriers/fences as needed...the change made to b by thread2 or thread3 should be visible quickly to this low latency process and also ideally thread2/3 themselves should see the change quickly...also thread2's change to b should not be overwritten by thread3 or vice versa before it is visible to them (as long as they are scheduled running at that "time") – Youli Luo Jul 25 '22 at 14:36
  • @YouliLuo Before adding atomics to C++11, there were lots of 3rd party libraries for the same. You might start with Boost.Atomic. You might also implement own library by wrapping inline assembler with CMPXCHG. For proper use of volatile, check https://www.kernel.org/doc/Documentation/memory-barriers.txt and look into listed macros implementation. Well, to push all changes quickly you'll need calling MFENCE - also, via atomic library call or inline assembler. – Netch Jul 26 '22 at 19:08

Will thread1 see the latest value "in time" whenever it is reading b?

Yes. The volatile keyword denotes that the object can be modified outside of the thread, or by hardware, without the compiler being aware of it. Thus every access (both read and write) made through an lvalue expression of volatile-qualified type is considered an observable side effect for the purpose of optimization and is evaluated strictly according to the rules of the abstract machine (that is, all writes are completed at some time before the next sequence point). This means that, within a single thread of execution, a volatile access cannot be optimized out or reordered relative to another visible side effect that is separated from it by a sequence point.

Unfortunately, the volatile keyword is not thread-safe, and operations on it have to be handled with care; it is recommended to use atomics for this, unless you are in an embedded or bare-metal scenario.

Also, the whole struct should be made atomic: struct X {int a; volatile bool b;};.
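For pre-C++11 code (as the question's constraints require), compiler-specific atomics can stand in for std::atomic. A sketch using the GCC/Clang __atomic builtins (available since GCC 4.7; older compilers offer the __sync_* family instead; `flag` is an illustrative name, not from the question):

```cpp
// A plain bool accessed only through the __atomic builtins.
static bool flag;

bool read_flag() {
    // acquire: later reads/writes cannot be hoisted above this load
    return __atomic_load_n(&flag, __ATOMIC_ACQUIRE);
}

void write_flag(bool v) {
    // release: earlier reads/writes cannot sink below this store
    __atomic_store_n(&flag, v, __ATOMIC_RELEASE);
}
```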

Jesse Taube
  • this is pre c++11 environment, also only variable b is read or written by other threads. variable a is read/write by only one thread – Youli Luo Jul 23 '22 at 03:25
  • 2
    There are compiler-specific atomics. – Jesse Taube Jul 23 '22 at 03:31
  • 2
    "Volatile access cannot be optimized out or reordered relative to another" volatile access, but there is no such requirement for any non-volatile access, and reordering might happen. If ordering is needed between non-volatile and volatile accesses, compiler barrier (as `atomic_signal_fence`) shall be added. Please update your answer. – Netch Jul 25 '22 at 06:06

Say I have a system with 2 cores. The first core runs thread 2, the second core runs thread 3.

reads from t+delta to t+5+delta should read true and reads after t+5+delta should read false.

The problem is that thread 1 may not read until t + 10000000, when the kernel decides one of the other threads has run long enough and schedules a different thread. So it is likely that thread 1 will not see the change in time, much of the time.

Note: this ignores all the additional problems of synchronicity of caches and observability. If the thread isn't even running all of that becomes irrelevant.

Goswin von Brederlow
  • Let's say thread has a core dedicated and it is running all the time, then on x86_64 strong memory ordered system what happens – Youli Luo Jul 24 '22 at 05:26
  • something. Or something else. Who knows how long those functions take. The timing of each action is independent. Anything can happen. – Goswin von Brederlow Jul 24 '22 at 05:40