
I've seen some flavors of this question around and I've seen mixed answers, but I'm still unsure whether they are up to date and fully apply to my use case, so I'll ask here. Do let me know if it's a duplicate!

Given that I'm developing for STM32 microcontrollers (bare-metal) using C++17 and the gcc-arm-none-eabi-9 toolchain:

Do I still need to use volatile for sharing data between an ISR and main()?

#include <cstdint>

volatile std::int32_t flag = 0;

extern "C" void ISR()
{
    flag = 1;
}

int main()
{
    while (!flag) { ... }
}

It's clear to me that I should always use volatile for accessing memory-mapped HW registers.

However, for the ISR use case I don't know whether it can be considered a case of "multithreading" or not. If it can, people recommend using C++11's new threading features (e.g. std::atomic). I'm aware of the difference between volatile (don't optimize) and atomic (safe access), so the answers suggesting std::atomic confuse me here.
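For reference, the std::atomic variant that those answers suggest would look like this (same flag, just using the atomic type instead of the volatile qualifier):

#include <atomic>
#include <cstdint>

std::atomic<std::int32_t> flag{0};

extern "C" void ISR()
{
    flag = 1;  // sequentially consistent store by default
}

int main()
{
    while (!flag) { /* ... */ }  // sequentially consistent load by default
}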

For the case of "real" multithreading on x86 systems I haven't seen the need to use volatile.

In other words: can the compiler know that flag can change inside ISR? If not, how can it know it in regular multithreaded applications?

Thanks!

user1011113
  • You have to use `volatile` to tell the compiler that, inside main, `flag` might get changed without the compiler noticing. `std::atomic` is also fine, but in this case it's not really needed. – HS2 Aug 18 '20 at 16:26
  • @HS2: When using clang/gcc, if one doesn't use either atomic or a clang/gcc "__asm" intrinsic, operations on the `volatile` data-ready flag might get reordered with respect to operations on the buffer the flag was being used to guard. – supercat Aug 18 '20 at 16:53
  • @supercat That’s right and reordering is not covered by the standard, only _sequential consistency_. But if I’m not wrong that wasn’t the original question. – HS2 Aug 19 '20 at 19:29
  • @supercat And yes, when it comes to, say, semaphore/mutex semantics, potential reordering, speculative execution and prefetching have to be taken into account. – HS2 Aug 19 '20 at 19:39
  • @HS2: On a single core system, when using a compiler that treats `volatile` as a global barrier to compiler reordering, `volatile` will work reliably for coordinating actions with ISRs. When using clang and gcc, `volatile` semantics are too weak to be suitable for that purpose without also using memory-clobber intrinsics. – supercat Aug 19 '20 at 19:59
  • There is also the standard `sig_atomic_t` which is `the (possibly volatile-qualified) integer type of an object that can be accessed as an atomic entity, even in the presence of asynchronous interrupts`. – KamilCuk Aug 20 '20 at 13:49
  • @KamilCuk: That tends to be of somewhat limited usefulness, since most implementations can offer semantic guarantees that are stronger than what the Standard requires, and many tasks would be impractical, if not outright impossible, without such guarantees. – supercat Aug 20 '20 at 20:14
  • *For the case of "real" multithreading on x86 systems I haven't seen the need to use volatile.* Huh? Your code with a `stop_running` flag is a textbook example of code that breaks with `-O2` with the flag-setting done from another thread. [Multithreading program stuck in optimized mode but runs normally in -O0](//stackoverflow.com/q/58516052) / [MCU programming - C++ O2 optimization breaks while loop](//electronics.stackexchange.com/a/387478) . You need `std::atomic` (optionally with `std::memory_order_relaxed`), or for sig/int handlers you can weaken that to `volatile sig_atomic_t` – Peter Cordes Jan 05 '23 at 12:52
  • Yes, I meant that I can use atomic instead of volatile (which I don't need anywhere in x86 user-level programming, as opposed to bare-metal programming) – user1011113 Jan 05 '23 at 16:11

4 Answers


I think that in this case both volatile and atomic will most likely work in practice on the 32-bit ARM. At least in an older version of the STM32 tools I saw that the C atomics were in fact implemented using volatile for small types.

Volatile will work because the compiler may not optimize away any access to the variable that appears in the code.

However, the situation differs for types that cannot be loaded in a single instruction. If you use a volatile int64_t, the compiler will happily load it in two separate instructions. If the ISR runs between loading the two halves of the variable, you will read half of the old value and half of the new value (a torn read).

Unfortunately, using atomic<int64_t> may also fail with interrupt service routines if the implementation is not lock-free. For Cortex-M, 64-bit accesses are not necessarily lock-free, so atomic should not be relied on without checking the implementation. Depending on the implementation, the system might deadlock if the locking mechanism is not reentrant and the interrupt happens while the lock is held. Since C++17, this can be queried by checking atomic<T>::is_always_lock_free. A specific answer for a specific atomic variable (this may depend on alignment) can be obtained by checking flagA.is_lock_free(), available since C++11.
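A sketch of those two checks (variable names are made up for illustration; both queries are standard C++):

#include <atomic>
#include <cstdint>

std::atomic<std::int32_t> flag32{0};
std::atomic<std::int64_t> wide{0};

// C++17: compile-time guarantee for the type as a whole.
static_assert(std::atomic<std::int32_t>::is_always_lock_free,
              "32-bit atomics must be lock-free for ISR use");

bool wideIsSafeForIsr()
{
    // C++11: per-object query; the answer may depend on alignment.
    return wide.is_lock_free();  // typically false on Cortex-M
}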

So longer data must be protected by a separate mechanism (for example by turning off interrupts around the access and making the variable atomic or volatile).
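For illustration, a minimal sketch of such a critical section, assuming the CMSIS intrinsics (__get_PRIMASK, __disable_irq, __enable_irq) that STM32 device headers provide:

#include <cstdint>
// Assumes the CMSIS device header (e.g. "stm32f4xx.h") is included,
// which provides __get_PRIMASK(), __disable_irq() and __enable_irq().

volatile std::int64_t sharedValue;  // too wide for a single Cortex-M load

std::int64_t readShared()
{
    std::uint32_t primask = __get_PRIMASK();  // remember interrupt state
    __disable_irq();                          // no ISR can run from here...
    std::int64_t copy = sharedValue;          // ...so the two loads can't be torn
    if (primask == 0U) {
        __enable_irq();                       // restore only if they were enabled
    }
    return copy;
}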

So the correct way is to use std::atomic, as long as the access is lock free. If you are concerned about performance, it may pay off to select the appropriate memory order and stick to values that can be loaded in a single instruction.

Not using either would be wrong: the compiler will check the flag only once.

These functions all wait for a flag, but they get translated differently:

#include <atomic>
#include <cstdint>

using FlagT = std::int32_t;

volatile FlagT flag = 0;
void waitV()
{
    while (!flag) {}
}

std::atomic<FlagT> flagA{0};
void waitA()
{
    while(!flagA) {}    
}

void waitRelaxed()
{
    while(!flagA.load(std::memory_order_relaxed)) {}    
}

FlagT wrongFlag;
void waitWrong()
{
    while(!wrongFlag) {}
}

Using volatile you get a loop that reexamines the flag as you wanted:

waitV():
        ldr     r2, .L5
.L2:
        ldr     r3, [r2]
        cmp     r3, #0
        beq     .L2
        bx      lr
.L5:
        .word   .LANCHOR0

Atomic with the default sequentially consistent access produces synchronized access:

waitA():
        push    {r4, lr}
.L8:
        bl      __sync_synchronize
        ldr     r3, .L11
        ldr     r4, [r3, #4]
        bl      __sync_synchronize
        cmp     r4, #0
        beq     .L8
        pop     {r4}
        pop     {r0}
        bx      r0
.L11:
        .word   .LANCHOR0

If you do not care about the memory order you get a working loop just as with volatile:

waitRelaxed():
        ldr     r2, .L17
.L14:
        ldr     r3, [r2, #4]
        cmp     r3, #0
        beq     .L14
        bx      lr
.L17:
        .word   .LANCHOR0

Using neither volatile nor atomic will bite you with optimization enabled, as the flag is only checked once:

waitWrong():
        ldr     r3, .L24
        ldr     r3, [r3, #8]
        cmp     r3, #0
        bne     .L23
.L22:                        // infinite loop!
        b       .L22
.L23:
        bx      lr
.L24:
        .word   .LANCHOR0
flag:
flagA:
wrongFlag:
PaulR
  • Interesting answer, but can the Godbolt gcc ARM compiler be trusted to generate the same code as gcc "ARM none EABI" used by STM32 bare metal tool chains? – Lundin Aug 19 '20 at 08:27
  • You can and should always run your-gcc -S to see the actual assembly output, or disassemble with objdump. Note also that your compilation for STM32 probably contains a significant number of additional flags, I just added what I could remember on the spot. The point is, with atomic the compiler must make sure that concurrent access works, with volatile the guarantee is different – PaulR Aug 19 '20 at 09:54
  • If a platform has no natural way of handling 64-bit operations atomically, an implementation's "atomic" features are unlikely to work reliably in conjunction with interrupts unless they can save the interrupt state, disable interrupts, perform the operation, and restore the interrupt state. If temporarily disabling interrupts would be acceptable, user code should be able to do that without need for an implementation's "atomic" features, and use the resulting semantics to do various things more easily than would be possible with "atomic". – supercat Aug 19 '20 at 20:06
  • I agree that `atomic` is the way to go, but I see your argument about `volatile int64_t` as false - you cannot use `atomic` either (if its `is_lock_free()` is false). That would either use a mutex (blocking the IRQ/ISR indefinitely) or LL-SC (which is a bad idea in an IRQ because LL-SC typically cannot be nested; you break the logic if you do it). – firda Aug 20 '20 at 14:02
  • @firda: I don't think the Standard makes clear whether atomics that use ll/sc of the target type are supposed to indicate `is_lock_free()`. It's generally not possible for a compiler to guarantee that such operations will be technically lock free, but in practice they can often be guaranteed to make progress if a system ever manages to execute more than a few instructions between interrupts. For many purposes what's more important are that operations be *obstruction free*, and that they use the same locking mechanism as anything else on the system that needs to be atomic. – supercat Aug 20 '20 at 19:49
  • @supercat: LL-SC (LDREX/STREX) is a spinlock, that is not lock-free. You either use `atomic_flag`, which is the only thing guaranteed to work in an ISR, or you need to make it platform-specific. There I bet on `atomic_int` when needed, because `volatile` may not be enough and a *memory clobber* may not be enough (you may need `DMB` or `DSB` instructions), so `atomic` either does it right or it is simply not possible at all. (And you can add some `static_assert` or use [`ATOMIC_INT_LOCK_FREE`](https://en.cppreference.com/w/c/atomic/ATOMIC_LOCK_FREE_consts).) – firda Aug 22 '20 at 07:43
  • @firda: In many systems, the circumstances necessary to cause LL/SC to live-lock could never occur, though an implementation may have no way of knowing that. What's needed is a way for someone who knows the semantics of the underlying platform to have a consistent compiler-independent way of indicating those in the language--something for which C used to be good but has gotten progressively worse as compiler writers have lost sight of the fact that what made C useful was not the anemic abstraction model of the standard, but that the language could adapt to many abstraction models. – supercat Aug 22 '20 at 08:08
  • @firda: If one can't use `is_lock_free` to determine whether an implementation claims to use a platform's native semantics for atomic operations, what means should one use? Whether one needs a DMB or DSB depends upon the core and whether one is interacting with interrupts or with things like DMA that can alter memory without the core's involvement. Programmers will often know such things when compilers can't. – supercat Aug 22 '20 at 08:15
  • @supercat: read this https://en.cppreference.com/w/cpp/atomic/atomic/is_always_lock_free and this https://en.cppreference.com/w/c/atomic/ATOMIC_LOCK_FREE_consts Practically you either find lock-free solution or you have to disable interrupts. (And about your *In many systems, ...* not true for STM32 in question, you must use STREX with same address as last LDREX or you break the contract = UB = never do that in ISR). – firda Aug 22 '20 at 08:27
  • @firda: On the Cortex-M3, if an interrupt context switch occurs between an LDREX and STREX, it is guaranteed to invalidate the pending LDREX, so a subsequent STREX will report failure. If the time between an LDREX and STREX is sufficiently long that an interrupt will always occur between them, the STREX will never succeed, but if there ever will be a long enough time without interrupts, the LDREX/STREX loop will run until then. – supercat Aug 22 '20 at 09:21
  • @supercat: https://static.docs.arm.com/dui0553/a/DUI0553A_cortex_m4_dgug.pdf - page 83: *The result of executing a Store-Exclusive instruction to an address that is different from that used in the preceding Load-Exclusive instruction is unpredictable.* – firda Aug 22 '20 at 10:09
  • @supercat: P.S.: I see no real reason why the HW would not remember the last address used and make STREX fail if used with a different one, but that document states otherwise. I see no way to even implement thread-switching correctly if STREX were so broken. I would love it to work properly, but... I've seen HW not doing what you would expect way too often. Anyway, if you have a better document, please share. Otherwise we should either move to chat, or leave this topic open. – firda Aug 22 '20 at 10:40
  • See https://developer.arm.com/documentation/dui0552/a/the-cortex-m3-processor/memory-model/synchronization-primitives?lang=en for information about ldrex/strex. Note in particular that processing an exception (interrupt) clears the exclusive-access flag, so on a Cortex-M3 the basic effect of "strex" is "perform the store unless an interrupt has occurred since the ldrex". BTW, I find myself curious why strex doesn't set flags, since code is almost certainly going to be interested in branching on whether it succeeded or failed. – supercat Aug 22 '20 at 17:58
  • Very interesting discussion, I now realize I have a huge lot more to learn on the topic! Thanks a lot :) I greatly appreciate the example and the methodology - inspect the assembly to be 100% sure. I wanted to know mostly whether modern C++ compilers in 2020 would have already figured this out, but it turns out they haven't (perhaps they never will?). For thread-safety I'll probably go for disable/enable interrupts for now, since I actually want to read arrays instead of 32-bit flags. @Lundin Godbolt does support arm-none-eabi :) https://godbolt.org/z/hdxz4b – user1011113 Aug 22 '20 at 22:14
  • @firda: There are a couple approaches a system can use for implementing something like LDREX/STREX: watch the address and make the STREX fail if anything happens to it, or else watch for anything "suspicious" happening and make the STREX fail if it does. The latter approach is simpler and easier to implement, but would work extremely poorly, if not unusably, in a multi-core system. A difference I don't remember whether the Cortex documentation mentioned is that when using the former approach, something like ... – supercat Aug 24 '20 at 14:45
  • `ldrex r1,[r0] / str r1,[r2] / strex r2,r1,[r0]` would result in the `strex` reporting failure if `r0` and `r2` are equal (because of the store to the r0/r2 address between the `ldrex` and `strex`) but when using the latter approach the `strex` would likely overwrite the value written by the `str` (unless an interrupt happened to occur between the ldrex/strex). – supercat Aug 24 '20 at 14:52
  • @firda: Of course, that leaves open the question of whether any/all versions of clang/gcc would refrain from other memory operations across `ldrex` and `strex`. If e.g. code 'ldrex'es a list head pointer, stores it into a new list item's "next" pointer, and then 'strex'es the list head pointer to the new item, having a compiler defer the update of the list item's "next" pointer past the strex could result in a wrong "next" pointer being read from the new item. – supercat Aug 24 '20 at 14:57
  • @supercat: Was searching a bit more and 1. I can confirm that `clrex` is auto-executed when interrupted (making following `strex` fail, making task-switching possible), but 2. any memory access between the two can lead to problems and unexpected/undefined behaviour (Exclusives Reservation Granule), leaving only one reliable usage for these - spinlocks (CAS/RMW). And that leads us back to the ISR deadlock (mutex-lock in ISR). So again: either lock-free atomics (`atomic_flag` especially) or disabling interrupts. Nothing else is reliable (in general, vendors can give better guarantees). – firda Aug 26 '20 at 08:05

Of the commercial compilers I've tested that weren't based on gcc or clang, all of them would treat a read or write via volatile pointer or lvalue as being capable of accessing any other object, without regard for whether it would seem possible for the pointer or lvalue to hit the object in question. Some, such as MSVC, formally documented the fact that volatile writes have release semantics and volatile reads have acquire semantics, while others would require a read/write pair to achieve acquire semantics.

Such semantics make it possible to use volatile objects to build a mutex that can guard "ordinary" objects on systems with a strong memory model (including single-core systems with interrupts), or on compilers that apply acquire/release barriers at the hardware memory ordering level rather than merely the compiler ordering level.
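As a sketch of what such semantics permit (assuming the MSVC-documented behaviour above, i.e. volatile write = release and volatile read = acquire, on a strong-memory-model or single-core system; this is not safe on plain clang/gcc):

int shared_data;          // "ordinary" object guarded by the flag
volatile int data_ready;  // flag relying on acquire/release volatile semantics

void producer(void)
{
    shared_data = 42;     // ordinary store...
    data_ready = 1;       // ...published by the release-style volatile write
}

void consumer(void)
{
    while (!data_ready) {}     // acquire-style volatile read
    int value = shared_data;   // sees 42 under the stated semantics
    (void)value;
}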

Neither clang nor gcc, however, offers any option other than -O0 that provides such semantics, since those semantics would impede "optimizations" that would otherwise be able to convert code that performs seemingly-redundant loads and stores [that are actually needed for correct operation] into "more efficient" code [that doesn't work]. To make one's code usable with those compilers, I would recommend defining a 'memory clobber' macro (which for clang or gcc would be asm volatile ("" ::: "memory");) and invoking it between the action which needs to precede a volatile write and the write itself, or between a volatile read and the first action which would need to follow it. If one does that, one's code can be readily adapted to implementations that would neither support nor require such barriers, simply by defining the macro as an empty expansion.

Note that while some compilers interpret all asm directives as a memory clobber (and there wouldn't be any other purpose for an empty asm directive), gcc simply ignores empty asm directives rather than interpreting them in that fashion.
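A sketch of such a macro (the name is made up; the empty expansion is for implementations whose volatile semantics already provide the needed ordering):

// Hypothetical portability macro: a compiler-level barrier on clang/gcc,
// a no-op on implementations that neither support nor require it.
#if defined(__GNUC__)
#define MEMORY_CLOBBER() __asm__ volatile("" ::: "memory")
#else
#define MEMORY_CLOBBER() ((void)0)
#endif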

An example of a situation where gcc's optimizations would prove problematic (clang seems to handle this particular case correctly, but some others still pose problems):

short buffer[10];
short volatile *volatile tx_ptr;   /* volatile pointer to volatile data */
volatile int tx_count;
void test(void)
{
    buffer[0] = 1;
    tx_ptr = buffer;
    tx_count = 1;
    while(tx_count)
        ;
    buffer[0] = 2;
    tx_ptr = buffer;
    tx_count = 1;
    while(tx_count)
        ;
}

GCC will decide to optimize out the assignment buffer[0]=1; because the Standard doesn't require it to recognize that storing the buffer's address into a volatile might have side effects that would interact with the value stored there.
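With a clobber macro like the one sketched above, the example can be hardened; the barriers tell gcc that the buffer stores may be observed, so it must keep them:

void test_fixed(void)
{
    buffer[0] = 1;
    MEMORY_CLOBBER();   /* store to buffer must complete before the volatile writes */
    tx_ptr = buffer;
    tx_count = 1;
    while (tx_count)
        ;
    MEMORY_CLOBBER();   /* nothing may be moved ahead of the volatile reads above */
    buffer[0] = 2;
    MEMORY_CLOBBER();
    tx_ptr = buffer;
    tx_count = 1;
    while (tx_count)
        ;
}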

[edit: further experimentation shows that icc will reorder accesses to volatile objects, but since it reorders them even with respect to each other, I'm not sure what to make of that; such behavior would seem broken by any imaginable interpretation of the Standard.]

supercat

To understand the issue, you must first understand why volatile is needed in the first place.

There are three completely separate issues here:

  1. Incorrect optimizations because the compiler doesn't realize that hardware callbacks such as ISRs are actually called.

    Solution: volatile or compiler awareness.

  2. Re-entrancy and race condition bugs caused by accessing a variable in several instructions and getting interrupted in the middle of it by an ISR using the same variable.

    Solution: protected or atomic access with mutex, _Atomic, disabled interrupts etc.

  3. Parallelism or pre-fetch cache bugs caused by instruction re-ordering, multi-core execution, branch prediction.

    Solution: memory barriers or allocation/execution in memory areas that aren't cached. volatile access may or may not act as a memory barrier on some systems.

As soon as someone brings this kind of question up on SO, you always get lots of PC programmers babbling about 2 and 3 without knowing or understanding anything about 1. This is because they have never in their life written an ISR, and because PC compilers with multi-threading support are generally aware that thread callbacks will get executed, so this isn't typically an issue in PC programs.

What you need to do to solve 1) in your case is to check whether the compiler actually generates code for reading while (!flag), with or without optimizations enabled. Disassemble and check.

Ideally, the compiler documentation will state that the compiler understands the meaning of some compiler-specific extension, such as the non-standard keyword interrupt, and that upon spotting it the compiler makes no assumptions about that function not getting called.

Sadly though, most compilers only use interrupt etc. keywords to generate the right calling convention and return instructions. I encountered the missing-volatile bug just a few weeks ago, while helping someone on a SE site who was using a modern ARM toolchain. So I don't trust compilers to handle this still, in the year 2020, unless they explicitly document it. When in doubt use volatile.

Regarding 2) and re-entrancy, modern compilers do support _Atomic nowadays, which makes things very easy. Use it if it's available and reliable on your compiler. Otherwise, for most bare metal systems you can utilize the fact that interrupts are non-interruptible and use a plain bool as a "mutex lite" (example; a sketch follows below), as long as there is no instruction re-ordering (an unlikely case for most MCUs).
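A minimal sketch of that pattern, under the assumptions stated above (single core, the ISR preempts main() but not vice versa, no reordering); the names are made up for illustration:

#include <cstdint>

volatile bool locked = false;              // the "mutex lite"
volatile std::uint32_t sharedCounter = 0;  // multi-step shared data

extern "C" void ISR()
{
    if (!locked) {       // main() is not in the middle of an access
        ++sharedCounter;
    }
    // else: skip or defer the update; the ISR must never spin on the flag,
    // since main() cannot run (and unlock) while the ISR executes.
}

int main()
{
    for (;;) {
        locked = true;   // volatile accesses aren't reordered w.r.t. each other
        std::uint32_t snapshot = sharedCounter;  // safe multi-instruction access
        locked = false;
        (void)snapshot;
    }
}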

But please note that 2) is a separate issue not related to volatile. volatile does not solve thread-safe access. Thread-safe access does not solve incorrect optimizations. So don't mix these two unrelated concepts up in the same mess, as often seen on SO.

Lundin
  • Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/220277/discussion-on-answer-by-lundin-should-volatile-still-be-used-for-sharing-data-wi). – Jean-François Fabre Aug 22 '20 at 19:51

Short answer: always use std::atomic<T> whose is_lock_free() returns true.

Reasoning:

  1. volatile can work reliably on simple architectures (single-core, no cache, ARM/Cortex-M) like STM32F2 or ATSAMG55 with e.g. the IAR compiler. But...
  2. It may fail to work as expected on more complex architectures (multi-core with cache) and when the compiler tries to do certain optimisations (many examples in other answers, won't repeat that).
  3. atomic_flag and atomic_int (if is_lock_free(), which they should be) are safe to use anywhere, because they work like volatile with added memory barriers / synchronization when needed (avoiding the problems in the previous point).
  4. The reason I specifically said you should only use those whose is_lock_free() is true is that you cannot stop an IRQ the way you could stop a thread. An IRQ interrupts the main loop, does its job, and returns; it cannot wait-lock on a mutex, because it would be blocking the very main loop it would be waiting for.

Practical note: I personally either use atomic_flag (the one and only thing guaranteed to work) to implement a sort of spin-lock, where the ISR will disable itself when it finds the lock locked, while the main loop will always re-enable the ISR after unlocking; or I use a double-counter lock-free queue (SPSC - single producer, single consumer) using that atomic_int. (Have one reader-counter and one writer-counter, subtract to find the real count. Good for UART etc. A sketch follows below.)
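For illustration, a sketch of that double-counter queue (the names, memory orders and power-of-two capacity are my own choices here, not a fixed recipe):

#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumes exactly one producer (e.g. a UART RX ISR) and one consumer
// (the main loop). The counters run free and wrap; unsigned subtraction
// still yields the real element count.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "capacity must be a power of two");
    T buffer_[N];
    std::atomic<std::uint32_t> writeCount_{0};  // only the producer writes this
    std::atomic<std::uint32_t> readCount_{0};   // only the consumer writes this

public:
    bool push(const T& item)  // called from the ISR (producer)
    {
        std::uint32_t w = writeCount_.load(std::memory_order_relaxed);
        std::uint32_t r = readCount_.load(std::memory_order_acquire);
        if (w - r == N) return false;            // full
        buffer_[w % N] = item;
        writeCount_.store(w + 1, std::memory_order_release);  // publish
        return true;
    }

    bool pop(T& item)  // called from main() (consumer)
    {
        std::uint32_t r = readCount_.load(std::memory_order_relaxed);
        std::uint32_t w = writeCount_.load(std::memory_order_acquire);
        if (w == r) return false;                // empty
        item = buffer_[r % N];
        readCount_.store(r + 1, std::memory_order_release);   // release slot
        return true;
    }
};

static_assert(std::atomic<std::uint32_t>::is_always_lock_free,
              "counters must be lock-free for ISR use");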

firda