
I have a case where I need to ensure that an ISR cannot interact with a library when it is unloaded in an embedded system (ARM Cortex-M4 based).

The library can be loaded or deloaded at any time, and the interrupt / ISR can fire at any time. These library load / deload / process-ISR calls are wrapped inside project-specific routines, so I have full control over what is executed at the boundary.

The natural way to protect against this seems to be adding an "is loaded" flag (operated on with acquire / release orderings), and only calling the process-ISR library routine when I can be sure that the library is currently loaded.

However, the deload wrapper routine presents a problem: a release store (my initial thought) does not guarantee that the cleared "is loaded" flag becomes visible before the library is actually deloaded. C++ release semantics only prevent loads/stores *before* the atomic operation from being reordered after it; they say nothing about loads/stores that come after it. What I really want is a "store with acquire ordering" - the flag must be cleared and visible to the ISR before the library is deloaded (no loads/stores may be reordered to before the operation) - but the C++ standard seems to imply this isn't possible, at least with plain acquire / release.

I have explored the idea of using SeqCst operations / fences because of their additional total-ordering property, but I am not confident that I am using them correctly. Inspecting the assembly of the fence version, it appears correct, with DMB appearing in the right places; however, reading the standard's wording on atomic_thread_fence and Fence-fence synchronization, this seems like technically incorrect usage, since "FA is not sequenced before X in thread A".

Is someone able to walk through the process of figuring out whether this usage is correct or not (using C++ standard terms, i.e.: https://en.cppreference.com/w/cpp/atomic/memory_order, https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence)?

Equivalent code link: https://godbolt.org/z/v9nzqvjxa - this uses SeqCst fences.

The internals of the library code are irrelevant and can be treated as an opaque box.

Edit:

Copy of the equivalent code link above:

#include <atomic>
#include <thread>
#include <iostream>
#include <stdexcept>
#include <chrono>
#include <random>

/// Library code

bool non_atomic_state;

void load_library()
{
    if (!non_atomic_state)
        std::cout << "loaded" << std::endl;
    non_atomic_state = true;
}

void deload_library()
{
    if (non_atomic_state)
        std::cout << "deloaded library" << std::endl;
    non_atomic_state = false;
}

void library_isr_routine()
{
    if (!non_atomic_state)
        throw std::runtime_error("crash");
    std::cout << "library routine" << std::endl;
}

/// MCU project code

std::atomic<bool> loaded;

void mcu_library_isr()
{
    if (loaded.load(std::memory_order_relaxed))
    {
        std::atomic_thread_fence(std::memory_order_seq_cst);

        library_isr_routine();
    }
}

void mcu_load_library()
{
    load_library();

    std::atomic_thread_fence(std::memory_order_seq_cst);

    loaded.store(true, std::memory_order_relaxed);
}

void mcu_deload_library()
{
    loaded.store(false, std::memory_order_relaxed);

    std::atomic_thread_fence(std::memory_order_seq_cst);

    deload_library();
}

/// Test harness code

void t1()
{
    std::random_device rd;

    for (int i = 0; i < 10000; i++)
    {
        auto sleep_duration = std::chrono::milliseconds(rd() % 10);
        std::this_thread::sleep_for(sleep_duration);
        mcu_library_isr();
    }
}

void t2()
{
    std::random_device rd;

    for (int i = 0; i < 10000; i++)
    {
        auto random_value = rd();
        auto sleep_duration = std::chrono::milliseconds(random_value % 10);
        std::this_thread::sleep_for(sleep_duration);
        if (random_value % 2 == 0)
            mcu_load_library();
        else
            mcu_deload_library();
    }
}

int main()
{
    std::thread t1_handle(t1);
    std::thread t2_handle(t2);

    t1_handle.join();
    t2_handle.join();

    return 0;
}
marco9999
  • What happens if ISR is using the library and you unload it at the same time? Or is it somehow impossible (only one core, occupied by the ISR; or are other cores paused while ISR runs?). – HolyBlackCat Jan 31 '23 at 04:38
  • There is only one core in this MCU so it is impossible for the library to be deloaded in the ISR. However, compiler / runtime memory reordering is allowed, which is what this is really trying to protect against. In any case, I'm interested to how you would solve this problem in general. – marco9999 Jan 31 '23 at 04:49
  • On a single-core machine, at the machine level, memory barriers should be irrelevant, right? The interrupt arrives between a distinct pair of instructions. It sees the effect of all earlier instructions and of no later instructions. It's fine if the store buffer remains unflushed because the interrupt handler would get its loads from there just like the main code. It's when you have multiple cores that this becomes an issue, because one core *can't* see into another core's store buffer. – Nate Eldredge Jan 31 '23 at 04:53
  • 3
    What really matters here is getting a *compiler* barrier, so that the load and store instructions are actually emitted in program order. Once that's done, they will execute just fine. That's what `atomic_signal_fence` does, and I think it's what you want here. – Nate Eldredge Jan 31 '23 at 04:54
  • Yes correct that in a single core case, only compiler reorderings should matter. I am also interested in the general case where there could be a multi-core system. – marco9999 Jan 31 '23 at 05:00
  • Additionally, wouldn't `atomic_signal_fence` also have to adhere to the same requirements as `atomic_thread_fence`? At least cppreference makes it sound like that to me. – marco9999 Jan 31 '23 at 05:02
  • `atomic_signal_fence` is like `atomic_thread_fence`, but it only provides its guarantees within a single thread of execution. CPU hardware already guarantees that a single core preserves the illusion of executing in program order (you never need barriers to see your own stores, just store before the load). This is the cardinal rule of out-of-order exec and pipelining, part of the execution model of an ISA that's guaranteed on paper. This includes running the instructions of an ISR between two instructions of the code being interrupted. (Core vs. thread: cpu-migrations have to sync-with) – Peter Cordes Jan 31 '23 at 05:20
  • I do agree that `atomic_signal_fence(release)` is the way to go in my case - the brief description and the references to consult `atomic_thread_fence` on cppreference make me question it a bit, but I don't see another way. – marco9999 Jan 31 '23 at 06:20
  • If you have several cores, then I think what you want is the equivalent of a [`shared_mutex`](https://en.cppreference.com/w/cpp/thread/shared_mutex). Each ISR that wants to use the library will first `try_lock_shared`. On success, use the library and then unlock the mutex; on failure, don't use the library. Then when the main thread is ready to unload the library, take the lock exclusively. When you succeed, all ISR users are done and no more will start, so it is safe to unload. – Nate Eldredge Jan 31 '23 at 14:28
  • 2
    Oh no, wait, it doesn't quite work because you're not allowed to try a `shared_mutex` twice in the same thread. You would need a `recursive_shared_mutex` but standard C++ doesn't provide one. Still, you could implement your own equivalent of this. – Nate Eldredge Jan 31 '23 at 14:33
  • @NateEldredge: It would be nice to avoid having every core actually take a readers-writers lock every interrupt; the cache line holding it will have to bounce around a lot. Assuming load/unload is rare compared to interrupts, for max efficiency on a many-core system you'd maybe want to look at the same ideas as RCU to make sure there aren't still any cores running the library code after you change a bool to stop new threads from entering. (`run_on()` every core in sequence, which can't happen in the middle of an ISR.) So the ISR only reads a shared atomic bool, no extra writing – Peter Cordes Jan 31 '23 at 16:32

1 Answer


(More a point of discussion than a canonical answer)

Wouldn't an acquire-release operation with a subsequent control-flow dependency resolve all ambiguity?

Like this:

if(loaded.exchange(false, std::memory_order_acq_rel))
    deload_library();

Since it is an acquire-release operation, the deloading cannot be reordered before the exchange. As a bonus, it protects against two threads attempting to deload simultaneously, though not against a second thread loading while the deloading is still in progress.

Homer512
  • 2
    In fact, no, this does not work in general. The reason is subtle: an acq-rel atomic RMW such as `exchange` can actually be implemented as an acquire load followed by a release store, with the cache line held as exclusive in between so that `loaded` cannot be modified by another core. The load is acquire, so `deload_library` cannot be reordered before the load, but it *can* be reordered before the store, since release semantics don't forbid that. In effect, `deload_library` can happen *between* the load and the store. – Nate Eldredge Jan 31 '23 at 14:22
  • 2
    Nothing in the C++ memory model disallows this behavior, and it does in fact happen on real implementations. See https://stackoverflow.com/questions/65568185/for-purposes-of-ordering-is-atomic-read-modify-write-one-operation-or-two for a test case. – Nate Eldredge Jan 31 '23 at 14:22
  • @NateEldredge Oh, right, I think I knew this at some point. If I see it correctly, the only sane way is to use a proper acquire-release procedure, e.g. by changing the bool into a reference counter with the ISR incrementing the counter while in use. Would be necessary anyway in a multicore system since the unload could start while an ISR runs on a different core – Homer512 Jan 31 '23 at 19:25