3

I am trying to read three variables a, b, c atomically at once. The pattern looks something like the code below.

_Atomic uint32_t a, b, c;

void thread_high_priority(void)
{
  atomic_fetch_sub_explicit(&a, 1, memory_order_relaxed);
  atomic_fetch_add_explicit(&b, 1, memory_order_relaxed);
  atomic_fetch_sub_explicit(&c, 1, memory_order_relaxed);
}

void thread_low_priority(void)
{
  uint32_t _a = a;
  uint32_t _b = b;
  uint32_t _c = c;
}

thread_high_priority is a thread running in high priority and thread_low_priority running in low priority. thread_high_priority can interrupt the execution of thread_low_priority, but not the other way. That is, thread_high_priority will always run uninterruptedly.

The constraint is that thread_high_priority is time-critical. Therefore, I don't want to use a mutex to block as it is time-consuming and even causes deadlock. Is there a way to make sure all three variables are read at once without interruption?

Edit: The platform is ARMv7M architecture running in baremetal environment.

Taiyou Kuo
  • 31
  • 6
  • Portably, you can't do that in a lock-free way. What platform are you programming for? Does it have atomic 16-byte loads, like for example an Intel CPU with AVX, or some AArch64 CPUs with atomic aligned `ldp`? If so, then one option is to put your variables in a 16-byte-aligned `union { struct foo { _Atomic uint32_t a,b,c; } together; _Atomic struct { uint32_t a,b,c;} separate;}` or something. Similar to [How can I implement ABA counter with c++11 CAS?](https://stackoverflow.com/q/38984153) for efficient access to one at a time. – Peter Cordes Jun 24 '22 at 05:21
  • @PeterCordes My platform is ARMv7-M architecture. I read the reference that it says a 4-byte read can be atomic, but not for data types larger than 4 bytes. – Taiyou Kuo Jun 24 '22 at 05:27
  • I guess portably you *can* just always use an `_Atomic struct{...};`, and CAS in the updated values for RMWs, but it stops you from getting efficient access to individual members. (Manual padding to 16 bytes may help some implementations decide to make it lock-free). Depending on the ISA, it might be much less efficient to read just one member. – Peter Cordes Jun 24 '22 at 05:28
  • If your hardware only supports 4-byte atomicity, you're going to need to make your variables narrower or redesign things, e.g. perhaps indirection via a pointer like a queue, or something like RCU. (If you only need atomicity wrt. interrupts because of a single-core system, you might be able to use `ldm` / `stm`. But if you need RMW atomicity that doesn't work.) – Peter Cordes Jun 24 '22 at 05:30
  • @PeterCordes Thanks for your comment, I'll look into CAS and see how I can use it in my code. – Taiyou Kuo Jun 24 '22 at 05:39
  • 1
    You don't have lock-free 16-byte `atomic_compare_exchange_weak` on ARMv7-M. Only 4-byte `ldrex`/`strex`, apparently not even `ldrexd`/`strexd` 2-register. That comment was a follow-up my first comment that I was finishing just as you commented, not a reply to yours. – Peter Cordes Jun 24 '22 at 05:43
  • Are these variables only ever written by one high-priority thread? If so, you could use a SeqLock to let other threads detect tearing if it happened, and make it even cheaper for the high-priority thread to update them. (If it's the only write, no atomic RMWs needed, just loads and stores.) If there can be multiple writers, a seqlock needs a critical section for mutual exclusion between writers, i.e. a lock. (Readers are still lock-free, but you care about the writer.) – Peter Cordes Jun 24 '22 at 06:27
  • @PeterCordes Sorry for not specifying more clearly, this is a developing project and the variables will also be modified in the low-priority context. I will use `atomic_fetch_*` before or after reading these three variables. So there are only two threads, one that only decrements or increments the variables (the high-priority context), and the other reads and also modifies them (the low-priority context). – Taiyou Kuo Jun 24 '22 at 07:38
  • Ok, then you can't use a SeqLock. But RCU-style copying and changing a pointer could still work, if you CAS (`compare_exchange_weak`) on the pointer, like Tom's answer suggested. – Peter Cordes Jun 24 '22 at 07:44
  • BTW, your recent deleted answer was like half of a SeqLock. A write before and after is necessary on the write side, and same on the read side, read before and after. A flag can't work, a sequence counter can. Having the read side also write doesn't help anything. See [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) for an example of a seqlock with C++ std::atomic, can translate easily to C stdatomic. – Peter Cordes Jun 24 '22 at 20:58
  • @PeterCordes I immediately think that there was a mistake in my answer. Thanks, I will look into it. – Taiyou Kuo Jun 24 '22 at 21:08

2 Answers2

2

You can solve this problem with a level of indirection.

As long as there is only one writer, you could do it like this:

  • Put the set of data items in a struct
  • Allocate several such structs
  • Write non-atomically to the members of a struct which the readers are not using
  • Atomically change a pointer to which struct the reader should use

The reader should read the pointer then access the data in the respective struct.

If it is possible that another interrupt occurs while the main context is still reading then you need to keep a pointer to which struct the reader is using, and the writer can check this before filling out the struct. Accessing this second pointer atomically is easier if there is only one reader.

To smooth things out you can allocate three or more structs, and treat them as a ring buffer.

Tom V
  • 4,827
  • 2
  • 5
  • 22
  • Thanks for the approach, I will try to implement it in my code. – Taiyou Kuo Jun 24 '22 at 07:54
  • You're welcome! Can you please accept the answer with the tick mark on the left? – Tom V Jun 24 '22 at 11:59
  • "Write non-atomically to the members of a struct which the readers are not using": Note that this will requires some mechanism for the reader(s) to indicate when they are finished with a struct (including appropriate memory barriers). Otherwise you can never be sure whether some very slow reader is still using the struct you want to reuse or deallocate. – Nate Eldredge Jun 25 '22 at 18:58
  • @NateEldredge, yes I think I already covered that "If it is possible that another interrupt occurs while the main context is still reading then you need to keep a pointer to which struct the reader is using, and the writer can check this before filling out the struct". – Tom V Jun 25 '22 at 19:35
  • Sequence numbers in each bucket can maybe be useful. Perhaps for SeqLock style tearing detection, or like a proper queue that can detect when it's full. ([Lock-free Progress Guarantees in a circular buffer queue](https://stackoverflow.com/q/45907210) shows a queue with sequence numbers in the buckets.) – Peter Cordes Jun 26 '22 at 04:35
-1

I also come up with another solution that based on SeqLock. After knowing that what I tried to achieve is essentially tear-detection, I rewrite it using a SeqLock template. I still define my three variables a, b, c as _Atomic uint32_t since I also want to modify them in thread_low_priority using atomic_fetch_*.

On ARMv7-M archiecture RMW atomic operations are implement using ldrex/strex. The compiler will issue a loop to check whether strex success or not. In my case, it could be a problem when using RMW operations because thread_high_priority needs to be fast and run uninterruptedly. I currently don't know if there is a case where strex always failed in the thread_high_priority context that could cause deadlock.

_Atomic uint32_t a, b, c;
atomic_uint seqcount = 0;

void thread_high_priority(void)
{
  uint32_t _a, _b, _c;
  
  uint orig_cnt = atomic_load_explicit(&seqcount, memory_order_relaxed);

  atomic_store_explicit(&seqcount, orig_cnt + 1, memory_order_relaxed);
  atomic_thread_fence(memory_order_release);

  _a = atomic_load_explicit(&a, memory_order_relaxed);
  _b = atomic_load_explicit(&b, memory_order_relaxed);
  _c = atomic_load_explicit(&c, memory_order_relaxed);
  atomic_store_explicit(&a, _a - 1, memory_order_relaxed);
  atomic_store_explicit(&b, _b + 1, memory_order_relaxed);
  atomic_store_explicit(&c, _c - 1, memory_order_relaxed);

  atomic_store_explicit(&seqcount, orig_cnt + 2, memory_order_release);
}

void thread_low_priority(void)
{
  uint32_t _a, _b, _c;
  
  uint c0, c1;
  do {
    c0 = atomic_load_explicit(&seqcount, memory_order_acquire);

    _a = atomic_load_explicit(&a, memory_order_relaxed);
    _b = atomic_load_explicit(&b, memory_order_relaxed);
    _c = atomic_load_explicit(&c, memory_order_relaxed);

    c1 = atomic_load_explicit(&seqcount, memory_order_acquire);
  } while (c0 & 1 || c0 != c1);
}

Edit: Again after checking the output from compiler, I slightly modify my code in thread_high_priority. Compile using ARM gcc 10.3.1 (2021.10 none) with compilation flag -O1 -mcpu=cortex-m3 -std=gnu18 -mthumb.

In my original code, dmb ish is issued before the store as shown below.

atomic_store_explicit(&seqcount, orig_cnt + 1, memory_order_release);
--->
        adds    r1, r2, #1
        dmb     ish
        str     r1, [r3]

After I separate the memory barrier from store, dmb ish is issued after store, so that the update of seqcount is visible before updating a, b, c.

atomic_store_explicit(&seqcount, orig_cnt + 1, memory_order_relaxed);
atomic_thread_fence(memory_order_release);
-->
        adds    r1, r2, #1
        str     r1, [r3]
        dmb     ish
Taiyou Kuo
  • 31
  • 6
  • 1
    This looks similar to a SeqLock but unsafe. The reader can start reading before the writer finishes writing. (See [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) for a C++ seqlock, easily translates to C.) – Peter Cordes Jun 24 '22 at 21:28
  • Also, this leaves the reader spinning for a long time, and it spins on two different conditions. The while loop is waiting for `aflag` to be `true`. If the high priority thread has run at some time between reads, it will wait forever because nothing else sets aflag. So that seems pointless, no better than your first version. It could read a, then have the high-prio thread clear the flag and update a and b before the reader gets to b or c. Then it sees aflag was cleared and leaves the loop, with an old `a` and a new `b`, and `c`. – Peter Cordes Jun 24 '22 at 21:31
  • Is this supposed to be spinning until there's a new value? A SeqLock can do that if you check for a new sequence number. But you said `thread_low_priority` can write to a, b, or c at some point, and this doesn't even try to handle that. If you want something like this but not broken, use a SeqLock. – Peter Cordes Jun 24 '22 at 21:31
  • Probably my logic was wrong, I want to check `aflag` is still `true`. Which means `aflag` do not change after reading `a, b, c`, and not for checking new value coming in. I can understand that it might not be lock-free for `thread_low_priority` as it could block forever. – Taiyou Kuo Jun 24 '22 at 21:49
  • 3
    This pattern of lock-free writer, with tearing-detection and retry in the reader, is exactly what a SeqLock does. Just use that instead of attempting a buggy re-invention of it. Writing an `_Atomic uint32_t` is not more expensive than an `atomic_flag`, and the writer can be write-only with the reader read-only. Maybe if you say what you're trying to gain with this vs. a normal SeqLock, I could give some suggestion if a SeqLock doesn't already fully solve your problem. – Peter Cordes Jun 24 '22 at 21:52
  • Yes, your second version looks like a SeqLock, except you're missing `atomic_thread_fence(memory_order_release)` between the store of `seqcount` and the store of `a`,`b`,and `c`. (Or make them all `release`, but that would probably be more expensive on ARM32). You need to make sure that no payload updates are visible before the first sequence update. – Peter Cordes Jun 28 '22 at 02:09
  • If `thread_high_priority` is the only writer, you don't need atomic RMWs so it's better to do it this way, with separate load and store. That would also make it safe to do all 4 loads (including the sequence counter) before the first store of the sequence counter. – Peter Cordes Jun 28 '22 at 02:10
  • Also, you should remove the broken code at the top of your answer. I'm still pretty sure it can't be safe with the writer only updating a flag before or after the payload writes, not both. – Peter Cordes Jun 28 '22 at 02:11
  • 1
    BTW, the possibility of LDREX/STREX continuously failing to do `atomic_fetch_add` wouldn't be a deadlock; that would be a *livelock* - every execution attempt has a possibility of success, and would have to be actively disturbed by another thread to fail. Unlike a deadlock, like waiting for a taken lock that never gets unlocked. But yeah, if you don't need atomic RMW, much better to allow some load / store pipelining by doing all the loads first then all the stores. – Peter Cordes Jun 28 '22 at 02:23
  • @PeterCordes Thanks again for clearing up confusion. So after breaking from the do-while loop, it's guaranteed that no interrupt occurs when reading variables. Now, If I want to do `atomic_fetch_add` in the context of `thread_low_priority`, does `atomic_store` in `thread_high_priority` invalidate the data when an interrupt occurs inside RMW operation? So that RMW in `thread_low_priority` knows it failed last time and can simply redo again. – Taiyou Kuo Jun 28 '22 at 09:04
  • Yes, a properly implemented interrupt handler will use `clrex` if there's any possibility of false-positive `strex` success. But if something else could be atomically *incrementing* `a` while this `thread_high_priority` is running, the separate load and separate store no longer form an atomic RMW, so after both run, you could end up with a value that's only 1 higher, not 2. (high prio loads, low prio does atomic `a += 1`, high prio stores `a = old_a + 1`). – Peter Cordes Jun 28 '22 at 09:30
  • You could use atomic RMWs in both low and high prio threads, if it's ok that the high-prio thread doesn't get an atomic snapshot of all 3 variables as a group when another thread is modifying them. I'd initially said you couldn't use a SeqLock because you have both threads writing the data, but then you posted this answer with an idea that would only work for one thread being read-only. Now it seems that's not your actual situation after all. – Peter Cordes Jun 28 '22 at 09:32
  • @PeterCordes I can make the update only occur in `thread_high_priority` where there is only one writer. But doing so will harm encapsulation in terms of software development, which leak some unnecessary information to `thread_high_priority`. So I was originally searching for a solution for two writers and easy to implement as well. Then I found out RMW could possibly cause livelock, because the compiler issue a loop for every RMW operation on ARMv7M. So in the end, I will implement it with only one writer. – Taiyou Kuo Jun 28 '22 at 10:51
  • Livelock isn't the concern, that would only happen if your low prio thread was doing nothing but atomic RMWs of that cache line for a long time. Otherwise the high prio thread would maybe have to retry once or twice, but no big deal. My concern would be that having another writer means they can't keep the 3 values in sync as an atomically-updated group all the time. If the high-prio thread ever needs to read all 3 and see a consistent set of values, that can't happen while the low-prio thread might have updated only one. – Peter Cordes Jun 28 '22 at 10:58
  • If that's not a problem, only the low-prio thread needs a snapshot at times when the other thread is writing, then you might still be ok. If the low-prio thread isn't in the middle of writing, then the seqlock does still work to let it get a snapshot of the high-prio thread's increments, if they are just increments. And if it was also just incrementing, then when it's done they'll be back in sync. You haven't shown enough of the real problem for me to know whether or not your overall design is safe or not. – Peter Cordes Jun 28 '22 at 10:59
  • @PeterCordes The design is that, in the `thread_low_priority`, I need a snapshot of `a, b, c` and ensure no interrupt occurs when reading them. Because it will do some complicated arithmetic that depends on `a, b, c` on and I can't guarantee it to work when not read in sync. However, when updating in `thread_low_priority`, I only need to modify variables `a` and `b`. `a` is the most important because the value of `a` controls the flow in `thread_high_priority`. On the other hand, `b` will only be served as a counter and is never used by `thread_high_priority` aside from incrementing. – Taiyou Kuo Jun 28 '22 at 11:49
  • Lastly, `c` only updated in `thread_high_priority`. So `a, b, c` don't have to update as an atomically-updated group in `thread_low_priority`. – Taiyou Kuo Jun 28 '22 at 11:50
  • Ok then yeah, you can atomic RMW `a` and `b` in both threads (inside the SeqLock in high-prio), and for `c` do a cheaper separate load / store (with the store also inside the SeqLock). The SeqLock will make sure your low-prio thread can read a consistent snapshot without high-prio being in the middle of an increment of all 3. – Peter Cordes Jun 28 '22 at 11:52
  • @PeterCordes OK, thanks for keep following up on my problem. – Taiyou Kuo Jun 28 '22 at 12:00