Achieve atomic increment on a 64-bit variable on a 32-bit system

Question

I am trying to do an atomic increment on a 64-bit variable on a 32-bit system. I am trying to use `atomic_fetch_add_explicit(&system_tick_counter_us, 1, memory_order_relaxed);`

But the compiler throws out an error: `warning: large atomic operation may incur significant performance penalty; the access size (8 bytes) exceeds the max lock-free size (4 bytes) [-Watomic-alignment]`

My question is how I can achieve atomicity without using critical sections.

Standard atomic functions try to use atomic instructions when available. When that is not possible, they fall back to a slow lock. This happens when the target processor does not support the operation or when the standard library is not optimized for that processor. Without more information about the target processor, this question cannot be answered. This is especially true since 32-bit processors are now gone from PCs/servers and are, AFAIK, only used in things like embedded systems with more unusual processors. Besides, isn't `atomic_fetch_add_explicit` a C++ function (not C)? – Jérôme Richard, Jan 11 '23 at 19:22
That is not an error message. That is a warning message. "Warning" doesn't mean that your program is wrong. It only means that you've done something that often is a mistake when other people do it. Maybe it was a mistake when you did it. Maybe not. If you are really sure it was not a mistake, then often there is some way to mark the code that triggered the warning (e.g., surrounding it with a `#pragma ...`) so that the compiler won't warn you about it in the future. — Solomon Slow, Jan 11 '23 at 19:25
Yeah, it's a warning, but it means my increment operation will not be atomic. – Priyank Pandya, Jan 11 '23 at 19:27
Re, "...it means my increment operation will not be atomic." That's not what the message says. The message says that your operation "may incur significant performance penalty." I.e., it may not be as fast as you were expecting. — Solomon Slow, Jan 11 '23 at 19:31
@JérômeRichard, it was added in C11, see https://en.cppreference.com/w/c/atomic/atomic_fetch_add — tstanisl, Jan 11 '23 at 19:55
@PriyankPandya: It will still be atomic, but the compiler will achieve that by effectively making this a critical section. A mutex will be created somewhere behind the scenes, and all atomic accesses to this variable will lock and unlock the mutex. That's what it means when an atomic is "not lock free". It's a problem for performance but not for correctness. — Nate Eldredge, Jan 11 '23 at 21:55
Thank you all for clearing up my doubt. Using the standard atomics for a 64-bit add will have a performance impact, but I think it's still better than using a critical section in my code. – Priyank Pandya, Jan 11 '23 at 23:42
What 32-bit system? If it's ARM, ARMv7-A has 64-bit atomic RMW, but you have to use the right compiler options to tell the compiler it can use those instructions, not just baseline ARMv5 or whatever. But other targets simply don't support it in a lock-free way, in which case you're stuck with non-lock-free `_Atomic uint64_t`, or using other algorithms for your wider data such as a SeqLock ([Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) / [A readers/writer lock... without having a lock for the readers?](https://stackoverflow.com/q/61237650)) — Peter Cordes, Jan 12 '23 at 04:01
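
The point made in the comments (a non-lock-free atomic is still atomic, just slower) can be checked on a given target. The following is a minimal sketch, not a definitive implementation; the helper names `tick_counter_is_lock_free` and `tick_increment` are hypothetical and only `system_tick_counter_us` comes from the question.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Counter from the question, declared as a 64-bit atomic. */
static _Atomic uint64_t system_tick_counter_us;

/* Hypothetical helper: reports whether the implementation can update
 * the counter without a hidden lock on this target. */
bool tick_counter_is_lock_free(void) {
    return atomic_is_lock_free(&system_tick_counter_us);
}

/* Hypothetical helper: the increment is atomic in both cases; only the
 * cost differs. Returns the value the counter held before the increment. */
uint64_t tick_increment(void) {
    return atomic_fetch_add_explicit(&system_tick_counter_us, 1,
                                     memory_order_relaxed);
}
```

If `tick_counter_is_lock_free()` returns false, the increment still works but goes through a lock inside the runtime library; with GCC or Clang that typically also means linking with `-latomic`.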

score 3 · Accepted Answer · answered Jan 11 '23 at 19:28

> how I can achieve atomicity without using critical sections?

On an object that is larger than the size of a single memory "word?" You probably can't. End of story. Use a mutex. Or, use atomic<...> and accept that the library will use a mutex on your behalf.

Solomon Slow

This is *not true on all platforms*. In fact, mainstream x86-64 processors have a `LOCK CMPXCHG16B` instruction, for example, which can atomically operate on 128 bits of data on a 64-bit processor. AFAIK, `LOCK CMPXCHG8B` is supported on 32-bit x86 processors and deals with 64-bit data. – Jérôme Richard Jan 12 '23 at 02:24
@JérômeRichard: Yup, and 32-bit x86 has cheap pure-load and pure-store for aligned 64-bit objects, using SSE2 `movq xmm, [mem]`. (Intel recently documented that the AVX feature bit implies 16-byte atomicity for aligned 16-byte load/store, but before that x86-64 had to use `lock cmpxchg16b` for `foo.load()` or `foo.store()`, so it didn't have the expected read-side scalability, see [How can I implement ABA counter with c++11 CAS?](https://stackoverflow.com/q/38984153)). – Peter Cordes Jan 12 '23 at 03:44
@JérômeRichard Anyway yes, many 32-bit ISAs provide ways to do 8-byte atomic RMWs, like ARMv7 with `ldrexd` / `strexd`. Many algorithms want to RMW something the width of 2 pointers, or wider than 1 pointer at least. https://godbolt.org/z/s6scE8xc8 (but not all. Maybe there's an option for PowerPC I'm not aware of, or maybe GCC calls libatomic functions because it's useful to dispatch according to something, like if running on a 64-bit-capable Power CPU?) Anyway, `gcc -m32` and ARM `clang -mcpu=cortex-a15` (a fairly old ARMv7) both inline lock-free `fetch_add` – Peter Cordes Jan 12 '23 at 03:53
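
To illustrate the comments above: on targets with a 64-bit compare-and-swap (such as `lock cmpxchg8b` on 32-bit x86), a 64-bit increment is conceptually a retry loop around that primitive. The sketch below spells the loop out by hand in C11; the function name `tick_increment_cas` is hypothetical. In practice, calling `atomic_fetch_add_explicit` directly and giving the compiler the right target options should produce equivalent code.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t system_tick_counter_us;

/* Hypothetical helper: increments the counter with an explicit
 * compare-and-swap retry loop and returns the previous value. */
uint64_t tick_increment_cas(void) {
    uint64_t old = atomic_load_explicit(&system_tick_counter_us,
                                        memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(&system_tick_counter_us,
                                                  &old, old + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
        /* On failure, `old` has been refreshed with the current value,
         * so the next attempt uses an up-to-date old + 1. */
    }
    return old;
}
```

Whether this loop is lock-free still depends on the target: where no 8-byte compare-and-swap exists, the implementation falls back to a lock here as well.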

Willis Hershey · Answer 2 · 2023-01-11T19:50:03.683

If you need to read the 64-bit value while the program is running then you probably can't do this safely without a mutex as others have said, but on the off-chance that you only need to read this value after all of the threads have finished, then you can implement this with an array of 2 32-bit atomic variables.

Since your system can only guarantee atomicity of this type on 4-byte memory regions, you should use those instead to maximize performance, for instance:

#include <stdint.h>     // uint32_t, uint64_t
#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>

// [0] holds the low 32 bits, [1] holds the high 32 bits
_Atomic uint32_t system_tick_counter_us[2];

Then increment one of those two 4-byte atomic variables whenever you want to increment an 8-byte one, then check if it overflowed, and if it did, atomically increment the other. Keep in mind that atomic_fetch_add_explicit returns the value of the atomic variable before it was incremented, so it's important to check for the value that will cause the overflow, not zero.

// fetch_add returns the old value: UINT32_MAX means the low word just wrapped to 0
if(atomic_fetch_add_explicit(&system_tick_counter_us[0], 1, memory_order_relaxed) == UINT32_MAX)
    atomic_fetch_add_explicit(&system_tick_counter_us[1], 1, memory_order_relaxed);

However, as I mentioned, this has a race condition: if the 64-bit value is constructed after `system_tick_counter_us[0]` has overflowed but before that same thread has incremented `system_tick_counter_us[1]`, the result is wrong. If you can guarantee that all threads have finished executing the two lines above, then this is a safe solution.

The 64-bit value can be constructed as `((uint64_t)system_tick_counter_us[1] << 32) | (uint64_t)system_tick_counter_us[0]` once you're sure the memory is no longer being modified.
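
Putting the pieces of this answer together, a minimal sketch might look like the following. The function names `tick_increment_split` and `tick_read_quiescent` are hypothetical, and the read function assumes that no thread can still be inside the increment when it is called.

```c
#include <stdatomic.h>
#include <stdint.h>

/* [0] holds the low 32 bits, [1] holds the high 32 bits. */
static _Atomic uint32_t system_tick_counter_us[2];

/* Hypothetical helper: increments the low word and carries into the
 * high word when the low word wraps around. */
void tick_increment_split(void) {
    uint32_t old_low = atomic_fetch_add_explicit(&system_tick_counter_us[0], 1,
                                                 memory_order_relaxed);
    if (old_low == UINT32_MAX) {
        atomic_fetch_add_explicit(&system_tick_counter_us[1], 1,
                                  memory_order_relaxed);
    }
}

/* Hypothetical helper: only meaningful once all writers have finished,
 * because the two halves are read separately. */
uint64_t tick_read_quiescent(void) {
    uint64_t high = atomic_load_explicit(&system_tick_counter_us[1],
                                         memory_order_relaxed);
    uint64_t low = atomic_load_explicit(&system_tick_counter_us[0],
                                        memory_order_relaxed);
    return (high << 32) | low;
}
```

Calling `tick_read_quiescent()` while another thread is between the two `fetch_add` calls can return a value that is off by 2^32, which is exactly the race described above.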

The standard way to have one thread (or timer interrupt) occasionally updating a timestamp that's read frequently is a SeqLock. That allows read-only readers in the common case fast path, but can detect tearing. [Implementing 64 bit atomic counter with 32 bit atomics](https://stackoverflow.com/q/54611003) / [A readers/writer lock... without having a lock for the readers?](https://stackoverflow.com/q/61237650) — Peter Cordes, Jan 12 '23 at 04:00
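
A minimal sketch of the SeqLock idea from the comment above, assuming exactly one writer (for example a single timer interrupt or one thread) and any number of readers. All names (`tick_seq`, `tick_lo`, `tick_hi`, `seq_tick_increment`, `seq_tick_read`) are hypothetical, and this is an illustration of the technique rather than a drop-in implementation.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t tick_seq; /* even: stable, odd: update in progress */
static _Atomic uint32_t tick_lo;  /* low 32 bits of the counter */
static _Atomic uint32_t tick_hi;  /* high 32 bits of the counter */

/* Single writer only: two concurrent callers would corrupt the sequence. */
void seq_tick_increment(void) {
    uint32_t seq = atomic_load_explicit(&tick_seq, memory_order_relaxed);
    /* Mark the update as in progress (sequence becomes odd). */
    atomic_store_explicit(&tick_seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    uint32_t lo = atomic_load_explicit(&tick_lo, memory_order_relaxed) + 1;
    atomic_store_explicit(&tick_lo, lo, memory_order_relaxed);
    if (lo == 0) { /* low word wrapped: carry into the high word */
        uint32_t hi = atomic_load_explicit(&tick_hi, memory_order_relaxed) + 1;
        atomic_store_explicit(&tick_hi, hi, memory_order_relaxed);
    }

    /* Publish the update (sequence becomes even again). */
    atomic_store_explicit(&tick_seq, seq + 2, memory_order_release);
}

/* Readers never write shared state; they retry if they detect tearing. */
uint64_t seq_tick_read(void) {
    uint32_t s0, s1, lo, hi;
    do {
        s0 = atomic_load_explicit(&tick_seq, memory_order_acquire);
        lo = atomic_load_explicit(&tick_lo, memory_order_relaxed);
        hi = atomic_load_explicit(&tick_hi, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        s1 = atomic_load_explicit(&tick_seq, memory_order_relaxed);
    } while ((s0 & 1u) != 0 || s0 != s1);
    return ((uint64_t)hi << 32) | lo;
}
```

One caveat with this design: a reader that preempts the writer in the middle of an update on the same core (for instance a higher-priority interrupt) would spin forever, so the reader must not be able to interrupt the writer.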

score 0 · Answer 3 · answered Jan 11 '23 at 19:38

You can't even do this with 32-bit data on systems which have to read the memory, modify the value, and write it back (RMW). Almost all (if not all) RISC processors have no instruction that modifies memory in a single step. That includes all ARM Cortex micros, RISC-V and many, many other processors.

Many of them have special hardware mechanisms which can help achieve atomic access (or at least prevent other processes from accessing the data). Cortex-M cores have the LDREX and STREX instructions, and some have hardware mutexes or semaphores, but they still require the programmer to provide atomic (or at least mutually exclusive) access to the memory location.

When you say "ARM Cortex micros", you mean microcontrollers like Cortex-M series? ARM Cortex-A CPUs like Cortex-A15 (an old ARMv7-A) have `ldrexd` and `strexd` to support lock-free `std::atomic` including stuff like `fetch_add`. https://godbolt.org/z/4YnYnGYG6 — Peter Cordes, Jan 12 '23 at 03:56
