
I have a few questions regarding memory barriers.

Say I have the following C code (it will be run both from C++ and C code, so atomics are not possible) that copies one array into another. Multiple threads may call thread_func(), and I want to make sure that my_str is returned only after it has been fully initialized. In this case it is a given that the last byte of the buffer can't be 0, so checking that the last byte is not 0 should suffice.

Due to reordering by the compiler/CPU this can be a problem, as the last byte might get written before the earlier bytes, causing my_str to be returned with a partially copied buffer. To get around this I want to use a memory barrier. A mutex would of course work, but it would be too heavy for my use case.

Keep in mind that all threads will call thread_func() with the same input, so even if multiple threads call init() a couple of times, it's OK as long as, in the end, thread_func() returns a valid my_str and all subsequent calls after initialization return my_str directly.

Please tell me whether each of the following approaches works, or whether there could be issues in some scenarios. Aside from getting a solution to the problem, I'd like to learn some more about memory barriers.

  1. __sync_bool_compare_and_swap on the last byte. If I understand correctly, this is a full barrier: no memory load/store would be reordered across it, not just accesses to the particular variable passed to the builtin. Is that correct? If so, I would expect this to work, as all the writes of the previous bytes should be done before the CAS on the last byte.

     #include <stdint.h>

     #define STR_LEN 100
     static uint8_t my_str[STR_LEN] = {0};

     static void init(uint8_t input_buf[STR_LEN])
     {
         for (int i = 0; i < STR_LEN - 1; ++i) {
             my_str[i] = input_buf[i];
         }
         /* CAS on the last byte, so it is published only after the loop above. */
         __sync_bool_compare_and_swap(&my_str[STR_LEN - 1], 0, input_buf[STR_LEN - 1]);
     }

     const uint8_t * thread_func(uint8_t input_buf[STR_LEN])
     {
         if (my_str[STR_LEN - 1] == 0) {
             init(input_buf);
         }
         return my_str;
     }
    
  2. __sync_bool_compare_and_swap on each write. I would expect this to work as well, but to be slower than the first one.

     static void init(char input_buf[STR_LEN])
     {
         for (int i = 0; i < STR_LEN; ++i) {
             __sync_bool_compare_and_swap(my_str + i, 0, input_buf[i]);
         }
     }
    
  3. __sync_synchronize before each byte copy. I would expect this to work as well, but is this slower or faster than (2)? __sync_bool_compare_and_swap is supposed to be a full barrier as well, so which would be preferable?

     static void init(char input_buf[STR_LEN])
     {
         for (int i = 0; i < STR_LEN; ++i) {
             __sync_synchronize();
             my_str[i] = input_buf[i];
         }
     }
    
  4. __sync_synchronize by condition. As I understand it, __sync_synchronize is both a HW and SW memory barrier. As such, since the compiler can't tell the value of use_sync, it shouldn't reorder. And the HW barrier will be applied only if use_sync is true. Is that correct?

     static void init(char input_buf[STR_LEN], bool use_sync)
     {
         for (int i = 0; i < STR_LEN; ++i) {
             if (use_sync) {
                 __sync_synchronize();
             }
             my_str[i] = input_buf[i];
         }
     }
    
    GNU C legacy `__sync` builtins are not recommended for new code. Use the `__atomic` builtins which can take a memory-order parameter like C11 stdatomic, but they're still builtins and still work on plain types not declared `_Atomic`. https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html . Just do the last store separately with a `__ATOMIC_RELEASE` store in the writer, not an RMW, and definitely not a full memory barrier (`__sync_synchronize()`) between every byte!!! That's *way* slower than necessary, and defeats any optimization to use `memcpy`. – Peter Cordes Oct 13 '22 at 13:11
  • "*it will be run both from C++ and C code, so atomics are not possible*" -- whether the code can use atomics to serve the purpose is not a function of the language in which its callers are implemented. To the extent that language factors into it, it is the language in which this code itself is implemented that matters. – John Bollinger Oct 13 '22 at 13:17
  • `if (my_str[STR_LEN - 1] == 0)` is of course data-race UB when there are concurrent writers. For safety it needs to be an `__ATOMIC_ACQUIRE` load, since you need a thread that loads a non-`0` value to also see all other stores by another thread that ran `init()`. – Peter Cordes Oct 13 '22 at 13:18
  • @JohnBollinger True, I meant std::atomic and didn't notice the other atomic builtins Peter mentioned. – Itay Bianco Oct 13 '22 at 13:21
  • @PeterCordes Does it matter if there's a race, if both write the exact same value? As I understand it, single bytes are atomically written. And as I wrote in the question, all threads running will try to write the same string, so that's not a problem for me. – Itay Bianco Oct 13 '22 at 13:23
  • `my_str[STR_LEN - 1] == 0` is a read that's racing with writers. You need it to be an acquire operation, assuming you're going to read other bytes of the string if you see the final byte non-zero. (You also need it not to get optimized out or in weird ways, although that should be impossible in practice just because there's probably no such possibility.) – Peter Cordes Oct 13 '22 at 13:24
  • From the point of view of language semantics at the C or C++ level, it does not matter whether all the writes write the same value. Multiple writes to the same storage produce a data race if they are not suitably synchronized with respect to each other. Formally speaking, data races produce undefined behavior. Whether that constitutes a problem for your particular code in practice is not clear, but it is not safe to judge that based on machine-language semantics. – John Bollinger Oct 13 '22 at 13:28
  • As for writes of the same value, in the ISO standard it's clearly UB, but in practice it should be fine except with `clang -fsanitize=thread`. [Race Condition with writing same value in C++?](https://stackoverflow.com/q/52371719) and multiple other duplicates. In theory a DeathStation9000 could implement non-atomic stores by storing value+1 and then subtracting 1, so temporarily there'd be a different value in memory. But AFAIK there aren't real compilers that do that. – Peter Cordes Oct 13 '22 at 13:29
  • *does it matter if there's a race, if both write the exact same value?*: Yes, it still matters; the C standard makes no exception for this case. It's still undefined behavior, and compilers can and will take advantage of this to perform optimizations that would break it. *single bytes are atomically written*: The C language makes no such guarantee. It doesn't really matter whether your machine has atomic single-byte store instructions, because the compiler would be allowed to optimize in such a way that it doesn't *execute* that instruction. – Nate Eldredge Oct 13 '22 at 13:30
  • @NateEldredge Thanks! This reinforces the distinction between the compiler and the CPU for me when writing code. – Itay Bianco Oct 13 '22 at 13:50

1 Answer

GNU C legacy __sync builtins are not recommended for new code, as the manual says.

Use the __atomic builtins which can take a memory-order parameter like C11 stdatomic. But they're still builtins and still work on plain types not declared _Atomic, so using them is like C++20 std::atomic_ref. In C++20, use std::atomic_ref<unsigned char>(my_str[STR_LEN - 1]), but C doesn't provide an equivalent so you'd have to use compiler builtins to hand-roll it.

Just do the last store separately with a release store in the writer, not an RMW, and definitely not a full memory barrier (__sync_synchronize()) between every byte!!! That's way slower than necessary, and defeats any optimization to use memcpy. Also, you need the store of the final byte to be at least RELEASE, not a plain store, so readers can synchronize with it. See also Who's afraid of a big bad optimizing compiler? re: how exactly compilers can break your code if you try to hand-roll lockless code with just barriers, not atomic loads or stores. (It's written for Linux kernel code, where a macro would use *(volatile char*) to hand-roll something close to __atomic_store_n with __ATOMIC_RELAXED.)

So something like

__atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1], __ATOMIC_RELEASE);

The if (my_str[STR_LEN - 1] == 0) load in thread_func is of course data-race UB when there are concurrent writers.

For safety it needs to be an acquire load, like __atomic_load_n(&my_str[STR_LEN - 1], __ATOMIC_ACQUIRE) == 0, since you need a thread that loads a non-0 value to also see all other stores by another thread that ran init(). (Which did a release-store to that location, creating acquire/release synchronization and guaranteeing a happens-before relationship between these threads.)

See https://preshing.com/20120913/acquire-and-release-semantics/
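
Putting those two pieces together, a minimal sketch of the writer/reader pair could look like the following (the memcpy-then-release-store split and the uint8_t signatures are illustrative choices, not the only way to write it; the plain memcpy still technically races if two threads run init() at once, which is the same-value-write situation discussed below):

    #include <stdint.h>
    #include <string.h>

    #define STR_LEN 100
    static uint8_t my_str[STR_LEN] = {0};

    static void init(const uint8_t input_buf[STR_LEN])
    {
        /* Plain copy of everything except the final byte. */
        memcpy(my_str, input_buf, STR_LEN - 1);
        /* Release-store the final (non-zero) byte: a reader that sees it
           non-zero with an acquire load also sees all the stores above. */
        __atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1],
                         __ATOMIC_RELEASE);
    }

    const uint8_t *thread_func(const uint8_t input_buf[STR_LEN])
    {
        /* Acquire load pairs with the release store in init(). */
        if (__atomic_load_n(&my_str[STR_LEN - 1], __ATOMIC_ACQUIRE) == 0) {
            init(input_buf);
        }
        return my_str;
    }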


Writing the same value non-atomically is also UB in ISO C and ISO C++. See Race Condition with writing same value in C++? and others.

But in practice it should be fine except with clang -fsanitize=thread. In theory a DeathStation9000 could implement non-atomic stores by storing value+1 and then subtracting 1, so temporarily there'd be a different value in memory. But AFAIK there aren't real compilers that do that. I'd have a look at the generated asm on any new compiler / ISA combination you're trying, just to make sure.

It would be hard to test: the init stuff can only race once per program invocation. But there's no fully safe way to do it that doesn't totally suck for performance, AFAIK. Perhaps do the init through a cast to _Atomic unsigned char *, or use typedef _Atomic unsigned long __attribute__((may_alias)) aliasing_atomic_ulong; as a building block for a manual copy loop, something like the sketch below.
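
For example, a rough sketch of that manual copy loop, using the __atomic builtins on a plain may_alias word type instead of _Atomic (the function name, the explicit alignment on my_str, and the word-at-a-time split are all assumptions made up for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define STR_LEN 100

    /* Word type that may alias anything, so word-sized atomic stores into
       the byte array don't violate strict aliasing. */
    typedef unsigned long __attribute__((may_alias)) aliasing_ulong;

    /* Align the destination so the word-sized atomic stores are aligned. */
    static _Alignas(sizeof(aliasing_ulong)) uint8_t my_str[STR_LEN] = {0};

    static void init_atomic_words(const uint8_t input_buf[STR_LEN])
    {
        size_t i = 0;
        /* Whole words with relaxed atomic stores; no tearing within a word. */
        for (; i + sizeof(aliasing_ulong) <= STR_LEN - 1; i += sizeof(aliasing_ulong)) {
            aliasing_ulong chunk;
            memcpy(&chunk, input_buf + i, sizeof(chunk));  /* plain read of the source */
            __atomic_store_n((aliasing_ulong *)(my_str + i), chunk, __ATOMIC_RELAXED);
        }
        /* Leftover bytes, except the final one. */
        for (; i < STR_LEN - 1; ++i) {
            __atomic_store_n(&my_str[i], input_buf[i], __ATOMIC_RELAXED);
        }
        /* Publish with a release store of the final (non-zero) byte. */
        __atomic_store_n(&my_str[STR_LEN - 1], input_buf[STR_LEN - 1], __ATOMIC_RELEASE);
    }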


Bonus question: if(use_sync) __sync_synchronize() inside the loop.

Since the compiler can't tell the value of use_sync, it shouldn't reorder.

The compiler can optimize this into asm that works something like if(use_sync) { slow barrier loop } else { no-barrier loop }. This is called "loop unswitching": making two loops and branching once to decide which to run, instead of branching every iteration. GCC has been able to do that optimization (in some cases) since 3.4. So that defeats your attempt to trick the compiler into providing more ordering than the source actually requires.
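
In other words, using the declarations from the question, the compiler is allowed to turn the conditional-barrier loop into something shaped roughly like this (hand-written to show the transformation, not actual compiler output):

    static void init(char input_buf[STR_LEN], bool use_sync)
    {
        if (use_sync) {
            for (int i = 0; i < STR_LEN; ++i) {
                __sync_synchronize();          /* full barrier every iteration */
                my_str[i] = input_buf[i];
            }
        } else {
            /* No barrier anywhere: the compiler is free to reorder,
               combine, or vectorize these stores. */
            for (int i = 0; i < STR_LEN; ++i) {
                my_str[i] = input_buf[i];
            }
        }
    }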

And the HW barrier will be applied only if use_sync is true.

Yes, that part is correct.

Also, inlining and constant-propagation of use_sync could easily defeat this, unless use_sync was a volatile global or something. At that point you might as well just make a separate _Atomic unsigned char array_init_done flag / guard variable.

And you can use it for mutual exclusion by having threads try to set it to 1 with int old = guard.exchange(1): the winner of the race runs init while the others spin-wait (or C++20 .wait(1)) for the guard variable to become 2 or -1 or something, which the winner will set after finishing init.

Have a look at the asm GCC makes for C++ non-constant-initialized static local vars; they check a guard variable with an acquire load, only doing locking to have one thread do the run_once init stuff and the others wait for that result. IIRC there's a Q&A about doing that yourself with atomics.
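
As a concrete illustration, here's one way that hand-rolled run-once pattern could look in plain C with the __atomic builtins. Everything here (the init_state name, the 0/1/2 encoding, the busy spin) is an assumption for the sketch, and it uses a compare-exchange rather than a bare exchange so a latecomer can't overwrite the "done" state:

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define STR_LEN 100
    static uint8_t my_str[STR_LEN];

    /* Guard variable: 0 = not started, 1 = init in progress, 2 = done. */
    static unsigned char init_state = 0;

    const uint8_t *thread_func(const uint8_t input_buf[STR_LEN])
    {
        if (__atomic_load_n(&init_state, __ATOMIC_ACQUIRE) != 2) {
            unsigned char expected = 0;
            if (__atomic_compare_exchange_n(&init_state, &expected, 1, false,
                                            __ATOMIC_ACQUIRE, __ATOMIC_ACQUIRE)) {
                /* We won the race: do the one-time init, then publish. */
                memcpy(my_str, input_buf, STR_LEN);
                __atomic_store_n(&init_state, 2, __ATOMIC_RELEASE);
            } else {
                /* Another thread is doing the init: wait until it publishes. */
                while (__atomic_load_n(&init_state, __ATOMIC_ACQUIRE) != 2) {
                    /* spin; real code might yield or use a futex here */
                }
            }
        }
        return my_str;
    }

With only one thread ever running the memcpy, the plain copy no longer races at all.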

Peter Cordes
  • Thanks a lot for the detailed responses. As a general curiosity, can you answer question (4)? – Itay Bianco Oct 13 '22 at 13:52
  • @ItayBianco: Sure, updated. – Peter Cordes Oct 13 '22 at 14:16
  • You mentioned that there is still UB in case two threads call init() together. If I use memcpy (on ARM, where it is inherently thread-safe) to store the buffer, should I still store/load the last byte with an atomic operation using ATOMIC_ACQUIRE/RELEASE to establish happens-before synchronization? Or is memcpy considered a synchronization point, so only the atomic load is needed? In that case, atomically loading the first byte to check would also work, since the store would be atomic for all bytes, is that correct? – Itay Bianco Oct 25 '22 at 08:01
  • @ItayBianco: `memcpy` doesn't guarantee that the last byte or chunk gets stored with RELEASE semantics, so it's not helpful. IDK what you mean about it being "thread safe". It's certainly not an atomic store of *all* the bytes together, unless the copy size happens to be 4 bytes and they're aligned, in which case maybe. It sounds like you're wildly over-interpreting something like POSIX saying that `memcpy` is MT-safe, which just means it doesn't have any internal static state so different threads can simultaneously call `memcpy` on *different* regions without a problem. – Peter Cordes Oct 25 '22 at 12:38
  • I'm guessing it's not RELEASE either. But what do you make of this, regarding it being inherently thread-safe? https://developer.arm.com/documentation/dui0475/m/the-arm-c-and-c---libraries/multithreaded-support-in-arm-c-libraries/thread-safety-in-the-arm-c-library – Itay Bianco Oct 26 '22 at 14:04
  • @ItayBianco: Yeah, like I thought, they just mean thread-safe in terms of not accessing any internal shared state. (The `malloc` example makes that clear, that it *would* be a problem without `_mutex_...` functions, for two threads to call the function at the same time.) They definitely don't mean it's safe to have two threads `memcpy` to the same buffer at the same time! Or that it won't crash, but there's no guarantee of what you'll find in the buffer when you're done. – Peter Cordes Oct 26 '22 at 14:13
  • Yes, it seems you are correct. Here is more data in case anyone else gets here: https://developer.arm.com/documentation/den0013/d/Parallelizing-Software/Threading-libraries/Thread-safety-and-reentrancy. But as far as avoiding UB for my problem goes, using either memcpy or RELAXED stores for all but the last byte, with the last byte stored RELEASE, should be fine, right? In my testing, the performance penalty was not very big. – Itay Bianco Oct 26 '22 at 14:58
  • @ItayBianco: Yes, sure, you can use the final byte for synchronization if the readers can reliably tell that it changed, so they know whether or not they saw the writer's data instead of whatever garbage was in the buffer before. Instead of a single byte, you could also leave a whole word to be done as a release store; for a normal buffer size, that would leave memcpy doing a whole number of words, instead of having to do an odd 3 bytes at the end. And the final release store can be a whole word, not a byte, which is slightly more efficient on most ARM chips. – Peter Cordes Oct 26 '22 at 15:35
  • Thanks a lot for putting up with all the questions. It's a great help. – Itay Bianco Oct 26 '22 at 16:04