
I am using the C volatile keyword in combination with x86 memory ordering guarantees (writes are ordered with writes, and reads are ordered with reads) to implement a barrier-free message queue. Does gcc provide a builtin function that efficiently copies data from one volatile array to another?

That is, is there a builtin or otherwise efficient function that we could call as memcpy_volatile is used in the following example,

uint8_t volatile      * dest = ...;
uint8_t volatile const* src  = ...;
int     len  = ...;
memcpy_volatile(dest, src, len);

rather than writing a naive loop?
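For reference, here is a minimal sketch of the naive loop in question, i.e. a straightforward byte-at-a-time copy:

static void memcpy_volatile(uint8_t volatile *dest,
                            uint8_t volatile const *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dest[i] = src[i];   // one volatile load and one volatile store per byte
}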

This question is NOT about the popularity of barrier-free C programs. I am perfectly aware of barrier-based alternatives. This question is, therefore, also NOT a duplicate of any question where the answer is "use barrier primitives".

This question is also NOT a duplicate of similar questions that are not specific to x86/gcc, where of course the answer is "there's no general mechanism that works on all platforms".

Additional Detail

memcpy_volatile is not expected to be atomic. The ordering of operations within memcpy_volatile does not matter. What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. So if the other thread sees the new pointer (dest) then it must also see the data that was copied to *dest. This is the essential requirement for barrier-free queue implementation.
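Concretely, the intended pattern is something like this (a sketch; msg_ptr is a hypothetical volatile pointer used only for publication):

uint8_t volatile *volatile msg_ptr;      // hypothetical publication variable

/* writer */
memcpy_volatile(dest, src, len);         // (1) data writes
msg_ptr = dest;                          // (2) pointer write: a thread that
                                         //     sees (2) must also see (1)

/* reader */
uint8_t volatile const *p = msg_ptr;     // if this observes the new pointer,
                                         // reads of p[0..len-1] see the new data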

personal_cloud
  • Do you have some memory-ordering requirement for the stores done by this memcpy? If not, just cast away `volatile` and call regular `memcpy((char*)dst, (char*)src, len);`, although that could maybe allow the compiler to invent loads, etc. (https://lwn.net/Articles/793253/) if used to copy a single `int` or something. For a larger copy, a compiler barrier like `asm("" ::: "memory")` should be sufficient to make sure it actually happened, and order it at compile-time vs. your volatiles. Still zero asm instructions. – Peter Cordes Jun 16 '22 at 00:16
  • (Also, I don't get what you're hoping to gain vs. using `stdatomic.h` with `memory_order_relaxed`, or `acquire` / `release` which is also free on x86, only blocking compile-time reordering. Including wrt. non-atomic variables, unlike with volatile. It should compile to the same asm you can get with `volatile`, but in a standards-compliant way. Except you still need to cast pointers to get efficient bulk copies with memcpy, instead of doing each _Atomic access with a separate instruction.) – Peter Cordes Jun 16 '22 at 00:16
  • Worst case, you might roll your own with `volatile __m256i*` pointers for an AVX copy loop. Or `volatile __m256i_u*` unaligned (or define your own unaligned vector type with GNU C syntax instead of relying on the internals of GCC's `immintrin.h`) – Peter Cordes Jun 16 '22 at 00:18
  • @PeterCordes Isn't casting away `volatile` straight up UB, not just that it could cause invented loads? – Joseph Sible-Reinstate Monica Jun 16 '22 at 00:36
  • @JosephSible-ReinstateMonica: So is a data-race on `volatile` objects. The reason it can cause invented loads is that it's UB in ISO C, but if you're rolling your own atomics you're not worried about the ISO C standard, rather about de-facto guarantees, and actual documented guarantees from things like `asm("" :::"memory")`. – Peter Cordes Jun 16 '22 at 00:38
  • @JosephSible-ReinstateMonica: Or I guess the other important factor is that we don't know what concurrency requirements / conditions exist for this `memcpy`. Whether it's data that got declared as `volatile` because of later uses (so in C++20 we'd use non-`volatile` now, and `std::atomic_ref` for thread-safe accesses once other threads have started). Or if there's any atomicity requirement for elements larger than bytes, or if this is something like a SeqLock reader where we'll check for potential tearing after, and not use the copy result if there was any. – Peter Cordes Jun 16 '22 at 00:43
  • A SeqLock use-case would have sufficient other stuff going on to make sure the copy actually happened at some point between two other accesses. – Peter Cordes Jun 16 '22 at 00:44
  • @PeterCordes No, there aren't atomicity requirements. Just that the sequence of memcpy(dest, ...) followed by advertising the dest pointer to another thread (by writing it to a volatile pointer variable) must appear in the same order to the other thread. So if the other thread sees the new dest pointer, it must also see the new data that was copied to *dest. – personal_cloud Jun 16 '22 at 05:18
  • Ok, so the memcpy result is "published" via release/acquire synchronization. It doesn't need to be `volatile`, then, except to make sure of compile-time ordering because `volatile` operations are only guaranteed ordered wrt. other `volatile` operations. Just use `asm("" ::: "memory")` to block compile-time reordering instead (before the release-store). Or just use `atomic_store_explicit(&shared_pointer, dest, memory_order_release)` to get ordering wrt. non-atomic operations without any barrier instructions needed on strongly-ordered ISAs like x86. – Peter Cordes Jun 16 '22 at 05:20
  • @Peter Yes, `asm volatile("" ::: "memory");` seems to be what I'm looking for... So basically I think you're saying I could put that directive in between a normal `memcpy` and its subsequent advertisement to a volatile variable. And then reverse the sequence on the reader side (to prevent some kind of prefetch optimization). – personal_cloud Jun 16 '22 at 05:28
  • @Peter This is great... the `memcpy` makes my program run 25% faster than my hand-optimized volatile version! Of course, it "works" even without the `asm` directive, and the speed is the same... but I prefer to keep the directive so that the program is relying only on documented properties. I would accept your solution with the `asm` directive. `atomic_store_explicit` looks promising too... but is that available in plain C on x86? – personal_cloud Jun 16 '22 at 05:34
  • @Peter Actually, I think it's even better than that in this application... I think I can do away with the `memcpy` altogether in most cases... I can cast the volatile pointer to a normal pointer, pass it over to normal code that creates the message in-place, then do the `asm` directive to force the message creation to complete before advertising the pointer. But I digress, as the question was just about `memcpy`. – personal_cloud Jun 16 '22 at 05:51

1 Answer


memcpy_volatile is not expected to be atomic. ... What matters is that if memcpy_volatile(dest, ...) is done before advertising the dest pointer to another thread (via another volatile variable) then the sequence (data write, pointer write) must appear in the same order to the other thread. ...

OK, that makes the problem solvable: you're just "publishing" the memcpy stores via release/acquire synchronization.

The buffers don't need to be volatile, then, except as one way to ensure compile-time ordering before some other volatile store, because volatile operations are only guaranteed ordered (at compile time) with respect to other volatile operations. Since the buffer isn't being concurrently accessed while you're storing, the possible gotchas in Who's afraid of a big bad optimizing compiler? aren't a factor.


To hack this into your hand-rolled atomics with volatile, use GNU C asm("" ::: "memory") as a compiler memory barrier to block compile-time reordering between the memcpy and the release-store.

volatile uint8_t *shared_var;    // hand-rolled publication pointer

  memcpy((char*)dest,  (const char*)src, len);   // casts discard volatile
  asm("" ::: "memory");      // compiler barrier: keeps the memcpy before the store
  shared_var = dest;         // release-store (x86 keeps StoreStore order in hardware)

But really you're just making it inconvenient for yourself by avoiding C11 stdatomic.h for atomic_store_explicit(&shared_var, dest, memory_order_release) or GNU C __atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE), which are ordered wrt. non-atomic accesses like a memcpy. Using a memory_order other than the default seq_cst lets it compile with no overhead on x86, to the same asm you get from volatile.
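For example, a sketch of the stdatomic version (the publish wrapper and variable names are just for illustration):

#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

_Atomic(uint8_t *) shared_var;              // plain pointer payload; no volatile needed

void publish(uint8_t *dest, const uint8_t *src, size_t len)
{
    memcpy(dest, src, len);                 // plain stores, no casts required
    // release-store: ordered after the memcpy; compiles to a plain mov on x86
    atomic_store_explicit(&shared_var, dest, memory_order_release);
}

The reader pairs this with atomic_load_explicit(&shared_var, memory_order_acquire) before dereferencing the pointer.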

The compiler knows x86's memory ordering rules, and will take advantage of them by not using any extra barriers except for seq_cst stores. (Atomic RMWs on x86 are always full barriers, but you can't do those using volatile.)

Avoid RMW operations like x++ if you don't actually need atomicity for the whole operation; volatile x++ is more like atomic_store_explicit(&x, 1+atomic_load_explicit(&x, memory_order_acquire), memory_order_release), which is a big pain to type, but often you'd want to load into a tmp variable anyway.
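For example (a sketch; counter is a hypothetical _Atomic variable):

_Atomic unsigned counter;

// Atomic RMW: compiles to `lock add` on x86, which is also a full barrier:
atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);

// Separate load and store: plain mov instructions, no lock prefix, but NOT
// atomic as a whole (fine when there's only one writer):
unsigned tmp = atomic_load_explicit(&counter, memory_order_relaxed);
atomic_store_explicit(&counter, tmp + 1, memory_order_relaxed);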

If you're willing to use GNU C features like asm("" ::: "memory"), you can use its __atomic built-ins instead, without even having to change your variable declarations like you would for stdatomic.h.

volatile uint8_t *shared_var;    // declaration unchanged; no _Atomic needed

  memcpy((char*)dest,  (const char*)src, len);
  // a release-store is ordered after all previous stuff in this thread
  __atomic_store_n(&shared_var, dest, __ATOMIC_RELEASE);
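The reader side would pair this with an acquire-load (again a sketch, reusing shared_var from above):

  uint8_t *p = (uint8_t *)__atomic_load_n(&shared_var, __ATOMIC_ACQUIRE);
  if (p) {
      // the acquire-load orders subsequent reads of *p after the load of the
      // pointer, so the data copied by the writer's memcpy is visible here
  }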

As a bonus, doing it this way makes your code portable to non-x86 ISAs, e.g. AArch64, where it could compile the release-store to stlr. (And no separate barrier could be that efficient.)

The key point is that there's no downside in the generated asm for x86.

As in When to use volatile with multi threading? (answer: never), use atomics with memory_order_relaxed, or with acquire / release, to get C-level guarantees equivalent to x86's hardware memory ordering.

Peter Cordes
  • Yes, the `asm` trick looks like an ideal solution! Thank you also for educating me re: acquire/release. But doesn't acquire/release imply an (expensive) memory barrier? Per that link you sent, "release semantics prevent memory reordering of the write-release with any *read or* write operation that precedes it" (emphasis on "read" mine). Being a read->write constraint, that seems stronger than x86's automatic guarantee, and would seem to require a memory barrier? – personal_cloud Jun 16 '22 at 16:21
  • @personal_cloud: No, the only runtime memory reordering allowed on x86 is StoreLoad, not LoadStore, LoadLoad, or StoreStore. x86 is program-order + a store buffer with store-forwarding. Maintaining LoadStore ordering is not a burden at all (https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/); CPUs already want to load as early as possible, and store late ([after stores are non-speculative, hence a store-buffer](https://stackoverflow.com/questions/64141366/can-a-speculatively-executed-cpu-branch-contain-opcodes-that-access-ram)) – Peter Cordes Jun 16 '22 at 20:27
  • Ah, thanks for explaining! Indeed, it looks like the gcc atomic builtins generate identical assembly code; there is no barrier instruction. Awesome! You also seem to be suggesting that I can then get rid of the `volatile` keyword on both the pointer and the data? Is that documented? I agree I no longer need it on the data. But don't I still need `volatile` on the pointer to guarantee it actually gets written out to memory? `__ATOMIC_RELEASE` prevents *sinking* of code from before to after the atomic, but I don't see that it prevents delaying the atomic write itself... – personal_cloud Jun 18 '22 at 20:05
  • 1
    @personal_cloud: `__atomic` builtins implement ISO C11 / C++11 atomic semantics, which include a guarantee that stores in one thread will be seen by loads in other threads "reasonably" promptly (https://eel.is/c++draft/atomics.order#11), and in finite time (https://eel.is/c++draft/intro.progress#18). In practice as a quality-of-implementation issue, compilers treat accesses to `_Atomic` variables (and accesses done with `__atomic_...` builtins) very much like `volatile`: [Why don't compilers merge redundant std::atomic writes?](https://stackoverflow.com/q/45960387) – Peter Cordes Jun 18 '22 at 20:37
  • Ah, there it is in note 11. Not just an eventuality guarantee, but timeliness too! Great point about portability too. And certainly the code looks a lot more readable without `volatile` everywhere... Thank you for solving this, and taking the time to educate me. – personal_cloud Jun 18 '22 at 21:41