3

Global data:

uint16_t global_buffer[128];

Thread 1:

uint16_t local_buffer[128];
while(true)
{
    ...
    if(data_ready)
        memcpy(global_buffer, local_buffer, sizeof(uint16_t)*128);
}

Thread 2:

void timer_handler()
{
    uint16_t value = global_buffer[10];
    //do something with value
}

My question is whether this is safe to do. That is, is it guaranteed that `value` will hold either the old value or the new value (if thread 1's `memcpy()` is interrupted by a context switch)? Or is it possible that the `memcpy` gets interrupted after one byte of a 16-bit element has been updated but not the other? In that case, `value` would be garbage.

If the memcpy operation can only be interrupted on boundaries between even numbers of bytes, I think this is safe.

Platforms: x86 & x86-64 only (only Intel i7 processor or newer actually)
OS: Linux
Compiler: gcc

  • Looks like a hypothetical and unrealistic scenario; is there a more concrete issue you are trying to solve? If your real solution relies on unreliable "wishful thinking" behaviour, it probably needs a redesign rather than confirmation of the behaviour of `memcpy()`. – Clifford May 22 '21 at 07:26
  • It will be hard to get guarantees, but you are not alone in relying on this kind of partial atomicity. (btw, the [x86] tag could be useful) – Marc Glisse May 22 '21 at 07:26
  • Consider that the code might be running on an 8-bit machine. There's no thread-safe copying possible there without appropriate protection (and such MCUs are still available today, e.g. [STM8 by ST](https://www.st.com/en/microcontrollers-microprocessors/stm8-8-bit-mcus.html) or some by [NXP](https://www.nxp.com/products/processors-and-microcontrollers/additional-mpu-mcus-architectures/8-bit-s08-mcus:HCS08)). I think there are even some 8086 derivatives still around today, but I wouldn't put my hand in the fire for that. – Aconcagua May 22 '21 at 08:03
  • @Clifford This is from a real application, just not my design. The problem is that there are too many threads that read from the global buffer, so adding a mutex lock everywhere is cumbersome. Another option is to use a reader-writer lock so that concurrent reads are allowed and exclusive locking happens only when the buffer is written. – Syam May 22 '21 at 08:58
  • @Aconcagua I have edited my question for more clarity on that x86 thing. I'm exclusively on modern processors (i7 or newer). I am running both 32-bit and 64-bit OS though. – Syam May 22 '21 at 09:00
  • The asm part is going to boil down to [Per-element atomicity of vector load/store and gather/scatter?](https://stackoverflow.com/q/46012574) - no vendor-documented guarantees, unfortunately, but almost certainly safe in practice. But you realize this has data-race undefined behaviour, right? You can't safely do this without extra memory barriers like `asm("" ::: "memory")` to put bounds on reordering of the reads and writes. (Or for the scalar read, better to use `__atomic_load_n` with `__ATOMIC_RELAXED`, https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html) – Peter Cordes May 22 '21 at 09:14
  • And see also https://lwn.net/Articles/793253/ - Who's afraid of a big bad optimizing compiler? - re: fencing vs. volatile. (Normally I'd suggest using `_Atomic` or `std::atomic<>` (you tagged both C and C++?), but they don't even let you try to copy an array efficiently, not even for a seqlock where you detect whether tearing was possible.) Speaking of which, if you do need any consistency between elements, you might want a seqlock. Like [Implementing 64 bit atomic counter with 32 bit atomics](//stackoverflow.com/q/54611003), but you may have to avoid volatile or use SIMD intrinsics. – Peter Cordes May 22 '21 at 09:19
  • Also, what's this `data_ready`? Is it also global and written by other threads? Global data-ready flags shared between threads typically need release/acquire synchronization (free on x86; just limiting compile-time reordering is sufficient). Anyway, I have an answer half-written, but all this data-race UB left me wondering what's really going to make the C part of this safe, not just the likely asm in memcpy itself. – Peter Cordes May 22 '21 at 09:23
  • I should clarify. I understand that in the case of a naive memcpy where each byte is copied in a loop, this will be trouble. But I understand that the compiler will use a better method (vector instructions etc.) on these platforms. – Syam May 22 '21 at 09:24
  • A mutex, cumbersome or otherwise, is not a solution in any case. You cannot (or should not) take a mutex in an ISR. If you have more code of this nature, you have multiple thread/interrupt safety issues. – Clifford May 22 '21 at 09:45
  • @Syam Not the compiler, the library implementation (and perhaps the compiler that built the library). You are right it will probably work; but it is not a good idea to trust to "probably". – Clifford May 22 '21 at 09:49
  • @Syam: Yes, you said you're using GCC on Linux, so memcpy will go in chunks of 32 bytes for this large copy. I think even `rep movsb` microcode won't do any byte copies, especially for a size that's a multiple of 32. But you haven't addressed how you'd stop the compiler from doing unsafe optimizations on this data-race UB. That's as big (or a bigger) concern. – Peter Cordes May 22 '21 at 09:55
  • Also, note that interrupts are only a minor concern, unless you're running in a single-core VM. You normally have multiple threads running simultaneously on different cores, so even a single asm instruction isn't necessarily atomic from the PoV of other cores, even though it is wrt. interrupts (and context switches) on the same core. – Peter Cordes May 22 '21 at 09:58

2 Answers

1

It would depend on the implementation of memcpy() - there are no guarantees. Even if you know that a particular implementation makes this safe, it would be unwise to rely on it remaining so across all the versions and platforms on which this code or pattern may get reused.

You might implement your own word-by-word 16-bit copy, using a word copy operation that you know to be atomic. How to do that warrants a new question.
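
As a rough sketch of the idea (not a drop-in replacement; the details are worth their own question), GCC's `__atomic` builtins could be used to make each 16-bit element copy individually atomic, at the cost of preventing vectorization:

#include <stdint.h>
#include <stddef.h>

/* Sketch only: each 16-bit load/store is atomic, so a reader never sees
 * a torn element. The copy as a whole is still not atomic: readers can
 * observe a mix of old and new elements. */
static void copy_u16_per_element(uint16_t *dst, const uint16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        __atomic_store_n(&dst[i], __atomic_load_n(&src[i], __ATOMIC_RELAXED),
                         __ATOMIC_RELAXED);
}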

Clifford
  • I understand the portability issues, but here I am not looking for portability. I am always going to run on relatively modern processors (i7 or newer). So the question is: can I get away with this on these platforms, even in the presence of compiler optimizations? – Syam May 22 '21 at 09:06
  • @Syam it is not a matter of "modern processors", but rather library implementation. The semantics of memcpy do not require atomicity at any word width. You might get lucky, but do you really want to trust to luck? It is naive to think that your code will not be reused, copied, or learned from, or that dangerous "habits" won't be acquired that are not safe in all your work. Just write your own "safe" copy routine that meets your required semantics. It is really not that hard. – Clifford May 22 '21 at 09:42
  • I fully agree with you. I also cringe at such 'works-but-conditions-apply' solutions. This is not my design or code. We were wondering if we can get away with not using locks for this particular hardware/OS/compiler combination. – Syam May 22 '21 at 09:48
  • Well, I suggest this is a valid answer to your question, which is clearly an XY problem. You might do better to ask about a solution to your problem rather than about problems with your solution. – Clifford May 22 '21 at 09:51
  • Thanks for your wise words. I agree with you 100%. But this is not a case of an XY problem. This is an actual running application that we have. This issue of concurrent access came up, and it is being debated whether we can get away without a lock in this particular environment. – Syam May 22 '21 at 11:29
  • @Syam : How is that not an X-Y problem? You have asked about the properties of `memcpy()`; the _answer_ to that question ("no") does not solve your problem - you are no further toward a solution by knowing the answer - so it is not the question you should be asking. "Can you get away with it" is a matter of opinion and attitude to risk. I would not dare judge - what if this were a safety-critical system and my _opinion_ killed someone? If the C library is open source, you could take a look, but you need to be really sure that exact library will be used on all platforms/builds. – Clifford May 22 '21 at 11:46
1

Interrupts aren't really relevant unless you're running this on a single-core VM. On a normal system with a multi-core CPU, two threads can be running simultaneously on separate cores. This is why we have C++ std::atomic<> and C _Atomic, which are useful for single variables like int.
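
For example, if the `data_ready` flag from the question is itself shared between threads, it is exactly the kind of single variable those types are for. A minimal C11 sketch (the `publish`/`is_published` names are made up here): a release store on the writer side pairs with an acquire load on the reader side, which on x86 costs nothing beyond limiting compile-time reordering.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

extern uint16_t global_buffer[128];

/* Sketch only: publishing the buffer once behind an atomic flag.
 * This does not by itself make repeated re-publication safe; for that
 * you'd want a seqlock or a real lock, as discussed in the comments. */
static atomic_bool buffer_published = false;

void publish(const uint16_t *local_buffer)      /* writer side (thread 1) */
{
    memcpy(global_buffer, local_buffer, sizeof global_buffer);
    atomic_store_explicit(&buffer_published, true, memory_order_release);
}

bool is_published(void)                         /* reader side */
{
    return atomic_load_explicit(&buffer_published, memory_order_acquire);
}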


It depends on your memcpy implementation. Any non-terrible one won't do any single-byte copies, and all the 16-bit loads/stores will actually be part of larger loads/stores (or possibly the internals of `rep movsb` microcode). It's hard to imagine how a sensible compiler (not a DeathStation 9000) would ever choose to inline a copy that could introduce tearing across a uint16_t boundary.

But if you don't do the copy manually (e.g. with AVX intrinsics), it is barely possible that some weird optimization could get the compiler to do a byte load/store.

For a SIMD implementation like a normal library will use for small sizes, it comes down to [Per-element atomicity of vector load/store and gather/scatter?](https://stackoverflow.com/q/46012574) - annoyingly, there's no formal guarantee from either major x86 vendor (AMD or Intel). It's almost certainly safe in practice, though, especially if the entire vector is itself aligned (so no cache-line splits or page splits). Using `alignas(64) uint16_t global_buffer[128];` would be a good way to ensure that.
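
If you did want to do the copy manually with intrinsics (so you control exactly which loads/stores happen), a minimal sketch for this 256-byte buffer might look like the following. It assumes AVX is available and both buffers are at least 32-byte aligned, and it still relies on the same in-practice (not formally documented) per-element atomicity of aligned 32-byte vector accesses:

#include <immintrin.h>
#include <stdalign.h>
#include <stdint.h>

alignas(64) uint16_t global_buffer[128];   /* 256 bytes, 64-byte aligned */

/* Sketch: copy 128 uint16_t as eight aligned 32-byte vector load/store pairs.
 * With both pointers 32-byte aligned there are no cache-line or page splits,
 * so each 16-bit element is read and written by exactly one vector access. */
static void copy_buffer_avx(uint16_t *dst, const uint16_t *src)
{
    for (int i = 0; i < 128; i += 16) {                 /* 16 elements = 32 bytes */
        __m256i v = _mm256_load_si256((const __m256i *)(src + i));
        _mm256_store_si256((__m256i *)(dst + i), v);
    }
}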

If your total copy size weren't a multiple of the vector width, overlapping copies still wouldn't introduce tearing within one uint16_t. For example, copying the first 8 uint16_t and the last 8 uint16_t handles copy sizes from 8 (full overlap) to 16 (no overlap) array elements.

And BTW, that's basically what glibc memcpy does for small copies: a 4- to 7-byte memcpy is done with two 4-byte loads and two 4-byte stores, and 32..63 bytes is done with two 32-byte vectors. (Two fully-overlapping halves avoid store-forwarding stalls when the data is read later, vs. two non-overlapping halves. The upper end might actually let it go up to 64 bytes with a pair of full-size AVX vectors.)
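
To illustrate the overlapping-halves strategy (a sketch of the idea only, not glibc's actual code):

#include <immintrin.h>
#include <stddef.h>

/* Sketch: copy n bytes (32 <= n <= 64) with two possibly-overlapping
 * 32-byte vectors: one covering the first 32 bytes, one covering the last 32. */
static void copy_32_to_64(void *dst, const void *src, size_t n)
{
    __m256i first = _mm256_loadu_si256((const __m256i *)src);
    __m256i last  = _mm256_loadu_si256((const __m256i *)((const char *)src + n - 32));
    _mm256_storeu_si256((__m256i *)dst, first);
    _mm256_storeu_si256((__m256i *)((char *)dst + n - 32), last);
}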

Peter Cordes
  • re: "2 fully-overlapping avoids store-forwarding stalls when reading later": reading later when? Like after memcpy? If so, isn't that dependent on the actual read done by the user (whether it's aligned relative to the 32-byte store)? Not really sure what you were getting at with that. (Also, 32...63 -> 32...64, except maybe KNL. The range [2^N, 2^(N + 1) - 1] is only for sizes [0, VEC_SIZE - 1], with VEC usually being ymm; for [VEC_SIZE, 8 * VEC_SIZE], it's [2^N + 1, 2^(N + 1)].) – Noah Jul 01 '21 at 20:52
  • @Noah: yes, reading after the memcpy. An 8-byte memcpy done in two 4-byte halves instead of with two redundant 8-byte stores is more likely to result in a SF stall if it's a `double` that gets loaded normally soon after. Of course, memcpy itself might hit a SF stall if it was two ints that were originally stored separately... – Peter Cordes Jul 01 '21 at 21:30