Interrupts aren't really relevant unless you're running this on a single-core VM. On a normal system with a multi-core CPU, two threads can be running simultaneously on separate cores. This is why we have C++ `std::atomic<>` and C `_Atomic`, which are useful for single variables like `int`.
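To illustrate the point about single variables, here's a minimal sketch (the function name `atomic_count_to` is made up for this demo): two threads incrementing a shared `std::atomic<int>` counter. Each `fetch_add` is indivisible, so the final count is exact; replacing the atomic with a plain `int` here would be a data race with an unpredictable result.

```cpp
#include <atomic>
#include <thread>

// Hypothetical demo: two threads each add `per_thread` increments to a
// shared counter. std::atomic makes every increment indivisible.
int atomic_count_to(int per_thread) {
    std::atomic<int> counter{0};
    auto work = [&] {
        for (int i = 0; i < per_thread; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
    return counter.load();
}
```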
It depends on your memcpy implementation. Any non-terrible one won't do any single-byte copies, and all the 16-bit loads/stores will actually be part of larger loads/stores (or possibly handled inside `rep movsb` microcode). It's hard to imagine how a sensible compiler (not a DeathStation 9000) would ever choose to inline a copy that could introduce tearing across a `uint16_t` boundary. But if you don't do it manually (e.g. with AVX intrinsics), it is barely possible some weird optimization could get a compiler to emit a byte load/store.
For a SIMD implementation like a normal library will use for small sizes, it comes down to Per-element atomicity of vector load/store and gather/scatter? - annoyingly there's no formal guarantee from either major x86 vendor (AMD or Intel). It's almost certainly safe in practice, though, especially if the entire vector is itself aligned (so no cache-line splits or page splits). Declaring the buffer as `alignas(64) uint16_t global_buffer[128];` is a good way to ensure that.
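A quick sketch of that declaration, with a helper (`is_aligned_64` is my name, not a standard one) that verifies the buffer really starts on a cache-line boundary, so no vector load/store up to 64 bytes inside it can split across lines:

```cpp
#include <cstdint>

// alignas(64) puts the buffer on a cache-line boundary, so an aligned
// SIMD load/store of up to 64 bytes never straddles two cache lines
// (and therefore never straddles a page either).
alignas(64) uint16_t global_buffer[128];

// Helper (hypothetical name) to check 64-byte alignment at runtime.
bool is_aligned_64(const void* p) {
    return reinterpret_cast<uintptr_t>(p) % 64 == 0;
}
```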
If your total copy size isn't a multiple of the vector width, overlapping copies still won't introduce tearing within one `uint16_t`. For example, copying the first 8 `uint16_t` and the last 8 `uint16_t` handles copy sizes from 8 (full overlap) to 16 (no overlap) array elements.
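The index math of that overlapping trick can be sketched like this (`copy_8_to_16_elems` is a made-up name; a real implementation would use 16-byte vector loads/stores rather than two `memcpy` calls, which just stand in for them here). The chunks may overlap, but every element is written whole by at least one chunk, so no `uint16_t` is split across two operations:

```cpp
#include <cstring>
#include <cstdint>
#include <cstddef>

// Sketch: copy n elements (8 <= n <= 16) as two 8-element chunks,
// one anchored at the start and one at the end. For n < 16 the chunks
// overlap; for n == 16 they tile exactly. Each uint16_t lands entirely
// inside one chunk, so no element can be torn.
void copy_8_to_16_elems(uint16_t* dst, const uint16_t* src, size_t n) {
    std::memcpy(dst, src, 8 * sizeof(uint16_t));                  // first 8
    std::memcpy(dst + n - 8, src + n - 8, 8 * sizeof(uint16_t));  // last 8
}
```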
And BTW, that's basically what glibc `memcpy` does for small copies: a 4 to 7-byte memcpy is done with two 4-byte loads and two 4-byte stores, and 32..63 bytes is done with two 32-byte vectors. (Two fully-overlapping copies avoid store-forwarding stalls when reading the destination later, vs. two non-overlapping halves. The upper end might actually go up to 64 bytes with a pair of full-size AVX vectors.)
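The 4-to-7-byte case can be sketched in portable C++ like this (`copy_4_to_7` is a hypothetical name; this mimics the strategy rather than being glibc's actual code): load 4 bytes from the start and 4 bytes from the end into registers, then store both, overlapping when n < 8:

```cpp
#include <cstring>
#include <cstddef>
#include <cstdint>

// Sketch of the small-size strategy: for 4 <= n <= 7, do one 4-byte
// chunk anchored at the start and one anchored at the end. Loading both
// chunks before storing also makes the copy safe for overlapping
// src/dst ranges, like memmove.
void copy_4_to_7(char* dst, const char* src, size_t n) {
    uint32_t head, tail;
    std::memcpy(&head, src, 4);          // first 4 bytes
    std::memcpy(&tail, src + n - 4, 4);  // last 4 bytes (overlaps if n < 8)
    std::memcpy(dst, &head, 4);
    std::memcpy(dst + n - 4, &tail, 4);
}
```

(The fixed-size `memcpy` calls compile to single 4-byte loads and stores; there's no function call in the generated code.)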