Is it possible to read a value from memory being written by another thread, so that it's neither the original nor final?

Question

Suppose we have a variable in memory, which is constantly being updated by a thread of execution by doing something like MOV into it with alternating values (single update is done by a single instruction). Is it possible that another thread, reading this same variable, will get neither of the alternative values, but something different — maybe half-one, half-alternative or whatever?

If it's possible, how can I reproduce such a situation? Can misalignment or some other special placement of the variable help achieve this?

If it's impossible on modern CPUs, was it possible on some older ones?

If you store a word (i.e. 16-bit) and read a dword (i.e. 32-bit) that contains the previous store then the store buffer will serve the updated 16 bits and the cache will serve the old value (before the updated one is written to mem). This is explained in [this answer](https://stackoverflow.com/questions/35830641/can-x86-reorder-a-narrow-store-with-a-wider-load-that-fully-contains-it) that is based on [this mail thread](http://yarchive.net/comp/linux/store_buffer.html). — Margaret Bloom, Aug 25 '17 at 14:33

Peter Cordes · Accepted Answer · 2017-08-25T18:14:53.317

Split the variable across a cache-line boundary. Then neither loads nor stores will be atomic, and you will get tearing in practice on all real CPUs.

e.g. in NASM syntax:

```nasm
section .bss
alignb 64        ; alignb pads with resb, which is what a nobits section like .bss expects
       resb 63   ; reserve 63 bytes
myvar: resd 1    ; reserve 1 dword (32 bits)
```
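For readers working in C++ rather than assembly, the same layout can be sketched as follows. This is only an illustration: the names `kLineSize`, `buffer`, `split_var` and `straddles` are made up here, and the 64-byte line size is an assumption carried over from the NASM snippet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes matches the NASM example.
constexpr std::size_t kLineSize = 64;

// Two lines' worth of storage, aligned so that offset 64 is a line boundary.
alignas(kLineSize) unsigned char buffer[2 * kLineSize];

// Mirrors "resb 63" followed by "myvar: resd 1": a 4-byte slot at offset 63,
// occupying bytes 63..66 of the buffer.
unsigned char* const split_var = buffer + 63;

// True if the object [addr, addr + size) has bytes on both sides of a
// multiple of `boundary` (which must be a power of two).
bool straddles(std::uintptr_t addr, std::size_t size, std::size_t boundary) {
    assert(size > 0 && (boundary & (boundary - 1)) == 0);
    return (addr / boundary) != ((addr + size - 1) / boundary);
}
```

Calling `straddles(reinterpret_cast<std::uintptr_t>(split_var), 4, kLineSize)` confirms the slot really is split. The same helper works for page boundaries if you pass 4096 instead.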

For a test program that demonstrates this in practice, see the example in "SSE instructions: which CPUs can do atomic 16B memory operations?".
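A rough sketch of such a test in C++ might look like the following. It is not the program from the linked question. The names `line_pair`, `torn_var`, `store32`, `load32` and `count_torn_reads` are hypothetical, and it assumes a GCC- or Clang-style toolchain on x86 so that each access is a single `mov` via inline asm. On other targets it falls back to `memcpy`, which is formally a data race and may not even be a single access.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <thread>

// 128 bytes aligned to 64 (assumed line size); the dword lives at bytes 63..66.
alignas(64) unsigned char line_pair[128];
unsigned char* const torn_var = line_pair + 63;

// One 32-bit store. On x86 with GNU-style inline asm this is a single mov.
inline void store32(unsigned char* p, std::uint32_t v) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    asm volatile("movl %1, (%0)" : : "r"(p), "r"(v) : "memory");
#else
    std::memcpy(p, &v, sizeof v);                        // fallback, not guaranteed to be one access
    std::atomic_signal_fence(std::memory_order_seq_cst); // compiler barrier only
#endif
}

// One 32-bit load, same assumptions as store32.
inline std::uint32_t load32(const unsigned char* p) {
    std::uint32_t v;
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    asm volatile("movl (%1), %0" : "=r"(v) : "r"(p) : "memory");
#else
    std::atomic_signal_fence(std::memory_order_seq_cst);
    std::memcpy(&v, p, sizeof v);
#endif
    return v;
}

// A writer thread alternates between two values; the calling thread performs
// `reads` loads and counts how many saw neither value.
std::uint64_t count_torn_reads(std::uint64_t reads) {
    constexpr std::uint32_t kA = 0x00000000u, kB = 0xFFFFFFFFu;
    // Sanity check: the 4-byte variable must cross the 64-byte boundary.
    assert(reinterpret_cast<std::uintptr_t>(torn_var) % 64 + 4 > 64);

    std::atomic<bool> stop{false};
    store32(torn_var, kA);

    std::thread writer([&] {
        std::uint32_t v = kA;
        while (!stop.load(std::memory_order_relaxed)) {
            store32(torn_var, v);
            v = ~v;  // flip between kA and kB
        }
    });

    std::uint64_t torn = 0;
    for (std::uint64_t i = 0; i < reads; ++i) {
        std::uint32_t v = load32(torn_var);
        if (v != kA && v != kB) ++torn;
    }
    stop.store(true, std::memory_order_relaxed);
    writer.join();
    return torn;
}
```

Build with something like `-O2 -pthread` and call e.g. `count_torn_reads(10'000'000)`. The count depends on the hardware and on scheduling, so treat a nonzero result as evidence of tearing rather than expecting any particular number. If both threads happen to share one core, it can legitimately be zero.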

Also, 80-bit x87 `long double` is non-atomic on some hardware. On Intel Sandybridge-family, for example, 80-bit x87 `fld` / `fstp` decode to 2 separate load or store uops (plus some ALU uops), so the 64-bit part and the 16-bit part are probably separate cache accesses. You could therefore get tearing for a `long double` at any alignment, even on CPUs where a 16-byte SSE `movaps [mem], xmm0` is atomic.

No Intel or AMD x86 manuals ever guarantee atomicity of anything wider than 64 bits (except for `lock cmpxchg16b`), so this talk of SSE vector loads/stores being atomic on some CPUs isn't something you can reliably take advantage of, or detect when it's supported. On some hardware, though (probably Intel Haswell/Skylake, at least single-socket), even 32-byte YMM loads/stores will be atomic if they don't cross a cache-line boundary.

See "Why is integer assignment on a naturally aligned variable atomic?" for the rules. Violate any of them and you can see tearing on some CPUs.

But for guaranteed non-atomicity on all SMP systems, crossing a 64B boundary will always work. Technically you should check CPUID to find the cache-line size in case it's larger, but 64B has been standard on x86 ever since the last CPUs with 32B lines (Pentium III).
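If you do want to query the line size rather than hard-code 64, one option is CPUID leaf 1, which reports the CLFLUSH line size in EBX bits 15:8 (in units of 8 bytes). The sketch below assumes GCC or Clang, whose `<cpuid.h>` provides `__get_cpuid`, and treats the CLFLUSH line size as the cache-line size, which holds in practice but is an assumption. `cache_line_size` and `split_offset` are illustrative names, and the fallback of 64 is a guess.

```cpp
#include <cassert>
#include <cstddef>
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>  // GCC/Clang helper header for the CPUID instruction
#endif

// Returns the CLFLUSH line size from CPUID leaf 1, or 64 as a fallback guess
// when CPUID is unavailable or does not report it.
std::size_t cache_line_size() {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        const bool has_clflush = (edx & (1u << 19)) != 0;  // CLFSH feature flag
        const std::size_t bytes = ((ebx >> 8) & 0xFFu) * 8u;
        if (has_clflush && bytes != 0) return bytes;
    }
#endif
    return 64;
}

// Offset, within a buffer aligned to `line` bytes, at which an object of
// `size` bytes has its first byte in one line and the rest in the next.
std::size_t split_offset(std::size_t line, std::size_t size) {
    assert(line >= 2 && size >= 2);
    return line - 1;
}
```

With a buffer of at least two lines, aligned to `cache_line_size()`, placing the variable at `split_offset(cache_line_size(), 4)` reproduces the `resb 63` trick for whatever line size the CPU reports.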

Super-guaranteed to definitely always work (except on a CPU design that's fundamentally different from current ones): split the variable across a 1GiB boundary, because that's the largest hugepage size. Even a 4k split within a 2MB hugepage counts as a page split and needs two TLB checks to find out that both halves are in the same hugepage, with the associated performance penalties on current hardware. And of course any 4k split is also a cache-line split.

The one exception to all of this is uniprocessor machines, because a context switch can't happen in the middle of an instruction: a `mov` store or load either happens before an interrupt or it doesn't. (Uniprocessor makes even a read-modify-write like `add [mem], 1` atomic with respect to other threads, although not with respect to DMA or MMIO observers. See supercat's answer on "Can num++ be atomic for 'int num'?".)

Yeah, it [does indeed work](http://coliru.stacked-crooked.com/a/596e73b584523ec1), thanks. — Ruslan, Aug 26 '17 at 10:04
