Split the variable across a cache-line boundary. Then neither loads nor stores will be atomic, and you will get tearing in practice on all real CPUs.
For example, in NASM syntax:
section .bss
align 64
resb 63 ; reserve 63 bytes, so myvar starts at offset 63 of a 64-byte line
myvar: resd 1 ; reserve 1 dword (32 bits): bytes 63..66, spanning two cache lines
For a test program that demonstrates this in practice, see the example in "SSE instructions: which CPUs can do atomic 16B memory operations?".
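As a rough C mirror of the NASM layout above, the sketch below checks that a 4-byte variable at offset 63 really does straddle two lines. The names `split_layout`, `line_of` and `myvar_is_split` are made up for illustration, the 64-byte line size is an assumption, and `packed` / `aligned` are GCC/Clang extensions rather than standard C. It only checks the layout; it does not force the compiler to access `myvar` with a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed line size, not queried from the CPU */

/* 63 padding bytes, then a dword, like the resb 63 / resd 1 pair.
   packed stops the compiler from padding myvar up to offset 64. */
struct __attribute__((packed, aligned(CACHE_LINE))) split_layout {
    unsigned char pad[63];
    uint32_t myvar; /* occupies bytes 63..66 of the struct */
};

/* Index of the cache line that holds a given byte offset. */
static size_t line_of(size_t byte_offset)
{
    return byte_offset / CACHE_LINE;
}

/* Nonzero if the first and last byte of myvar are in different lines. */
static int myvar_is_split(void)
{
    size_t first = offsetof(struct split_layout, myvar);
    size_t last = first + sizeof(uint32_t) - 1;
    return line_of(first) != line_of(last);
}
```

With this layout `myvar_is_split()` is nonzero: byte 63 is in line 0 and bytes 64..66 are in line 1.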
Also, 80-bit x87 long double is non-atomic on some hardware. On Intel Sandybridge-family, for example, 80-bit x87 fld / fstp decode to 2 separate load or store uops (plus some ALU uops), so the 64-bit part and the 16-bit part are probably separate cache accesses, and you could get tearing for a long double with any alignment, even on CPUs where 16-byte SSE movaps [mem], xmm0 is atomic.
No Intel or AMD x86 manuals ever guarantee atomicity of anything wider than 64 bits (except for lock cmpxchg16b), so this talk of SSE vector loads/stores being atomic on some CPUs isn't something that you can reliably take advantage of or detect when it's supported. (Although on some hardware (like probably Intel Haswell/Skylake, at least single-socket) even 32-byte YMM loads/stores will be atomic if they don't cross a cache-line boundary.)
See Why is integer assignment on a naturally aligned variable atomic? for the rules. Violate any of them and you can see tearing on some CPUs.
But for guaranteed non-atomicity on all SMP systems, crossing a 64B boundary will always work (technically you should check CPUID to find out the cache-line size, in case it's larger, but 64B has been standard since the last 32B cache-line systems (Pentium III)).
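For the CPUID check mentioned above, here is a minimal sketch assuming GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid`. It reads the CLFLUSH line size from CPUID leaf 1 (EBX bits 15:8, in units of 8 bytes), which in practice matches the cache-line size. The helper names `clflush_line_bytes_from_ebx` and `query_cache_line_bytes` are hypothetical, and on other compilers or architectures the query just returns 0.

```c
#include <assert.h>
#include <stdint.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#endif

/* Decode CPUID.01H:EBX[15:8], which counts 8-byte units. */
static unsigned clflush_line_bytes_from_ebx(uint32_t ebx)
{
    return ((ebx >> 8) & 0xffu) * 8u;
}

/* Line size in bytes, or 0 if it can't be queried on this build target. */
static unsigned query_cache_line_bytes(void)
{
#ifdef HAVE_X86_CPUID
    unsigned eax, ebx, ecx, edx;
    /* EDX bit 19 (CLFSH) says whether the EBX field is valid. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & (1u << 19)))
        return clflush_line_bytes_from_ebx(ebx);
#endif
    return 0;
}
```

A test program could call `query_cache_line_bytes()` once at startup and fall back to 64 when it returns 0.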
Super-guaranteed to definitely always work (except on a CPU design that's fundamentally different from current ones): split a 1GiB boundary, because that's the largest hugepage size. (Even 4k splits within a 2MB hugepage count as a page-split and need two TLB checks to find out that they are both in the same hugepage, with the associated performance penalties on current hardware. And of course any 4k split is also a cache-line split).
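To make the nesting of those boundaries concrete, the sketch below classifies an access by the largest boundary it straddles. The boundary sizes are the ones named above (64B line, 4k page, 2MB and 1GiB hugepages); the function names `crosses` and `largest_boundary_crossed` are made up for illustration, and the addresses are plain integers, so nothing is actually mapped or accessed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Boundary sizes named in the text; all are powers of two. */
#define LINE_64B ((uint64_t)64)
#define PAGE_4K  ((uint64_t)4096)
#define PAGE_2M  ((uint64_t)2 << 20)
#define PAGE_1G  ((uint64_t)1 << 30)

/* Nonzero if the bytes [addr, addr+size) span a multiple of boundary. */
static int crosses(uint64_t addr, uint64_t size, uint64_t boundary)
{
    return size > 0 && (addr / boundary) != ((addr + size - 1) / boundary);
}

/* Largest listed boundary that the access straddles, or 0 if none. */
static uint64_t largest_boundary_crossed(uint64_t addr, uint64_t size)
{
    static const uint64_t sizes[] = { PAGE_1G, PAGE_2M, PAGE_4K, LINE_64B };
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
        if (crosses(addr, size, sizes[i]))
            return sizes[i];
    return 0;
}
```

For a 4-byte access, address 63 gives 64, address 4095 gives 4096, and an address 2 bytes below a 1GiB multiple gives 1GiB. Because each size is a multiple of the smaller ones, that last access also crosses a 2MB, a 4k and a cache-line boundary.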
The one exception to all of this is uniprocessor machines, because a context-switch can't happen in the middle of an instruction. A mov store or load either happens before an interrupt or it doesn't. (Uniprocessor makes even a read-modify-write like add [mem], 1 atomic with respect to other threads, although it's not with respect to DMA or MMIO observers. See supercat's answer on Can num++ be atomic for 'int num'?)