struct Data {
    double a;
    double b;
    double c;
};

Would the read of each double be sane, if read on a different thread, but only one other thread is writing to each of a,b,c?

What is the scenario if I ensure Data is aligned?

struct Data { double a, b, c; } __attribute__((aligned(64)));

This would place a, b and c at offsets 0, 8 and 16 from a 64-byte-aligned base address, so each double always sits on an 8-byte (64-bit) boundary.
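
For reference, a quick compile-time check of that layout (assuming GNU C++ on an x86-64 target, where sizeof(double) == 8):

#include <cstddef>

struct Data { double a, b, c; } __attribute__((aligned(64)));

static_assert(alignof(Data) == 64, "whole struct starts on a 64-byte boundary");
static_assert(offsetof(Data, a) == 0, "a at offset 0");
static_assert(offsetof(Data, b) == 8, "b at offset 8");
static_assert(offsetof(Data, c) == 16, "c at offset 16");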

This question about alignment requirements for atomic x86 instructions and its answer make me think that it's perfectly valid to write to Data::a/b/c from one thread and simultaneously read them from another without using std::atomic.

Yes, I know std::atomic would solve this, but that's not the question.

  • It would be safe IF there was a guarantee that the compiler would use those atomic x86 instructions. Meeting the alignment requirements does not mean the compiler will use those instructions. Providing such a guarantee would mean lost optimisation opportunities, so I'd suggest it is unlikely to be the default. So, unless you write your code to ensure thread safety (e.g. using `std::atomic`) there is no guarantee that a read would be "sane" if a write of the same variable occurs on another thread. – Peter Nov 27 '19 at 05:51
  • So what actually *is* the question, it mixes machine level semantics and C++ in an odd way, and the answer for each is opposite, because C++ likes to take away guarantees that the hardware gave you. – harold Nov 27 '19 at 05:58
  • @harold I am on specific hardware, an Intel i7, in this case... and I was wondering if I could simply use a struct of aligned doubles and read and write from/to them without atomics, since my requirement is only a sane read for aligned 8 bytes. – themagicalyang Nov 27 '19 at 06:01
  • @themagicalyang in assembly you can do that. In C++, well maybe, but the compiler can ruin things. Overly hypothetical things aside, it may move a store out of a loop and do it only once at the end, or move a load out of a loop and do it only once at the start - then there are no torn loads or anything like that but your threads wouldn't be communicating. – harold Nov 27 '19 at 06:13

1 Answer


Yes, aligned 8-byte loads/stores are guaranteed atomic by the x86 ISA, since P5 Pentium. Why is integer assignment on a naturally aligned variable atomic on x86?

But this is C++; there's no guarantee that stores and reloads aren't optimized away. Write in one thread and read in another is C++ Undefined Behaviour; compilers are allowed to assume it doesn't happen, breaking naive assumptions. This lets them keep C++ objects in registers across multiple reads/writes, only eventually storing the final value. (including global variables or memory pointed-to by some pointer.)
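
A minimal sketch of the kind of code this breaks (hypothetical names; no volatile or atomic anywhere):

struct Data { double a, b, c; };

// Reader thread: nothing here forces a reload of data.a on each iteration,
// so the compiler may load it once into a register and spin forever, even
// though another thread keeps storing new values to it.
void wait_for_change(const Data& data, double old_value) {
    while (data.a == old_value) {
        // empty spin-wait
    }
}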

Since you didn't already know that volatile or atomic<double> are needed for this reason, better read about the other things that atomic<> does for you, like ordering wrt. other operations unless you use memory_order_relaxed (the default is seq_cst which makes stores expensive, but on x86 loads are still just as cheap). And (like volatile) the assumption that other threads might have modified an object between accesses in this thread. See Can num++ be atomic for 'int num'?, some of which is relevant for FP loads and stores.

Lockless programming in C++ is not simple, unless you have zero need for synchronization / ordering. Then you "just" have to make sure you tell the compiler what you mean, with atomic<T>, or as a hack with volatile double.
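
A sketch of the atomic<T> route with relaxed ordering (member and function names are made up; the point is just the load/store calls):

#include <atomic>

struct Data {
    std::atomic<double> a{0.0}, b{0.0}, c{0.0};
};

// Writer thread: a relaxed store is a single 8-byte store on x86
// (though, as noted below, GCC doesn't always pick the nicest asm for it).
void publish(Data& d, double x) {
    d.a.store(x, std::memory_order_relaxed);
}

// Reader thread: a relaxed load is just a plain 8-byte load on x86.
double sample(const Data& d) {
    return d.a.load(std::memory_order_relaxed);
}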


Since GCC's std::atomic<double> with mo_relaxed doesn't compile efficiently, you might want to roll your own by making members volatile, if you don't need strict ISO C++ portability (or even casting to (volatile double*) like the Linux kernel's READ_ONCE / WRITE_ONCE macros). With clang you can just use atomic<double> with memory_order_relaxed and things will compile efficiently. See C++20 std::atomic<float> / std::atomic<double> specializations for an example of what you can do before C++20; C++20 only adds atomic RMW add/sub for double so you don't have to roll your own with a CAS loop.
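
A sketch of that volatile-cast hack for scalar types like double (well-defined only in the GNU C / x86 sense described here; it's still a data race as far as ISO C++ is concerned):

// In the spirit of the Linux kernel's READ_ONCE / WRITE_ONCE: the volatile
// access forces a real load/store each time, which x86 makes atomic for
// aligned 8-byte objects.
template <typename T>
T read_once(const T& obj) {
    return *static_cast<const volatile T*>(&obj);
}

template <typename T>
void write_once(T& obj, T value) {
    *static_cast<volatile T*>(&obj) = value;
}

// Usage (hypothetical): double x = read_once(shared.a);
//                       write_once(shared.a, 3.14);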

volatile will probably still defeat auto-vectorization, but you can of course use _mm_load_pd or whatever. (See also Atomic double floating point or SSE/AVX vector load/store on x86_64 - note that SIMD load/store aren't necessarily atomic even if aligned. Also undocumented is whether they're per-element atomic, although that is I think safe to assume. Per-element atomicity of vector load/store and gather/scatter?)
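
For example (a sketch, assuming p is 16-byte aligned): the intrinsic gives you the pair load, but not 16-byte atomicity.

#include <immintrin.h>

// Loads two adjacent doubles in one instruction.  The 16-byte load as a
// whole is not guaranteed atomic, even when aligned.
__m128d load_pair(const double* p) {
    return _mm_load_pd(p);
}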

When to use volatile with multi threading? Normally never, except maybe as a workaround for GCC which won't emit efficient asm for atomic<double>, and where we know exactly how volatile compiles to asm.


BTW, you only need alignas(8) to make sure members are 8-byte aligned. Aligning the struct to a whole cache line doesn't hurt, unless it wastes space.
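
In other words, something like this already gets you the atomicity-relevant alignment (alignas(8) is redundant on x86-64, where alignof(double) is already 8, but it documents the requirement):

struct Data {
    alignas(8) double a;
    alignas(8) double b;
    alignas(8) double c;
};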

For performance: if different threads are using different variables in the same cache line, that's "false sharing" and terrible for performance. Don't group your shared variables together in one struct unless they're usually read or written as a group. Otherwise you definitely want them in separate 64-byte cache lines.
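
A sketch of that split, assuming 64-byte lines (C++17's std::hardware_destructive_interference_size from <new> can replace the hard-coded 64):

// Hypothetical layout for the case where each double is written by a
// different thread: padding each one out to its own cache line avoids
// false sharing.
struct PaddedDouble {
    alignas(64) double value;   // sizeof(PaddedDouble) == 64
};

struct SharedDoubles {
    PaddedDouble a, b, c;       // each member starts its own cache line
};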


Note that a data-race on a volatile is still ISO C++ undefined-behaviour, but if you're using GNU C (as required by your __attribute__), it's pretty much well defined. The Linux kernel uses it for its own hand-rolled atomic (along with inline asm for barriers) so you can assume it's not going to be intentionally unsupported any time soon.

TL:DR: in GNU C it more or less works to think of volatile as atomic with mo_relaxed, for aligned objects small enough to be naturally atomic.

  • I'd add that if different threads are accessing different members, false sharing is something to consider. – AProgrammer Nov 27 '19 at 06:27
  • @AProgrammer False sharing is only when different threads write right? If only one thread writes and others read then there is no issue. – themagicalyang Nov 27 '19 at 06:55
  • @themagicalyang: Not *quite*. If thread A is chugging along, re-reading `double a` every iteration in case it changed, that can hit in cache. Thread B storing to `double b` in the same line as `a` will invalidate that line in all other cores, interfering with thread A even if thread A never reads `b`. – Peter Cordes Nov 27 '19 at 06:57