Behavior of x86 with concurrent atomic and non-atomic reads and writes

Question

What is the behavior of x86 for something like this?

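#include <atomic>
#include <cstdint>
#include <thread>

template <typename Integer>
void write(Integer&);
template <typename Integer>
void read(Integer);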

auto integer = std::uint64_t{0};
auto atomic = reinterpret_cast<std::atomic<std::uint32_t>*>(&integer);

auto one = std::thread{[&]() {
    write(integer);
}};
auto two = std::thread{[&]() {
    read(atomic.load());
}};

A 32-bit atomic load on x86 compiles to the following (notably, the generated assembly does not include the lock prefix):

mov     edi, DWORD PTR [rdi]

(Assume little-endianness, which places the atomic integer over the least significant 32 bits of the 64-bit integer.)
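The little-endian assumption can be checked with a small sketch like the one below. The helper names `is_little_endian` and `first_four_bytes` are hypothetical and not part of the code in question; `std::memcpy` is used so that the check itself does not depend on any pointer aliasing.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: true when the byte at the lowest address of a
// multi-byte integer is its least significant byte.
inline bool is_little_endian() {
    const std::uint32_t probe = 1;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

// Hypothetical helper: copies the 4 bytes at the lowest address of `value`
// into a 32-bit integer. This is the same region of memory that a 32-bit
// load at the address of the 64-bit object would read.
inline std::uint32_t first_four_bytes(std::uint64_t value) {
    std::uint32_t out = 0;
    std::memcpy(&out, &value, sizeof out);
    return out;
}
```

On a little-endian target such as x86, `first_four_bytes(0x1122334455667788)` yields the low half, `0x55667788`; on a big-endian target it would yield the high half instead.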

Two related questions:

1. What behavior does x86 specify for such a scenario?
2. What would the write() function need to do to break the above code? The read or write inside write() can be of any width, not necessarily 64 bits, provided there is enough padding for the object to occupy 64 bits. write() can also execute an atomic operation.

Context: The place I work at has code that does this, and nothing seems to be breaking at the moment. I am fairly confident that this is bad code and can break in some situation, but I do not know much about x86 or hardware in general. I would like to get to the bottom of this behavior, figure out whether it is a problem, and potentially fix it. A strong reference from the x86 documentation (ARM would do too) would be great.
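For comparison, one way to express the same intent without a data race in ISO C++ would be to make the whole 64-bit object atomic and truncate after the load. This is only a sketch: `SharedValue`, `write` and `read_low` are hypothetical names rather than the code in question, and it assumes that 64-bit atomics are lock-free on the target, which holds on x86-64 but is not guaranteed everywhere.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical wrapper: both the writer and the reader go through the same
// std::atomic object, so there is no mixed-width or non-atomic access.
struct SharedValue {
    std::atomic<std::uint64_t> bits{0};

    // Writer side: a single atomic 64-bit store.
    void write(std::uint64_t value) {
        bits.store(value, std::memory_order_release);
    }

    // Reader side: an atomic 64-bit load, then keep only the low 32 bits.
    std::uint32_t read_low() const {
        return static_cast<std::uint32_t>(bits.load(std::memory_order_acquire));
    }
};
```

The reader loads more than it needs, which is the trade-off of this approach: the full 64 bits must be atomic even though only 32 are consumed.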

Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/178104/discussion-on-question-by-curious-behavior-of-x86-with-concurrent-atomic-and-non). — Bhargav Rao, Aug 16 '18 at 01:42
@BhargavRao it is however relevant to understanding why the question has a downvote. And also has my response to the downvote. Why move it to chat? — Curious, Aug 16 '18 at 01:43
Comments are meant to be used to request clarifications or suggest improvements to the post, and the poster of the question needs to update their post in accordance with the comments. They are not for extended conversations. As the conversation continued for a long period, it has been moved to chat, where you can continue the discussion. — Bhargav Rao, Aug 16 '18 at 01:45
This would be undefined behavior, if it even compiled, which it won't because you are using `.` on a pointer. — Ben Voigt, Aug 16 '18 at 20:29
The TL:DR here is that there's no problem in asm: pure loads/stores are naturally atomic. The main problem is reliably / portably getting a C++ compiler to emit asm that accesses the same memory location with two different widths, with at least the bottom 32 bits of the store being atomic. Using a union with a lockless `atomic` isn't good on 32-bit ARM because we don't need the whole 64 bits to be atomic, only as wide as we load. (Not to mention that union type-punning is a GNU extension in C++, unlike C99). A union with `volatile` might work for gcc. https://godbolt.org/g/heE6UE — Peter Cordes, Aug 16 '18 at 21:32
But that would be using pre-C++11 lockless programming techniques that depend on compiler-specific behaviour that ISO C++ doesn't define. (e.g. that you can use `volatile` or other tricks to get the asm you want for multi-threaded programming, if you know which store widths are atomic on your target platforms. e.g. [Why is integer assignment on a naturally aligned variable atomic on x86?](https://stackoverflow.com/q/36624881). I have a half-written answer I can post if this gets reopened. (For now, your `atomic` pointer should be `static const` so it can optimize away.) — Peter Cordes, Aug 16 '18 at 21:34
@PeterCordes Thanks for the comment. One followup - is there any sequence of operations here in pure asm (atomic RMW's are candidates here too) that can make this code behave badly? — Curious, Aug 16 '18 at 23:54
@Curious: your C++ source is already badly behaved. But answering what I think you mean; yes, the compiler is free to compile `integer = a | (b<<16) | (c<<32)` into 3 separate 16-bit stores (if a, b, and c are 16-bit integers), so the result asm will not atomically store `a | (b<<16)` into the low half. `volatile` should probably help with that in compilers like gcc, probably making the compiler implement the assignment with a single access. — Peter Cordes, Aug 16 '18 at 23:57
@PeterCordes The question seems to have reopened, want to formalize your comments as a proper answer? — Curious, Aug 18 '18 at 03:03
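The store-splitting concern raised in the comments can be illustrated with a sketch like the following. The names `pack`, `store_plain` and `store_volatile` are hypothetical. The casts to 64 bits are needed because a 16-bit operand is promoted to `int`, and shifting a 32-bit `int` by 32 is undefined. The `volatile` variant reflects the suggestion that compilers like gcc will probably use a single access; ISO C++ does not guarantee that, so treat it as compiler-specific behavior.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: packs three 16-bit fields into one 64-bit value.
inline std::uint64_t pack(std::uint16_t a, std::uint16_t b, std::uint16_t c) {
    return static_cast<std::uint64_t>(a)
         | (static_cast<std::uint64_t>(b) << 16)
         | (static_cast<std::uint64_t>(c) << 32);
}

// Plain assignment: the compiler is free to implement this as several
// narrower stores, so a concurrent 32-bit reader could observe a mix of
// old and new fields in the low half.
inline void store_plain(std::uint64_t& target,
                        std::uint16_t a, std::uint16_t b, std::uint16_t c) {
    target = pack(a, b, c);
}

// Assignment through a volatile lvalue: compilers such as gcc typically
// emit one access of the full width here, but this is not something the
// C++ standard promises for multi-threaded code.
inline void store_volatile(std::uint64_t& target,
                           std::uint16_t a, std::uint16_t b, std::uint16_t c) {
    *static_cast<volatile std::uint64_t*>(&target) = pack(a, b, c);
}
```

Both functions produce the same final value in a single thread; the difference is only in which store instructions the compiler is allowed to emit, which can be inspected in the generated assembly.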
