Are concurrent write operations into the same word transactional?

Question

If 0b00000000 and 0b11111111 are written simultaneously to the same memory address, can the result be something like 0b10110011, or will it always be either 0b00000000 or 0b11111111? Is there any difference between CPU and GPU execution? Does it depend on bit width, so that a 32-bit write on 16-bit hardware might end up as a mix of the two values, while a 16-bit write won't?

In general you're describing a race condition, but without way more details it is impossible to tell you the outcome — Aaron, Nov 23 '21 at 19:08
In general terms, on a CPU, memory transactions happen in the cache of a single core and are then written out to the higher levels of cache one line at a time (often 64 bytes or more at a time). If two cores write to the same "location" at the same time, there would basically be a cache collision when they both attempt to flush to L2 or L3. Resolving cache collisions is system dependent and complicated. — Aaron, Nov 23 '21 at 19:13
Most (all?) ISAs define naturally aligned word (GPR) stores as atomic. "Write tearing" may be a good search term for finding more information. I think Intel x86 does not guarantee atomicity of 128-bit and larger (SIMD register) stores even if aligned, even though some/all implementations do provide such atomicity. — , Nov 24 '21 at 01:19
[This](https://stackoverflow.com/questions/70048631) recent post should (partially) answer the question. Besides this, AFAIK all modern processors load/store byte-sized memory values atomically due to the DRAM (controllers). — Jérôme Richard, Nov 24 '21 at 13:36
@PaulA.Clayton: Intel *finally* got around to [documenting that the AVX feature bit implies 128-bit load/store atomicity](https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg280720.html) for aligned instructions, without having to use `lock cmpxchg16b` atomic RMW. So this retroactively enables use of `movaps` on older CPUs. (But unfortunately leaves Pentium/Celeron low-end models of recent microarchitectures like Skylake without anything to indicate 128-bit load/store atomicity.) Hopefully AMD will document the same guarantee, since it's probably true there as well. — Peter Cordes, Jun 23 '22 at 11:33
@PeterCordes: Didn't AMD have an edge case with multi-socket setups, where the granularity was only 64 bits? — MSalters, Jun 23 '22 at 11:41
@MSalters: Yes, but that was K10, no AVX. (The experimental test on [SSE instructions: which CPUs can do atomic 16B memory operations?](https://stackoverflow.com/a/7647825) was on a multi-socket Opteron 2435, showing tearing at 8-byte boundaries when data went between sockets over HyperTransport). I don't know if any Bulldozer-family multi-socket CPUs still used the same HyperTransport interconnect. Supporting `lock cmpxchg16b` efficiently would want some way to get 16-byte atomicity without a bus lock; IDK how K10 managed that. Early K8 lacked that instruction. — Peter Cordes, Jun 23 '22 at 11:49

bazza · Answer 1 · 2022-04-03T19:16:32.233

It won't come out as 0b10110011, or anything else random. To do so would imply that there are two data sources competing to drive some part of the microelectronics at the same time, which would mean one piece of electronics attempting to burn out the other. That's not a good idea.

So the microelectronics is nearly always designed so that access to a shared resource (like a bus) is arbitrated somewhere. With a classic bus, only one device at a time would be "output enabled", with all other devices being in a tri-state / high-impedance state and thus not fighting the active device. Thus two devices wanting to drive the bus at the same time would end up having their access serialised by the bus control logic, and so the final result would always be either 0b11111111 or 0b00000000.

In the case of a modern CPU it'll be a matter of the L3 control logic arbitrating which core goes first (there may even be some tri-state / high-Z logic inside the cache controller, for all I know).

I say "nearly always": there was a bug in the electronic design of one of Amstrad's 1980s computers (I forget which). With a few writes to registers on devices it was possible to get two devices driving the same bus at the same time. A short while later the machine would overheat and burn out (letting out smoke). A certain sort of kid would take delight in going into the electronics store, typing the necessary commands into the display model and leaving...

In short, the microelectronics of any modern CPU / GPU will prevent a random mix, but which of the two values ends up as the final result is anyone's guess.

Of course, if the value being written is wider than the system word width, that changes things. For instance, on a 4-bit system you could end up with 0b11110000 or 0b00001111. Nowadays system word widths tend to be as wide as or wider than any integer type, so that won't happen. But on a 32-bit system, if you were writing 0xffffffffffffffff from one core and 0x0000000000000000 from another, you could end up with 0xffffffff00000000 (or vice versa), because writing a 64-bit value would take two write cycles, which might in theory get interleaved between the two cores.
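
To make the word-width point concrete, here is a toy Python model (not real hardware, just a sketch). It assumes each store is split into word-sized bus cycles, that each writer issues its low chunk first, and that the arbiter lets exactly one cycle through at a time. The function name `final_values` and its parameters are made up for this illustration. It enumerates every possible interleaving and collects the values memory could end up holding.

```python
from itertools import permutations

def final_values(value_bits, word_bits, a, b):
    """Return every possible final memory value when two writers store
    `a` and `b` to the same address in word-sized bus cycles.

    Toy model: cycles are serialised by an arbiter, and each writer
    sends its chunks low-to-high (both are assumptions of this sketch).
    """
    n_chunks = -(-value_bits // word_bits)  # ceiling division
    mask = (1 << word_bits) - 1
    steps = ["A"] * n_chunks + ["B"] * n_chunks
    results = set()
    # Every distinct ordering of the bus cycles is one possible arbitration.
    for order in set(permutations(steps)):
        mem = 0
        next_chunk = {"A": 0, "B": 0}
        for who in order:
            i = next_chunk[who]
            next_chunk[who] += 1
            src = a if who == "A" else b
            shift = i * word_bits
            chunk = (src >> shift) & mask
            # Overwrite only the word-sized slice this cycle touches.
            mem = (mem & ~(mask << shift)) | (chunk << shift)
        results.add(mem)
    return results

# 8-bit value on an 8-bit bus: one cycle per writer, so no tearing.
print([hex(v) for v in sorted(final_values(8, 8, 0x00, 0xFF))])
# 8-bit value on a 4-bit bus: two cycles per writer, so halves can mix.
print([hex(v) for v in sorted(final_values(8, 4, 0x00, 0xFF))])
```

In this model the full-width case only ever yields one of the two written values, while the split case also yields the two half-and-half mixes. Whether real hardware behaves like the first or the second case for a given store depends on the ISA's atomicity guarantees and on alignment, which this sketch does not model.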
