If the only CPU has the memory bus locked, no other device can read or change memory contents during that time, not even via DMA. (The same goes for multiple CPUs on a shared bus with no cache.) Therefore no other memory operation at all can happen between the load and the store of a `lock add [di], ax`, for example, making it atomic wrt. any possible observer. (Other than a logic analyzer connected to the bus, which doesn't count.)
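To see the same guarantee in modern terms, here's a minimal C11 sketch (the counter and function names are just illustrative). With optimization enabled, GCC/clang on x86-64 typically compile the plain increment to a memory-destination `add` with no `lock` prefix, and the atomic one to `lock add`:

```c
#include <stdatomic.h>
#include <stdio.h>

int plain_counter = 0;           // illustrative names, not from the question
_Atomic int locked_counter = 0;

void bump(void) {
    plain_counter++;   // typically: add dword ptr [plain_counter], 1
                       // one instruction, but its load+add+store is not atomic
                       // wrt. other cores, so concurrent increments can be lost
    locked_counter++;  // typically: lock add dword ptr [locked_counter], 1
                       // the lock prefix makes the whole RMW atomic
}

int main(void) {
    bump();
    printf("%d %d\n", plain_counter, atomic_load(&locked_counter));
}
```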
Semi-related: *Can num++ be atomic for 'int num'?* describes how the `lock` prefix works on modern CPUs for cacheable memory, providing RMW atomicity without a bus lock, just by hanging on to the cache line for the duration.
We call this a "cache lock"; all modern CPUs work this way for aligned `lock`ed operations, only doing an expensive bus lock for something like `xchg [mem], ax` that spans a boundary between two cache lines (a "split lock"). That hurts throughput on all cores, and is so expensive that modern CPUs have split-lock detection to make it always fault (without affecting other unaligned loads/stores), as well as performance counters for it.
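If you want to provoke the expensive case on purpose, the sketch below (GCC/clang `__atomic` builtins on x86-64; the buffer layout is my own construction, not anything from the question) places a 4-byte object across a 64-byte boundary so the `lock`ed RMW becomes a split lock. Misaligned atomics are undefined behavior in C, so treat this strictly as a hardware demo:

```c
#include <stdio.h>

_Alignas(64) static char buf[128];

int main(void) {
    // A 4-byte int covering bytes 62..65 of buf straddles two cache lines.
    int *split = (int *)(buf + 62);     // misaligned on purpose: UB in C
    __atomic_fetch_add(split, 1, __ATOMIC_SEQ_CST);
    // ^ lock add across a cache-line boundary -> full bus lock.
    // With split-lock detection enabled (e.g. Linux split_lock_detect=fatal),
    // the locked op above faults instead, killing the process with SIGBUS.
    printf("%d\n", *split);
}
```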
Fun fact: `xchg [mem], reg` has implicit `lock` semantics on 386 and newer. (Which is unfortunate, because it's far too expensive to use as just a plain load+store when you're running low on registers.) It didn't on 286 or earlier, unless you used an explicit `lock xchg`. This is possibly related to the fact that there were SMP 386 systems (with a primitive sequentially-consistent memory model); the modern x86 memory model applies to 486 and later SMP systems.
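That implicit `lock` is also exactly why compilers only reach for `xchg` when they want its full-barrier effect. A hedged C11 sketch (illustrative names) of the two kinds of atomic store on x86-64:

```c
#include <stdatomic.h>

_Atomic int flag = 0;   // illustrative name

void publish_seq_cst(void) {
    // Needs a full barrier.  Typical codegen is either
    //   mov eax, 1 ; xchg dword ptr [flag], eax   (e.g. clang)
    // or
    //   mov dword ptr [flag], 1 ; mfence          (e.g. GCC)
    // xchg's implicit lock provides the barrier in one instruction.
    atomic_store_explicit(&flag, 1, memory_order_seq_cst);
}

void publish_release(void) {
    // x86's memory model makes every plain store a release store, so this
    // is just:  mov dword ptr [flag], 1
    // Using xchg here would be correct but much slower, thanks to the
    // implicit lock -- the "unfortunate" part mentioned above.
    atomic_store_explicit(&flag, 1, memory_order_release);
}
```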