addl $9, _x(%rip)
_x is a global variable. I'm not certain how adding to a global variable is implemented in this case, or whether this line has inherent race conditions on a multiprocessor system.
As duskwuff pointed out, you need a lock prefix.
The reason why is that:
addl $9,_x(%rip)
is actually three "micro operations" from the standpoint of the memory system [%eax appears here just for illustration; it is never really used]:
mov _x(%rip),%eax
addl $9,%eax
mov %eax,_x(%rip)
Here's a valid sequence of events, which the lock prefix guarantees. Assuming _x starts at 0, at the end _x will be 18:
# this is a valid sequence
# cpu 1                         # cpu 2
mov _x(%rip),%eax
addl $9,%eax
mov %eax,_x(%rip)
                                mov _x(%rip),%eax
                                addl $9,%eax
                                mov %eax,_x(%rip)
But, without the lock, we could get:
# this is an invalid sequence
# cpu 1                         # cpu 2
mov _x(%rip),%eax
                                mov _x(%rip),%eax
addl $9,%eax                    addl $9,%eax
mov %eax,_x(%rip)
                                mov %eax,_x(%rip)
At the end, _x will be 9. A further jumbling of the sequence could produce 18. So, depending on the exact sequencing between the micro ops on the two CPUs, we could have either 9 or 18.
We can make it a bit worse. If CPU 2 added 8 instead of 9, the sequence without lock could produce any of: 8, 9, or 17.
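The outcomes listed above can be checked by brute force. The following is only a sketch of the three-step model, not of real hardware: it tries every way the two CPUs' fetch/add/store steps can interleave and records which final values of x are reachable. The names bits_set, run_interleaving, possible_results, and MAX_VALUE are made up for this illustration, and x is assumed to start at 0.

```c
#include <stdbool.h>

#define MAX_VALUE 64   /* illustrative bound on the final value of x */

/* Count the set bits in m. */
static int bits_set(unsigned m)
{
    int n = 0;
    for (; m != 0; m >>= 1)
        n += (int)(m & 1u);
    return n;
}

/* Run one interleaving of the six steps. Bit i of `order` selects which
 * CPU (0 or 1) executes the i-th step overall. Returns the final x. */
static int run_interleaving(unsigned order, const int add[2])
{
    int x = 0;              /* shared memory location, assumed to start at 0 */
    int reg[2] = {0, 0};    /* each CPU's private register */
    int step[2] = {0, 0};   /* next step each CPU will execute */

    for (int i = 0; i < 6; i++) {
        int cpu = (int)((order >> i) & 1u);
        switch (step[cpu]++) {
        case 0: reg[cpu] = x;         break;  /* fetch from memory */
        case 1: reg[cpu] += add[cpu]; break;  /* add in the register */
        case 2: x = reg[cpu];         break;  /* store back to memory */
        }
    }
    return x;
}

/* Try every interleaving and set seen[v] for each reachable final value v.
 * Returns the number of distinct final values. */
static int possible_results(int add1, int add2, bool seen[MAX_VALUE])
{
    const int add[2] = {add1, add2};
    int distinct = 0;

    for (int v = 0; v < MAX_VALUE; v++)
        seen[v] = false;
    for (unsigned order = 0; order < 64; order++) {
        if (bits_set(order) != 3)
            continue;       /* each CPU must run exactly three steps */
        int x = run_interleaving(order, add);
        if (x >= 0 && x < MAX_VALUE && !seen[x]) {
            seen[x] = true;
            distinct++;
        }
    }
    return distinct;
}
```

Calling possible_results(9, 9, seen) marks 9 and 18, and possible_results(9, 8, seen) marks 8, 9, and 17. In this model, the lock prefix corresponds to allowing only the two orderings in which one CPU finishes all three steps before the other starts.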
UPDATE:
Based on some comments, just to clarify terminology a bit.
When I said micro operations ... it was in quotation marks, so I was coining a term for purposes of discussion herein. It was not meant to translate directly to x86 uops as defined in the x86 processor literature. I could have [perhaps should have] said steps.
Likewise, although it seemed easiest and clearest to express the steps using x86 asm, I could have been more abstract:
(1) FETCH_MEM_TO_MREG _x
(2) ADD_TO_MREG 9
(3) STORE_MREG_TO_MEM _x
Unfortunately, these steps are carried out purely in hardware logic (i.e. there is no way for a program to see them or step through them with a debugger). The memory system (e.g. cache logic, DRAM controller, et al.) will notice (and have to respond to) steps (1) and (3). The CPU's ALU will perform step (2), which is invisible to the memory logic.
Note that some RISC CPU arches don't have add instructions that work on memory nor do they have lock prefixes. See below.
Aside from reading some literature, a practical way to examine the effects is to create a C program that uses multiple threads (via pthreads) and uses some C atomic operations and/or pthread_mutex_lock.
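A minimal sketch of such a program might look like the following. It assumes a C11 compiler with <stdatomic.h> and POSIX threads. The names x_atomic, x_mutex, run_two, total_atomic, and total_mutex, and the ITERS count, are invented for this example. Two threads each add 9 a fixed number of times, once with atomic_fetch_add and once under a mutex. On x86-64, compilers typically turn the atomic add into a lock-prefixed instruction, but that is compiler behavior, not something the C standard promises.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define ITERS 100000            /* additions per thread (arbitrary choice) */

static atomic_int x_atomic;     /* updated with atomic_fetch_add */
static int x_mutex;             /* plain int, only touched under x_lock */
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

static void *add_atomic(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add(&x_atomic, 9);   /* indivisible read-modify-write */
    return NULL;
}

static void *add_mutex(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&x_lock);
        x_mutex += 9;                     /* fetch/add/store, but serialized */
        pthread_mutex_unlock(&x_lock);
    }
    return NULL;
}

/* Run `worker` on two threads and wait for both to finish. */
static void run_two(void *(*worker)(void *))
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
}

static int total_atomic(void)
{
    atomic_store(&x_atomic, 0);
    run_two(add_atomic);
    return atomic_load(&x_atomic);
}

static int total_mutex(void)
{
    x_mutex = 0;
    run_two(add_mutex);
    return x_mutex;
}
```

Both total_atomic() and total_mutex() should return 2 * ITERS * 9. Replacing the atomic add with a plain x += 9 on an ordinary int, with no mutex, is the experiment that can lose updates; formally that is a data race and undefined behavior in C, so the result of such a run is not predictable.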
Also, the question "Atomically increment two integers with CAS" has an answer I gave, along with a link to a video of a talk given by another guy at CppCon (about "lockless" implementations).
This more general model can also illustrate what can happen in a database that doesn't do proper record locking.
The actual mechanics of how lock is implemented can be x86 model specific. They may also be target instruction specific (e.g. lock works differently if the target instruction is [say] addl vs xchg) because the processor may be able to use a more efficient/special type of memory cycle (e.g. something like an atomic "read-modify-write").
In other cases (e.g. where the data is too wide for a single cycle or spans a cache line boundary), it may have to lock the entire memory bus (e.g. grab a global lock and force full serialization), do multiple reads, make changes, do multiple writes, and then unlock the memory bus. This mode is similar to how one would wrap something inside a mutex lock/unlock pairing, only done in hardware at the memory bus logic level.
A note about ARM [a RISC cpu]. ARM only supports ldr r1,memory_address and str r1,memory_address, but not add r1,memory_address. It only allows add r1,r2,r3 [i.e. it's "ternary"] or possibly add r1,r2,#immed. To implement locking, ARM has two special instructions, ldrex and strex, that must be paired. In the abstract model above, it would look like:
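ldr r0,=_x          // r0 = address of _x
retry:
ldrex r1,[r0]       // load _x and mark the address for exclusive access
add r1,r1,#9
strex r2,r1,[r0]    // try to store; r2 = 0 on success, 1 on failure
cmp r2,#0
bne retry           // must be tested for success and loop back if failed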
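The same load, modify, try-to-store, retry shape can be written portably in C11 with a compare-and-swap loop. This is a sketch of the pattern rather than what any particular compiler emits, though on ARM a weak compare-exchange is commonly built from an ldrex/strex pair. The function name add_with_cas is made up for this example.

```c
#include <stdatomic.h>

/* Add `delta` to *p with a retry loop and return the new value. */
static int add_with_cas(atomic_int *p, int delta)
{
    int old = atomic_load(p);           /* like ldrex: read the current value */
    int desired;

    do {
        desired = old + delta;          /* like add: compute in a register */
        /* Like strex: the store succeeds only if *p still equals `old`.
         * On failure, `old` is refreshed with the current value of *p,
         * and the loop recomputes and tries again. */
    } while (!atomic_compare_exchange_weak(p, &old, desired));

    return desired;
}
```

The weak form may fail spuriously even when no other thread touched the value, which is why it always sits inside a loop, much like the strex failure check.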
No. There is a tiny window between the processor reading the old value of _x and writing the new value back; if another CPU writes to _x at that exact moment, that value will be overwritten.
Adding the LOCK prefix to the instruction will make the operation atomic.
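For illustration, here is one way to issue the locked instruction from C. It is a sketch that assumes GCC or Clang (extended inline asm and the __atomic builtins are compiler extensions). The variable x stands in for the global _x, and the function name add9_locked is invented. On targets other than x86 it falls back to the compiler's atomic builtin.

```c
#include <stdint.h>

static int32_t x;   /* stands in for the global _x */

/* Atomically add 9 to x. */
static void add9_locked(void)
{
#if defined(__x86_64__) || defined(__i386__)
    /* The lock prefix makes the whole read-modify-write indivisible. */
    __asm__ __volatile__("lock addl $9, %0"
                         : "+m"(x)          /* x is read and written in memory */
                         :
                         : "cc", "memory");
#else
    /* Non-x86 fallback (an assumption of this sketch): compiler builtin. */
    __atomic_fetch_add(&x, 9, __ATOMIC_SEQ_CST);
#endif
}
```

In ordinary code, atomic_fetch_add from <stdatomic.h> is the portable way to get the same effect. The inline asm is shown only to make the prefix visible.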