addl $9, _x(%rip)

_x is a global variable. I'm not certain how adding to a global variable is implemented in this case, or whether this line has an inherent race condition on a multiprocessor system.

SystemFun

2 Answers


As duskwuff pointed out, you need a lock prefix.

The reason why is that:

addl $9,_x(%rip)

is actually three "micro operations" from the standpoint of the memory system [herein %eax just for illustration--never really used]:

mov     _x(%rip),%eax
addl    $9,%eax
mov     %eax,_x(%rip)

Here's a valid sequence of events, which is what the lock prefix guarantees. Assuming _x starts at 0, at the end _x will be 18:

# this is a valid sequence

# cpu 1                         # cpu 2
mov     _x(%rip),%eax
addl    $9,%eax
mov     %eax,_x(%rip)
                                mov     _x(%rip),%eax
                                addl    $9,%eax
                                mov     %eax,_x(%rip)

But, without the lock, we could get:

# this is an invalid sequence

# cpu 1                         # cpu 2
mov     _x(%rip),%eax
                                mov     _x(%rip),%eax
addl    $9,%eax                 addl    $9,%eax
mov     %eax,_x(%rip)
                                mov     %eax,_x(%rip)

At the end, _x will be 9. A further jumbling of the sequence could produce 18. So, depending on the exact sequencing between the micro ops on the two CPUs, we could have either 9 or 18.

We can make it a bit worse. If CPU 2 added 8 instead of 9, the sequence without lock could produce any of 8, 9, or 17.


UPDATE:

Based on some comments, just to clarify terminology a bit.

When I said micro operations ... it was in quotation marks, so I was coining a term for purposes of discussion herein. It was not meant to translate directly to x86 uops as defined in the x86 processor literature. I could have [perhaps should have] said steps.

Likewise, although it seemed easiest and clearest to express the steps using x86 asm, I could have been more abstract:

(1) FETCH_MEM_TO_MREG _x
(2) ADD_TO_MREG 9
(3) STORE_MREG_TO_MEM _x

Unfortunately, these steps are carried out purely in hardware logic (i.e. there is no way for a program to see them or step through them with a debugger). The memory system (e.g. cache logic, DRAM controller, et al.) will notice (and have to respond to) steps (1) and (3). The CPU's ALU will perform step (2), which is invisible to the memory logic.

Note that some RISC CPU arches don't have add instructions that work on memory nor do they have lock prefixes. See below.

Aside from reading some literature, a practical way to examine the effects is to create a C program that uses multiple threads (via pthreads) and uses some C atomic operations and/or pthread_mutex_lock.

Also, the page "Atomically increment two integers with CAS" has an answer I gave, plus a link to a video of a talk given by another guy at cppcon (about "lockless" implementations).

In this more general model, it can also illustrate what can happen in a database that doesn't do proper record locking.

The actual mechanics of how lock is implemented can be x86 model specific.

And, possibly, target instruction specific (e.g. lock works differently if the target instruction is [say] addl vs xchg) because the processor may be able to use a more efficient/special type of memory cycle (e.g. something like an atomic "read-modify-write").

In other cases (e.g. where the data is too wide for a single cycle or spans a cache line boundary), it may have to lock the entire memory bus (e.g. grab a global lock and force full serialization), do multiple reads, make changes, do multiple writes, and then unlock the memory bus. This mode is similar to how one would wrap something inside a mutex lock/unlock pairing, only done in hardware at the memory bus logic level.

A note about ARM [a RISC cpu]. ARM supports ldr r1,memory_address and str r1,memory_address, but not add r1,memory_address. It only allows add r1,r2,r3 [i.e. it's "ternary"] or possibly add r1,r2,#immed. To implement locking, ARM has two special instructions, ldrex and strex, that must be paired. In the abstract model above, it would look like:

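ldr r0,=_x          // r0 = address of _x
retry:
ldrex r1,[r0]       // load _x and tag the address for exclusive access
add r1,r1,#9
strex r2,r1,[r0]    // try to store; r2 = 0 on success, 1 on failure
cmp r2,#0
bne retry           // the store failed, so loop back and try again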
Craig Estey
  • Is there a way to see the "micro" operations you mention, or is that given by the op codes? Your answer is very helpful thank you! Followup, gcc, with O1 changes the mov, add, mov sequence to the instruction above. If what you say is true, then is this really an optimization? – SystemFun Feb 11 '16 at 07:15
  • There's a difference between 3 x86 instructions and one memory-destination x86 instruction which decodes to a read, modify, and write operations internally. For one thing, the architectural state has either done the add or not. An interrupt can't stop the sequence part way through, so it's atomic *on a single-core system*. And yes, it's an optimization: [the memory operand can micro-fuse with the add to take fewer uops in the out-of-order pipeline.](http://agner.org/optimize/) – Peter Cordes Feb 11 '16 at 10:58
  • @SystemFun: For example, on Intel SnB-family CPUs, `add m, r/i` only decodes to 2 uops in the fused domain. It's 4 total unfused uops, just like the load/add/store would be (store addr/data are separate), but that sequence would be 3 fused-domain uops. (Nobody's designed an x86 CPU that fuses separate x86 instructions other than branches.) Besides that, it obviously takes fewer machine-code bytes, which is always better (all else equal). – Peter Cordes Feb 11 '16 at 11:00

No. There is a tiny window between the processor reading the old value of _x and writing the new value back; if another CPU writes to _x at that exact moment, that value will be overwritten.

Adding the LOCK prefix to the instruction will make the operation atomic.

  • Where exactly is the read? Sorry, I'm new to assembly. Essentially to me it looks as if it is adding "in-place", which is why I thought it might be atomic. Does it load it into a register to add? And are there such things as "in-place" adds? – SystemFun Feb 11 '16 at 06:30
  • It's implied. Read `_x` from memory, add 9 to that, write the result back. There's no such thing as a truly "in place" operation; everything has to go to the CPU to be operated upon, even if it's written right back afterwards. –  Feb 11 '16 at 06:38