I've read that the INC instruction of x86 is not atomic. My question is: how come? Suppose we are incrementing a 64-bit integer on x86-64. We can do it with one instruction, since the INC instruction works with both memory operands and registers. So how come it's not atomic?
-
Well, it _is atomic_, if you prefix it with LOCK. Usually that's not what one wants, though, because it's quite expensive. Therefore you need to make explicit what you want. – Damon Apr 11 '12 at 16:20
-
Atomic does not mean it's one instruction, it means it's one indivisible action. And `inc` with a memory operand isn't that, not by default anyway. – harold Apr 11 '12 at 17:01
3 Answers
Why would it be? The processor core still needs to read the value stored at the memory location, calculate the increment, and then store it back. There's a latency between reading and storing, and in the meantime another operation could have affected that memory location.
Even with out-of-order execution, processor cores are 'smart' enough not to trip over their own instructions, and the same core won't modify that memory in the time gap. However, another core could issue an instruction that modifies the location, a DMA transfer could write to it, or other hardware could touch that memory location somehow.

-
You should be a bit more clear about what "another operation" means. Certainly no other operation can happen on the same cpu core, only on other cores/cpus or other hardware fiddling around on the memory bus. – R.. GitHub STOP HELPING ICE Apr 11 '12 at 17:43
Modern x86 processors, as part of their execution pipeline, "compile" x86 instructions into a lower-level set of operations; Intel calls these uOps, AMD rOps, but what it boils down to is that certain types of single x86 instructions get executed by the actual functional units in the CPU as several steps.
That means, for example, that:

    INC EAX

gets executed as a single "mini-op", something like `uOp.inc eax` (let me call it that - they're not exposed). For other operands things look different. For:

    INC DWORD PTR [ EAX ]

the low-level decomposition looks more like:

    uOp.load tmp_reg, [ EAX ]
    uOp.inc tmp_reg
    uOp.store [ EAX ], tmp_reg

and is therefore not executed atomically. If, on the other hand, you write `LOCK INC DWORD PTR [ EAX ]`, the LOCK prefix tells the "compile" stage of the pipeline to decompose the operation differently, ensuring the atomicity requirement is met.
The reason for this is, of course, as mentioned by others: speed. Why make something atomic, and necessarily slower, if it's not always required?

-
The "mini-op" decomposition is irrelevant to atomicity, since a single cpu core cannot be interrupted mid-instruction. In fact inc with no lock prefix is perfectly atomic on single-core machines. It's only when other cores (or more obscurely, other hardware on the bus) could be accessing the memory that the lock prefix matters. – R.. GitHub STOP HELPING ICE Apr 11 '12 at 17:42
-
@R..: Argued like that, _any_ modify-mem cpu op on single cores would be atomic no matter how done. But even single-core machines aren't "single" today because busmastering DMA / memory busses shared with peripherals ensure the presence of cache coherence and atomicity issues. There's always more than one memory bus client. Hence, load/stores are, on memory bus level, _always_ decomposed even if they happen as part of a "single" cpu instruction. Atomicity must be asserted (exclusive memory bus access); the CPU cannot execute a modify-mem as load/change/store but must bracket with bus lock/unlock. – FrankH. Apr 12 '12 at 09:23
-
@R..: ARM CPUs, for example, explicitly expose the bus lock need for atomicity on instruction set level via `LDREX`/`STREX`. Just from the fact x86 does have mem-modify instructions one cannot conclude the need for explicitness isn't there. Also, the question isn't about interrupting mid-instruction - that's not the same as atomicity. The decomposition matters strongly in that sense because the _single_ instruction's memory accesses can _race_ with those of other CPUs. The instruction completes (there's no trap requiring a restart) but the outcome (without `lock`) is not unique / determinate. – FrankH. Apr 12 '12 at 09:38
-
Normally you don't have other devices on the bus touching your program's memory; that's a really special case that only happens in hardware drivers. For most practical purposes, a load-modify-write instruction is atomic on single-core machines, and in fact plenty of code (Linux included) omits the lock prefix on archs that have load-modify-write instructions when built for use on non-SMP targets only, because it was historically somewhat faster and perfectly safe for that usage case. – R.. GitHub STOP HELPING ICE Apr 12 '12 at 10:18
-
@R..: If you're saying that the non-atomicity without `lock` is of no consequence on single-core machines for the usecases where it matters (synchronization primitives), then I agree. But I disagree that this makes the "default" atomic - you just don't _need_ atomicity for _synchronization_ on single cores. – FrankH. Apr 12 '12 at 12:28
-
@FrankH.: On the machines I've seen, LDREX/STREX don't lock the bus; instead, they provide a means via which the "store" will be cancelled if anything on the bus might conflict with the most recent LDREX. STREX is allowed to spontaneously fail sometimes, provided that an LDREX/STREX combo with only a few instructions between has a likelihood of success. – supercat Feb 09 '14 at 00:58
-
@FrankH.: `inc [mem]` *is* atomic with respect to context switches on the same core: the whole instruction either happened before or after the interrupt that led to the context switch. If some but not all uops of the instruction have executed, all of them will be cancelled when taking the interrupt to preserve this illusion of atomicity wrt. interrupts. Only MMIO devices or DMA reads can observe the non-atomicity, not other code on the same CPU. The important point is atomicity with respect to which observers, because a logic analyzer would of course see separate load/store (if uncached). – Peter Cordes Feb 25 '18 at 02:09
-
What would make `num+=1` non-atomic (wrt. other threads on a UP system) is if the compiler chose to load into a register, increment that, and then store with a separate instruction. This is a very common choice if the value is needed for future instructions. See https://stackoverflow.com/questions/39393850/can-num-be-atomic-for-int-num/39414316#39414316 and comments on it for more about this being useful in practice like @R.. was talking about. – Peter Cordes Feb 25 '18 at 02:31
You really don't want a guaranteed-atomic operation unless you need it. From Agner Fog's software optimization resources, instruction_tables.pdf (1996 – 2017):
Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

-
This information is surely outdated; a whole mutex lock/unlock cycle takes less than 90 cycles on one machine I tested, and involves multiple lock-prefixed operations and the rdtsc overhead. Testing with a single lock inc instruction between rdtsc's, I was unable to even measure it taking any time (same time as nop). On modern cpus, it seems that the lock prefix does not increase the time at all unless the memory is presently shared with other cores. – R.. GitHub STOP HELPING ICE Apr 11 '12 at 17:40
-
@R.. - x86-64 has been available since 2003, so it's probably a blanket statement. I am wondering how it would impact on a pending interrupt / ctx switch. – Brett Hale Apr 11 '12 at 18:54
-
@R.. Well, less than 90 and more than 100 are not that far apart :-) – Gunther Piez Apr 11 '12 at 20:57
-
That 90 ns encompasses: (1) rdtsc time (perhaps 40 ns, I forget the exact cost), (2) function call and return overhead, (3) not one but at least two lock-prefixed instructions (for obtaining and releasing the lock). Items (1) and (2) account for most of that time, leaving (3) at near-zero... – R.. GitHub STOP HELPING ICE Apr 12 '12 at 00:40
-
This notice in instruction_tables.pdf is really old. Even in http://www.agner.org/optimize/microarchitecture.pdf for Nehalem there is: "*Thread synchronization primitives, e.g. the LOCK XCHG instruction, are considerably faster than on previous processors.*". Only in earlier shared-bus based CPUs was LOCK an external pin of the bus that globally locked memory; modern multicore CPUs with integrated memory controllers (IMC) and point-to-point intersocket channels (single socket too) use the cache-coherency protocol to do atomic operations, and it is done in some layer of the cache hierarchy. Which one? – osgx May 29 '17 at 23:09