8

How do atomic operations work, under the hood?

Are atomic operations so-called "wait-free"?

I'm looking for a description of the "lowest common denominator" of atomic operations. What do all atomic operations share?

Pindatjuh

4 Answers

11

If we're talking about atomic operations that are used by synchronization mechanisms (mutexes, semaphores, etc.), they have to be supported by the OS on single-CPU machines and by the hardware on multi-CPU machines.

On a single-CPU machine, an instruction sequence can be made "atomic" in the sense that it cannot be interrupted in the middle (e.g. by the timer interrupt that triggers a switch to another thread) if interrupts are turned off. This means that synchronization primitives can be written quite simply once the CPU enters kernel mode and can access the interrupt control registers.

In a multi-core machine it is more complex. There, the instructions have to be truly atomic across all CPUs. This requires all CPUs, not only the one executing the atomic instruction, to flush the relevant parts of their caches to RAM. This flushing is what makes synchronization so expensive on these architectures.

The instructions themselves take the form of "bit test and set" in one operation. This is enough to implement a simple mutex. Even if two threads on different CPUs/cores execute the test-and-set operation at the same time on the same address, only one will get the result that the bit was unset and is now set. That thread is the one that owns the mutex.

Anders Abel
  • I'm trying to get a conceptual "lowest common denominator" description, but thank you for your answer. – Pindatjuh Jul 28 '11 at 18:28
  • When are interrupts *ever* disabled? Is that actually ever done in reality? Seems like such a mechanism would be a dangerous thing to leave lying around in user space. – Rag Jul 28 '11 at 19:44
  • @Brian: Interrupts can only be disabled in kernel space. They are commonly disabled inside kernels during critical operations and re-enabled afterwards. The most common situation is the automatic masking of lower-priority interrupts whenever an interrupt handler is run. – Anders Abel Jul 28 '11 at 19:46
5

Atomicity as a concept occurs in several places. I suspect you're thinking about atomic operations in code, but there are other meanings.

One fundamental property of a database transaction is atomicity; see a description of the ACID properties of transactions.

In this case, you have lots of database cleverness, locks, and so on, which almost certainly imply waiting when two threads of control (or two processes) want to get at the same data.

When it comes to lines of code, I guess you're thinking about a declaration (in some fictitious language)

global int x = 7;

in one thread

x = 25000;

print x;

and in another

print x;

Can we say anything about what the second thread will print? We might accept either 7 or 25000; we'd be less happy to get a number whose high-order byte came from 25000 and whose low-order byte came from 7, which conceptually would be the result of a non-atomic integer assignment.

Different programming languages are free to define whatever semantics they wish. It's conceivable that some would just accept whatever the natural behavior of their CPU happens to be (say, a 32-bit int assignment is atomic, a 64-bit long is not), or they might do something much cleverer. And if the CPU itself doesn't provide atomic operations, I don't see much alternative to some kind of waiting if they want to fake atomicity, e.g. Java's synchronized keyword.

djna
  • Atomicity implies waiting for its complete operation to finish before its result is visible to other interested parties. Is that a correct summary? – Pindatjuh Jul 28 '11 at 18:24
  • Yes, atomicity implies that you can't get a partial result. In some cases there's no possibility of interruption, so you're just "waiting" for a single CPU operation; in others you need an explicit wait, implemented by some extra code (e.g. Java synchronized), and pay some overhead. I think you want to know when such overhead occurs; unfortunately there's no one answer, it's all language- and platform-dependent. – djna Jul 28 '11 at 18:32
  • An atomic operation is all or nothing. In those cases where the operation has started, if it's unable to finish, it backs out gracefully, ensuring that on failure everything is left as it was when it started. – MRAB Jul 28 '11 at 18:49
  • @Pindatjuh: Not just waiting, that doesn't usually help to get the required all-at-once or not-at-all behaviour, unless you're talking about using a lock to protect a data structure, and waiting to unlock it until you're all done with your whole transaction on it. In assembly language, you may need to use special instructions. – Peter Cordes Feb 12 '23 at 06:49
0

Depends on the atomic operation you're talking about. If you're talking about ISA-level stuff, "test and set" instructions are included in some popular ISAs, I think.

Patrick87
  • What is "ISA"? (International Society of Automation, Independent Safeguard Authorization, International Surfing Agency, etc. is probably not it) – Pindatjuh Jul 28 '11 at 18:16
  • Instruction Set Architecture... like what assembly language programs are written in (well, mnemonics for machine instructions, but still). – Patrick87 Jul 28 '11 at 18:20
  • Thank you! I never liked acronyms: they are rather obscure. – Pindatjuh Jul 28 '11 at 18:21
0

Can num++ be atomic for 'int num'? explains how modern x86 CPUs implement atomic RMWs by holding a cache lock (delaying cache-coherency responses for that cache line, so this core maintains exclusive ownership of it from the load to the store of the microcoded instruction).

Wait-free vs. lock-free is debatable on x86 or ARMv8.1, depending on whether you're counting retries or clock cycles. But on LL/SC machines, RMW operations like .fetch_add need a retry loop because they fail on contention instead of having the hardware wait. See Anything in std::atomic is wait-free?


Pure-load or pure-store operations are easier for CPUs to handle atomically; it just happens naturally in assembly language for loads and stores that are aligned and no wider than the load/store execution units, and/or the chunk size for transfers between caches. (In higher-level languages, you need stuff like C++ std::atomic<int64_t> to make sure the compiler actually uses a single instruction and doesn't optimize away the memory access, and also for ordering wrt. other operations.) See Atomicity on x86 for more details about pure-loads and pure-stores. Most of it also applies to other architectures, apart from the choice of examples.


Atomicity for larger operations like database transactions isn't provided directly by hardware, rather by software. (Unless you have transactional memory like Intel's TSX (https://www.realworldtech.com/haswell-tm/), or PowerPC HTM in POWER8.)


This answer intentionally just summarizes and links to other answers for deeper details, because much has already been written and there's no need to repeat it all here. Seriously, go read some of the answers I linked on other questions if you're interested in a more low-level understanding of CPU hardware atomic operations.

Peter Cordes
  • Probably the entire RMW has to execute at retirement (i.e. when it is no longer speculative), since you want to drain the store buffer and do the lock-unlock of the cache line close in time. In particular, I believe that only one cache line will be locked at one time, to avoid the possibility of deadlock with other cores trying to lock 2+ lines in reverse order. – BeeOnRope Feb 26 '23 at 01:41
  • However, this would seriously impair performance for many close atomic ops, especially those that miss in cache, so I believe on Intel there is essentially a prefetch op that can execute early in the normal "OoO flow", which makes the aforementioned "at retirement" execution as fast as possible. The prefetch of course does not affect the atomicity or ordering semantics any more than an explicit prefetch would. AMD has other tricks up their sleeve. – BeeOnRope Feb 26 '23 at 01:44
  • @BeeOnRope: Ah, that makes sense. I remember you'd previously suggested Intel having uops that maybe run the load side early, at least speculatively, but good point about the store having to be non-speculative. Perhaps different from `prefetchw`, since they didn't implement that feature for a long time (not until Broadwell; it decoded as a NOP for many years before that, which was also a change from even older CPUs). – Peter Cordes Feb 26 '23 at 01:47
  • Yes, I previously described this a bit differently (that the load and store would be separated in the OoO flow and rely on some machinery to detect when the load and store would in fact not be atomic), but I have received information which has updated my view to basically: implied prefetch + locked load + ALU + store at retirement. – BeeOnRope Feb 26 '23 at 01:52
  • As I recall, the prefetch has a predictor which can turn it off if the line is often lost before the RMW occurs, since that may be a common case for lines under heavy contention; in that case the prefetch might cause catastrophically bad behavior (imagine many cores competing over the line). – BeeOnRope Feb 26 '23 at 01:58
  • It seems at least plausible that the op is similar to prefetchw: no doubt they want the line in E/M state. Adding private behavior embedded in an existing instruction is probably easier than deciding to expose a whole new instruction which would need to be supported forever. – BeeOnRope Feb 26 '23 at 02:01