
On Visual C++ 2013, when I compile the following code

#include <atomic>

int main()
{
    std::atomic<int> v(2);
    return v.fetch_add(1, std::memory_order_relaxed);
}

I get back the following assembly on x86:

51               push        ecx  
B8 02 00 00 00   mov         eax,2 
8D 0C 24         lea         ecx,[esp] 
87 01            xchg        eax,dword ptr [ecx] 
B8 01 00 00 00   mov         eax,1 
F0 0F C1 01      lock xadd   dword ptr [ecx],eax 
59               pop         ecx  
C3               ret              

and similarly on x64:

B8 02 00 00 00    mov         eax,2 
87 44 24 08       xchg        eax,dword ptr [rsp+8] 
B8 01 00 00 00    mov         eax,1 
F0 0F C1 44 24 08 lock xadd   dword ptr [rsp+8],eax 
C3                ret              

I simply don't understand: why does a relaxed increment of an int variable require a lock prefix?

Is there a reason for this, or did they simply not include the optimization of removing it?


* I used /O2 with /NoDefaultLib to trim it down and get rid of unnecessary C runtime code, but that's irrelevant to the question.

user541686
  • What do you expect? How do you think a relaxed memory order *atomic* increment should be implemented on x86?? – Yakk - Adam Nevraumont Jun 15 '14 at 23:07
  • Because consume/acquire/release semantics only make sense on a processor with a weak memory model. Itanium and ARM are the practical examples. You are however using an x64 core, it blows a raspberry at the notion of being weak. Weak is for wimpy engineers that can't handle the power. – Hans Passant Jun 15 '14 at 23:09
  • @Yakk: Uh I don't think it's possible to perform a non-atomic increment on an aligned `int` on x86 no matter how hard you try... what would be wrong if they just didn't include the `lock`? – user541686 Jun 15 '14 at 23:22
  • @HansPassant: My code is not using consume/acquire/release ordering. It's using relaxed ordering, so I'm not sure why you're talking about the other ones. I'm asking why the code would be wrong without the `lock` prefix. – user541686 Jun 15 '14 at 23:24
  • @Mehrdad: Because it wouldn't be atomic without the prefix. Consider that x86 allows access to unaligned values. What would happen if you just did an `inc` on an off-by-one address? – Zan Lynx Jun 15 '14 at 23:26
  • @ZanLynx: So you think it's only because of alignment? Am I correct that for an aligned `int`, it's unnecessary? And furthermore, I think you're wrong even in the unaligned case, because [it doesn't seem like an atomic increment is even guaranteed to be atomic for unaligned addresses](http://stackoverflow.com/a/5178914/541686). – user541686 Jun 15 '14 at 23:27
  • http://stackoverflow.com/questions/1415256/alignment-requirements-for-atomic-x86-instructions – Zan Lynx Jun 15 '14 at 23:29
  • @ZanLynx: You just linked me to the same page I linked you to. – user541686 Jun 15 '14 at 23:29
  • Sorry. But there it is clearly stated that LOCK DOES guarantee atomic unaligned access, although it is rather slow. – Zan Lynx Jun 15 '14 at 23:30
  • @ZanLynx: So are you specifically saying for the aligned case `lock` is unnecessary? – user541686 Jun 15 '14 at 23:37
  • For x86 that appears to be true. On x86 anything written to memory without locks is visible to other processors although the order is undefined. It is bad portability to assume this however. – Zan Lynx Jun 15 '14 at 23:40
  • @ZanLynx: Okay thanks. Note that portability is irrelevant here, the implementation of `atomic` is itself necessarily non-portable. – user541686 Jun 15 '14 at 23:42
  • If the increment is atomic, then 3 threads incrementing a value 1 million times each gives a result of 3 million, even under relaxed. Without a lock, some of those threads will increment a local cached copy of the integer, and when written back the value can be anywhere from 1 to 3 million, or am I missing something with the x86/64 memory model? – Yakk - Adam Nevraumont Jun 15 '14 at 23:53
  • @Yakk: You know, actually, I'm not quite sure to be honest. `xadd` is a single instruction, so I'm not sure what guarantees Intel makes regarding its atomicity on x86 -- for example, I remember `xchg` has a peculiar atomicity property even without a `lock` prefix, but I'm not sure if `xadd` does too. It seemed plausible that it does, but also that it doesn't. – user541686 Jun 16 '14 at 00:04
  • @Yakk: Actually -- I'm not sure what you said is correct. (I'm not saying it's wrong, I'm literally saying I'm not sure.) In the scenario you mentioned, what would be the difference if instead of `memory_order_relaxed` we used `memory_order_seq_cst`? – user541686 Jun 16 '14 at 00:06
  • @Yakk, last time I did ++x with multiple threads instead of some sort of atomic increment I got crap for an answer. I'm guessing xadd is internally a read, increment, store. – johnnycrash Jun 16 '14 at 03:31
  • @johnnycrash: Did you mean to reply to me instead? Your results seem to be agreeing with what Yakk said. – user541686 Jun 16 '14 at 03:35
  • Alignment is irrelevant. A plain old `inc [mem]` is not atomic even when aligned. We're not in the single-core days anymore. – harold Jun 16 '14 at 09:58
  • @Mehrdad: Re: "difference between `relaxed` and `seq_cst`": `memory_order_relaxed` guarantees that operations on the individual `atomic` variable are carried out in a specific total order, `memory_order_seq_cst` guarantees a total order over *all* `memory_order_seq_cst` operations, and ensures that writes that happened before a store in one thread happen before a load in another thread. As for your question, I'm pretty sure that the `lock` instruction is emitted to support `i386`. `int` loads are only guaranteed to be atomic from `i486` onwards. – Mankarse Jul 31 '14 at 07:03
  • @Mankarse: No, the lock prefix is needed to join the load and store together with an ALU operation into one single atomic RMW operation, not for 386 compat. Compilers do use plain `mov` loads for `var.load(relaxed)` (or seq_cst for that matter). See [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850) – Peter Cordes Jun 18 '20 at 00:31

3 Answers


Because the lock is still required for the operation to be atomic; even with memory_order_relaxed, the atomicity requirement for increment/decrement is too strong to be met without it.

Imagine the same thing with no locks.

v = 0;

And then we spawn 100 threads, each with this command:

v++;

And then you wait for all threads to finish. What would you expect v to be? Unfortunately, it may not be 100. Say one thread loads the value v = 23, and before it writes back 24, another thread also loads 23 and then writes back 24 as well. The two increments have collapsed into one. This is because the increment itself is not atomic: the load, the add, and the store may each be atomic on their own, but the increment is multiple steps, so the sequence as a whole is not.

But with std::atomic, all operations are atomic, regardless of the std::memory_order setting. The only question is how they are ordered relative to other operations. memory_order_relaxed still guarantees atomicity (and a consistent modification order for that one variable); it just imposes no ordering with respect to other memory operations happening around it.

VoidStar
  • Many non-x86 ISAs can do atomic read-modify-write without forcing a full memory barrier. e.g. ARM, like many RISC ISAs, has a load-linked/store-conditional. Unless you use a barrier instruction, or ARM64 load-acquire, you get an atomic operation that doesn't order other memory operations. So as you say, it's **just an x86 quirk that the only way to do atomic RMW operations is also a full memory barrier**. – Peter Cordes Dec 06 '15 at 16:50
  • Just realized I never accepted an answer here. For anyone reading this almost a decade later, I think my confusion (at the time) was due to the fact that [`xchg` is atomic without a LOCK prefix](https://stackoverflow.com/a/3144453/541686), so I thought other similar instructions would be too - but they aren't. That's all. – user541686 Dec 31 '22 at 08:20

Atomic operations, even with relaxed ordering, still have to be atomic.

Even if some operations on current CPUs were atomic without a lock prefix (hint: they are not, due to multi-core caches), that wouldn't be guaranteed for future CPUs.

It would be shortsighted to have all your binaries fail horribly on the newest architecture just because you wanted to shave a byte off your binary by relying on behavior that is not part of the architecture specification (and thus not guaranteed to be preserved in future x86_64 processors).

Of course in this case multi-core systems are widespread so you do in practice need a lock prefix for it to work on current CPUs. See Can num++ be atomic for 'int num'?

pqnet
  • *Even if the current implemented CPUs do atomic updates even without lock prefix* - but they aren't atomic without `lock`, unless you're talking about uniprocessor systems. (Single core, single socket). See [Can num++ be atomic for 'int num'?](https://stackoverflow.com/q/39393850) This answer sounds like it was written in 1990 or something! (Or early 2000s if you ignore multi-socket SMP machines) – Peter Cordes Jun 18 '20 at 00:28
  • @PeterCordes That's not what I meant. Point is, "let's not enter the discussion about whether that operation is atomic or not. You may even bring proof that for every available CPU it is atomic. It doesn't matter, since you should adhere to the specification because future CPU might break that assumption" – pqnet Jun 18 '20 at 01:15
  • We both fixed the phrasing at almost the same time; mine was slightly later and overwrote yours. Roll back my edit if you like your wording better. It's a valid point to make that compilers have to respect *documented* guarantees, not just current behaviour. (Unfortunate for 16-byte aligned atomic load/store; in practice it's atomic on modern CPUs but vendors refuse to document that so `atomic<__int128>` has to use `lock cmpxchg16b` for pure-load and pure-store, not SSE.) – Peter Cordes Jun 18 '20 at 01:21
  • @PeterCordes you are right though in that it seems to imply that incrementing is atomic even without lock. I did change the wording to be more explicit and avoid confusing readers – pqnet Jun 18 '20 at 01:21
  • @PeterCordes edits on stackoverflow are not atomic either... – pqnet Jun 18 '20 at 01:23
  • They're atomic store operations, not RMWs :P I did see the page shift when the "post has been edited" bar pop up just as I was moving the mouse over to "save changes", so we did actually have sequential consistency; I just hit save anyway intending to check out what edit I'd stepped on. – Peter Cordes Jun 18 '20 at 01:26

First, for reference, consider a normal assignment. It generates the following on Intel/64:

// v = 10;
000000014000E0D0  mov         eax,0Ah  
000000014000E0D5  xchg        eax,dword ptr [v (014001BCDCh)]  

Then consider a relaxed assignment:

// v.store(10, std::memory_order_relaxed);
000000014000E0D0  mov         dword ptr [v (014001BCDCh)],0Ah 

Now, std::atomic::fetch_add() is a read-modify-write operation, and it makes little sense to do it in a "dirty" way. Its default ordering is std::memory_order_seq_cst (see http://en.cppreference.com/w/cpp/atomic/atomic/fetch_add). So, I think, it makes sense to generate a single native instruction for it. At least on Intel/64, where it is cheap:

// v.fetch_add(1, std::memory_order_relaxed)
000000014000E0D0  mov         eax,1  
000000014000E0D5  lock xadd   dword ptr [v (014001BCDCh)],eax  

After all, you can achieve what you want by explicitly writing two operations that the compiler will have to honor:

// auto x = v.load(std::memory_order_relaxed);
000000014000E0D0  mov         eax,dword ptr [v (014001BCDCh)]  

// ++x;
000000014000E0D6  inc         eax  

//v.store(x, std::memory_order_relaxed);
000000014000E0D8  mov         dword ptr [v (014001BCDCh)],eax  
os_
  • This completely misses the point. Even with `mo_relaxed`, atomic fetch_add guaranteed that the add is done atomically. 1000 total increments done by multiple threads will have the same effect as `v+=1000`. relaxed vs. acquire/release or seq_cst affects how *other* memory operations can be reordered around the atomic operation. – Peter Cordes Dec 06 '15 at 16:47