How does a compare and swap loop achieve atomicity?

Question

This page talks about a CAS loop in details: https://preshing.com/20150402/you-can-do-any-kind-of-atomic-read-modify-write-operation/

An example of fetch_multiply in C++:

uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier){
    uint32_t oldValue = shared.load();
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier)){}
    return oldValue;
}

Essentially if the *memory value matches our oldValue then a newValue gets stored atomically, else oldValue get updated with *memory.

I have 2 questions:

1 - Why do we have to check if oldValue is still unchanged in memory? What happens if we just write newValue to memory? Are we trying to avoid overwriting or using an intermediate value from another thread?

2- Suppose this scenario with 2 threads:

Thread B is trying to store a non-aligned value non-atomically. Store tearing occurs.
Thread A attempts a swap.
Swap fails as oldValue doesn't match. An intermediate (teared) value from memory gets loaded to our oldValue.
Thread A does multiplication with an intermediate value and attempts another swap which succeeds.
Now Thread B writes the remaining of its values to the same location partially overwriting our previous write.

I'm assuming Thread B could operate with that much delay and if so not only we multiplied with an intermediate value, it even got partially overwritten afterwords and CAS did nothing.

I can't answer the "how it works" bit -- that's supported down to the hardware level and is probably architecture-specific, and I'd have to google it. But as to the why, you often don't want just any new value, you want a specific one based on the old one. For example, if you're trying to increment a value, then you read X, compute Y=X+1 in a register, and then want to only write Y _iff_ the old value is still X. If it's not, someone else beat you to the write, and you need to read the new value, add to it, and try the write again until you succeed. — yshavit, Jan 10 '22 at 03:57
But that's not what atomicity is about. atomicity does not care if old or new value end up being written (it could be either). it only cares about not dealing with intermediate values. — Dan, Jan 10 '22 at 03:59
An atomic compare-and-swap definitely cares about the old value. That's half the name of it! — yshavit, Jan 10 '22 at 04:00
It has to care about it to make sure the operation is atomic in this case. but by the definition of "atomicity" old value could technically be there, it's not a problem. — Dan, Jan 10 '22 at 04:01
Sure, and if the old value is still there, then the CAS will succeed and you go on your merry way. I'm sorry, but I'm not really sure what you're asking. You asked why a CAS would want to look at the old value. I gave a use case, and you seem to have rejected it; but CASes are there precisely for those sorts of use cases (there are others that are more "interesting" than just X+1, but the principle is the same). I agree that if you don't care about those use cases, you don't need to care about the old value; in those cases, you don't need a CAS at all. — yshavit, Jan 10 '22 at 04:05
I'm having a similar conversation in the bottom answer. So what's the purpose of a CAS loop then? is it to store something atomically? or to make sure the latest value gets stored? because these are 2 different concepts. — Dan, Jan 10 '22 at 04:12
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/240910/discussion-between-yshavit-and-dan). — yshavit, Jan 10 '22 at 04:13
Simple scenario: Assume you want to add an element at the end of a linked list (which may, for instance, represent a lock-free queue). Then, you first find the last element and then atomically replace its `next` pointer but **only if it is null**. That is, only if it successfully compares to `nullptr`. Otherwise, you shift to the next node (if some other thread added it meanwhile). You need a compare-exchange operation. How would you do the same with just atomic store? Another use cases: atomic increment of a floating-point variable, setting the thread-global new maximum, etc. — Daniel Langr, Jan 11 '22 at 13:38
Note that C++ doesn't allow you to work with unaligend objects/values. — Daniel Langr, Jan 11 '22 at 13:57

score 1 · Accepted Answer · edited Jan 11 '22 at 13:31

1

If the target value changes to something else we need to read in oldValue again or we will spin forever.

However the point of the CAS construction is you cannot ever observe an intermediate value in the shared location. A tear is impossible; shared.load() prevents it. This is implemented in hardware.

"What happens if we just write newValue to memory?" Then you don't have atomic access. Always follow the pattern.

"non-aligned value" if shared is non-aligned you have already introduced undefined behavior into your code even before talking about std::atomic. Non-aligned pointers cannot be safely de-referenced. For a normal * you just took a dependency on byte-addressable architecture, but this is a std::atomic. If it's not aligned you can fault even on x86.

edited Jan 11 '22 at 13:31

Dan

2,694
1
6
19

answered Jan 10 '22 at 03:56

Joshua

40,822
8
72
132

Just trying to talk more high level here. What do you mean `Then you don't have atomic access`? I guess if the `loaded` value was an intermediate value then we need to discard it and try again, that's why we do the check? – Dan Jan 10 '22 at 04:02
1

@Dan: Remove "intermediate value" from the concept. If the code is correct that simply can't happen. You go around the loop exactly when some other thread wrote a new final value to `store` while you were computing the new value. – Joshua Jan 10 '22 at 04:04
So what? let the other thread store what it wants.I have already loaded the old value I needed, I will do my multiplication and overwrite whatever the other thread did. What's the problem here? I guess this is what's confusing me. – Dan Jan 10 '22 at 04:06
@Dan: Here's the documentation for the function. "Multiply its first argument by the second argument, and return the original first argument." Your "overwrite what the other thread did" is exactly what you don't want to do. – Joshua Jan 10 '22 at 04:07
let's not focus on C++ here, i just used that as an example. What's the purpose of a CAS loop then? is it to store something atomically? or to make sure the `latest` value gets stored? because these are 2 different concepts. – Dan Jan 10 '22 at 04:10
@Dan: To detect when another thread stomped on your work and you need to start over. All multiple-CPU architectures provide either CAS or CS and either one can be implemented in terms of the other. – Joshua Jan 10 '22 at 04:12
So essentially, to make `fetch_multiply` in this case behave as a `single instruction` we just make sure nothing in memory gets changed while we are doing our work. That's what makes this atomic? – Dan Jan 10 '22 at 04:17
1

@Dan: That is correct. – Joshua Jan 10 '22 at 04:19
1

Hi again, regarding the correction you have made to the code, I was wondering why is it needed? as `compare_exchange_weak` will automatically update `oldValue` if it was different, you won't loop forever? – Dan Jan 10 '22 at 13:50
@Dan: Ah C++. I'm used to a slightly different behavior where it did not. – Joshua Jan 10 '22 at 15:22
Thank you for confirming. Would it be ok to modify the answer to reflect that? – Dan Jan 10 '22 at 15:30
That would be a good idea, yes. – Joshua Jan 10 '22 at 15:31
Would you be able to modify it whenever you get a chance, thanks again. – Dan Jan 10 '22 at 15:54
I have removed the code correction snippet. – Dan Jan 11 '22 at 13:32

Daniel Langr · Answer 2 · 2022-01-11T13:58:55.770

1

Why this would not work?

uint32_t fetch_multiply(std::atomic<uint32_t>& shared, uint32_t multiplier){
    uint32_t oldValue = shared.load();
    uint32_t newValue = oldValue * multiplier;
    shared.store(newValue);
    return oldValue;
}

Because between load and store, another thread may modify the value of shared.

Consider the problem:

std::atomic<uint32_t> shared{1};
std::thread t1{ fetch_multiply, std::ref(shared), 2 };
std::thread t2{ fetch_multiply, std::ref(shared), 2 };
t1.join();
t2.join();
std::cout << shared;

With the above implementation, the possible output of this program is 2. While the correct one (provided fetch_multiply should be synchronized) must be 4. The problem occurs when both threads first load the initial value 1. Then, they both store their local result 2.

edited Jan 11 '22 at 13:58

answered Jan 11 '22 at 13:51

Daniel Langr

22,196
3
50
93

Sorry i'm not sure what you are referring to. Ar you saying my `fetch_multiply` snippet is wrong? – Dan Jan 11 '22 at 13:57
@Dan You asked: _"What happens if we just write newValue to memory?"_ I answered this question. Why should be your `fetch_multiply` wrong? I didn't write anything like that. – Daniel Langr Jan 11 '22 at 13:58
I see, yes I concluded that to make `fetch_multiply` in this case behave as a single instruction we just make sure nothing in memory gets changed while we are doing our work. Now either thread could execute it first which is not a problem. – Dan Jan 11 '22 at 14:00
Just wanted to confirm with you that, either `t1` or `t2` can be executed first correct? Atomicity doesn't care which goes first. Also what do you mean `fetch_multiply` should be synchronized? Shouldn't it work as is? – Dan Jan 11 '22 at 14:06
@Dan Yes, atomicity itself doesn't care about the order of memory operations performed by different threads. Note that even there is nothing as their global order observed by all threads. Such an order is provided only with _sequential consistency_ (or _total store order_ for stores). – Daniel Langr Jan 11 '22 at 14:39
@Dan By "synchronized" I meant that multiplication should happen atomically, which is not the case with the write-only-based implementation. – Daniel Langr Jan 11 '22 at 14:40
I think this deserves a question on its own, I have posted it here: https://stackoverflow.com/questions/70668734/what-is-thread-synchronization-and-how-does-it-differ-form-atomicity – Dan Jan 11 '22 at 14:48

How does a compare and swap loop achieve atomicity?

2 Answers2