Reverse CAS where store only takes place if new value has not changed. Is there any way to achieve this atomically?

Question

Basically what I am looking for is something that will accomplish the following logic atomically.

#define FAILED 1
#define SUCCESS 0

int foo (uint64_t * src, uint64_t * dst, uint64_t expected) {
    if (*src == expected) {
        *dst = *src;
        return SUCCESS;
    }
    return FAILURE;
}

For this particular use case expected == 0 in all cases and *dst CANNOT be modified by any other thread. *src, however can be modified by other threads concurrently (those other threads CAN be on another cores otherwise I could use restartable sequences)

For this to be correct *dst CANNOT be modified if *src != expected (with expected == 0).

I have the invariant *src != *dst except if *src == *dst == 0 (this might be useful for finding a solution).

Finally if it enables any potential solutions I can gurantee that *src and *dst are either on the same cache line or different cache lines.

I don't believe any of the builtin atomic operation (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html) can achieve this so I think the only way to do this will be with some inline assembly using the lock prefix or to use some side effect of some function (i.e something along the lines of the fact that CAS will set expected on failure).

The summarize I am looking for either a way to implement a custom atomic operation with inline asm or some way I can use bultin atomics to achieve this.

Thank you!

Why `*dst = *src;` instead of `*dst = expected;`? The non-atomicity is just an accident of writing it in an unfortunate way — harold, Aug 03 '20 at 01:09
Because ```*src``` could have changed between the time of comparison and setting. I only wait to set ```*src``` if ```*dst``` has not changed. — Noah, Aug 03 '20 at 01:15
`*src` being briefly `expected` and changing back to non-expected, and `*dst` getting set to `expected` anyway, is consistent with this operation being atomic with the contraction point before `*src` is changed. — harold, Aug 03 '20 at 01:25
It could be observed in that partial state with a simd register — Noah, Aug 03 '20 at 01:43
It's not a partial state, and it doesn't matter (not to atomicity, it clearly matters to you). The states that can be captured are consistent with this operation being atomic, ie they look like "operation has not happened yet" or like "the operation has happened", there is no state that looks like this operation is partway done. The guarantee you seem to be after is something else than atomicity. — harold, Aug 03 '20 at 01:56
A normal CAS only has one memory operand! This is like a partial [DCAS aka CAS2](https://en.wikipedia.org/wiki/Double_compare-and-swap). You could emulate it with a wide CAS (like `lock cmpxchg16b`), but that could spuriously fail if the `dst` had changed. — Peter Cordes, Aug 03 '20 at 05:13
I think a source of your confusion is that “happens at the same time” is not a thing in multi-threaded programs. The operation you want does not need any atomic operations because as other's mentioned, there is no invalid partial state another thread could observe. Hence, if you implement this operation as `if (*src == expected) *dst = expected;` it will work just fine. If you want some more explanation on this, please let me know and I'll write an answer. — fuz, Aug 03 '20 at 10:19
Could you provide some context? Why do you need this kind of operation? This might help to come up with a solution... — mpoeter, Aug 03 '20 at 15:54
What are you seeking to avoid **exactly**? `A` sets `*src` to non-zero, then reads `*dst` as non-zero. Meanwhile `B` calls `foo`. Nobody writes to `*src`, but despite this `A` sees `*dst` as zero later. Is that an example of what you are trying to avoid? If that *isn't* a problem, then solving this is easy. If that specific issue is a problem, then a solution is harder. — Yakk - Adam Nevraumont, Aug 03 '20 at 20:28

score 3 · Answer 1 · answered Aug 03 '20 at 00:51

3

I don't understand what you want this primitive for, and I think it fundamentally doesn't make sense to call it "atomic". If there was any moment where *src was observed as having the value of expected, it's valid for the store to *dst to take place. There is no way for any other thread to observe atomicity of lack thereof for this operation. So just writing it as you've written it (albeit with an appropriate atomic type for *src) should be fine.

If you disagree with this reasoning, please explain the bigger picture of what you want to use the primitive in, since it's almost certainly wrong and the real problem is to understand how that's wrong and figure out something that works in its place. (And the solution is almost surely to stop trying to use atomics and use a lock.)

answered Aug 03 '20 at 00:51

R.. GitHub STOP HELPING ICE

208,859
35
376
711

Basically I want to set ```*src``` to 0 iff ```*dst``` is currently 0. If another thread modifies ```*dst``` I want the store operation to fail. So I need the ```if (*dst == 0)``` and ```*src = 0``` an all or nothing operation (thats atomic no?) – Noah Aug 03 '20 at 00:54
@Noah: Your comment seems to have `src` and `dst` reversed from the pseudocode in the question. Atomicity is about whether another thread can observe a partially-completed state of the operation. I don't see how that's relevant here. – R.. GitHub STOP HELPING ICE Aug 03 '20 at 01:00
Oh that comment does have ```src``` and ```dst``` flipped. The partially completed operation would be between the ```if (*src == 0)``` and ```*dst = *src```. The program requires the gurantee that in order to change any value in a set of values one of those values must be 0. That is what I am trying to accomplish w.o locks. – Noah Aug 03 '20 at 01:06
@Noah: How do you observe that "partially completed operation"? – R.. GitHub STOP HELPING ICE Aug 03 '20 at 01:18
```uint64_t * cache_line_with_src_and_dst;``` Thread A: 1) ```if(*src == 0)``` 2) ```*dst = *src``` Thread B: 1) ```*dst = ``` Thread C: 1) ```__m512i temp = *(__m512i *)cache_line_with_src_and_dst);``` A-1, B-1, C-1, A-2 – Noah Aug 03 '20 at 01:25
Could be any simd instruction really. – Noah Aug 03 '20 at 01:32
@Noah: So now you've changed the problem to wanting atomic AVX512 vectors, which AFAIK don't exist. I think you need to accept that this is a problem that requires a lock and use a lock. (This is almost always the answer to any question involving atomics.) – R.. GitHub STOP HELPING ICE Aug 03 '20 at 02:05
Is reading __m512i atomic? I don’t think that’s guaranteed. – prl Aug 03 '20 at 02:05
I dont think this entails atomic avx512 at any point C could copy the array and read it in the partial state of neither ```*dst``` nor ```*src``` being 0. – Noah Aug 03 '20 at 02:21
2

@Noah: That's not the partial state in question. Nothing about your desired primitive precludes both being nonzero. I can't know for sure, but I'm getting the impression that you keep confusing yourself about what you even want and what the supposed invariants of it are. – R.. GitHub STOP HELPING ICE Aug 03 '20 at 02:25
@prl: An aligned `__m512i` load is not *guaranteed* atomic in any future-proof way, but in practice on all current CPUs with AVX512 it is in practice believed to be atomic. Same for `__m256i` on Haswell+ and I think Zen2. Unfortunately vendors decline / neglect to provide CPUID feature bits to detect when this is the case on any given CPU, so taking advantage of it means you're on you're own. IDK if people do this for real in high-performance code. But anyway, you can trivially do an 8-byte atomic load of two 4-byte values, or an inefficient 16-byte atomic load with `lock cmpxchg16b`. – Peter Cordes Aug 03 '20 at 05:05
@prl: Caveat: it's not as simple as having 256-bit or 512-bit data paths between L1d and load/store execution units. [SSE instructions: which CPUs can do atomic 16B memory operations?](https://stackoverflow.com/a/7647825) shows a case where K10 Opteron can tear 16-byte load/store on 8-byte boundaries, but *only* between cores on different sockets, over HyperTransport. – Peter Cordes Aug 03 '20 at 05:08

score 1 · Answer 2 · answered Aug 03 '20 at 05:18

A normal CAS only has one memory operand! This is like a partial DCAS aka CAS2.

You could emulate it with a wide CAS (like lock cmpxchg16b) if you put src and dst adjacent to each other, but that could spuriously fail if the dst had changed.

Or you could use Intel TSX transactional memory, if you have a CPU where it's not disabled via microcode updates. (HLE is disabled in microcode for some kind of MDS vulnerability reasons, I think, but AFAIK RTM is still usable on CPUs where it wasn't disabled for correctness reasons, i.e. working on Skylake and later at least, maybe some Broadwell.)

Note that AVX-512 loads of a whole cache line are not guaranteed to be atomic, but on all current AVX-512 CPUs it's believed that they are in practice. Unfortunately vendors don't bother to provide any CPUID way to detect atomicity guarantees wider than 8 bytes even though 16-byte SIMD load/store atomicity is widespread. Beware that the interconnect between cores can cause tearing, not just a narrow load/store execution unit: SSE instructions: which CPUs can do atomic 16B memory operations? shows tearing on 8-byte boundaries on K10 Opteron only between cores in different sockets, thanks to HyperTransport.

Reverse CAS where store only takes place if new value has not changed. Is there any way to achieve this atomically?

2 Answers2