
I'm using Rust, but Rust implements the C++ atomic memory model, so I will present my question in C++.

I have an atomic object M. I want to issue a pseudo load/store operation on M, so that the store this operation will "read from" will happen-before this operation, and all loads that will "read from" this store will happen-after it. Basically, I want a memory_order_acq_rel operation, but without changing the value of M.

The first part is easy: a memory_order_acquire load will suffice. But for the second part I need a memory_order_release store, and I don't know the current value of the atomic so I can't store it.

I know I can implement that with a compare-exchange loop that loads the value of M and stores it back:

#include <atomic>

void create_acq_rel(std::atomic<int>& object)
{
    // Acquire load: synchronizes with the store we read from.
    int value = object.load(std::memory_order_acquire);
    // Release store of the same value; if the value changed in the
    // meantime, the failed exchange reloads it with acquire and we retry.
    while (!object.compare_exchange_weak(
        value, value,
        std::memory_order_release, std::memory_order_acquire
    ))
    {
    }
}

However, an obvious downside of this approach is that it generates a compare-exchange loop for no real need. Is it possible to implement that more efficiently?

At first I thought fences could help, but it seems like fences need an actual load/store to synchronize through. Is this true?
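
For reference, this is the pairing pattern I have in mind (a minimal sketch with illustrative names; the relaxed operations on `flag` are what the fences actually synchronize through):

#include <atomic>

int data;                 // plain, non-atomic payload
std::atomic<int> flag{0};

void producer()
{
    data = 42;
    std::atomic_thread_fence(std::memory_order_release); // fence alone pairs with nothing...
    flag.store(1, std::memory_order_relaxed);            // ...it synchronizes through this store
}

void consumer()
{
    if (flag.load(std::memory_order_relaxed) == 1)       // ...and through this load
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        int d = data; // guaranteed to read 42
        (void)d;
    }
}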

I don't want to change the code this should synchronize with before/after, only this part of the code, because I think this will be simpler (I even prefer the compare-exchange loop to changing that code, because a) it is a lot more code and b) it is in the hot path while this code is not).


Context: I have two lock-free linked lists in an arena: a list of partially empty chunks and a list of full chunks. Threads mostly traverse the first list (to find a place to allocate), but I may move an element from the first list to the second (when a chunk becomes full), and a thread currently traversing that element will continue its traversal in the second list.

The first list is fully synchronized on its head: new elements are added to the list only after all previous elements have been initialized, so I can be sure that threads traversing this list visit only fully initialized elements. They load the list head, whose element was initialized before it was put into the list, and all elements after it (I append at the beginning of the lists) were initialized before that.

But sometimes I append an element directly to the second list (when an element is too big to fit in a chunk, I allocate a chunk specifically for it), and threads that were traversing the first list and continued their traversal in the second list may see it uninitialized, because it is not synchronized through the first list's head like the other elements. To fix this, I want the addition of this element to participate in the initialization chain: initialization of prior elements should happen-before it, and it should happen-before initialization of future elements. I know there are other ways to synchronize this (for example, on the `next` pointers), but as I said, I want to touch only the code that appends the element directly to the second list.
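
A rough sketch of the structure, with hypothetical names (the real code is in Rust and more involved):

#include <atomic>

struct Chunk
{
    std::atomic<Chunk*> next{nullptr};
    // ... payload, fully initialized before the chunk is published ...
};

std::atomic<Chunk*> partial_head{nullptr}; // list 1: partially empty chunks
std::atomic<Chunk*> full_head{nullptr};    // list 2: full chunks

// Publishing into list 1: the release CAS on partial_head makes the new
// chunk's initialization visible to any thread that acquire-loads the head.
void push_partial(Chunk* c)
{
    Chunk* old = partial_head.load(std::memory_order_relaxed);
    do
    {
        c->next.store(old, std::memory_order_relaxed);
    } while (!partial_head.compare_exchange_weak(
        old, c, std::memory_order_release, std::memory_order_relaxed));
}

// The problematic case: pushing directly onto list 2. A traverser that
// entered list 2 by following a `next` pointer from list 1 never
// acquire-loads full_head, so nothing orders this chunk's initialization
// before that traverser's reads.
void push_full_direct(Chunk* c)
{
    Chunk* old = full_head.load(std::memory_order_relaxed);
    do
    {
        c->next.store(old, std::memory_order_relaxed);
    } while (!full_head.compare_exchange_weak(
        old, c, std::memory_order_release, std::memory_order_relaxed));
}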

  • @tadman I don't see how this is related to my question. – Chayim Friedman Jan 31 '23 at 18:47
  • Use `fetch_add`? `M.fetch_add(0, std::memory_order_acq_rel);` – NathanOliver Jan 31 '23 at 18:49
  • @NathanOliver This is a genius idea! Can you post it as an answer? – Chayim Friedman Jan 31 '23 at 18:52
  • I think you misunderstand acquire/release semantics. You can't synchronize this way without changing the value of the atomic. Sure, you can prevent compiler optimizations around atomic operations, but different threads won't be synchronized. – sklott Feb 01 '23 at 07:13
  • @sklott This is what I am asking, whether there is a way without changing the value of the atomic. – Chayim Friedman Feb 01 '23 at 08:41
  • I think it's inherently impossible. The purpose of an acquire-release pair is to synchronize, and the acquire load synchronizes with whichever release store it was that stored the value that was loaded. In this case, since this store will store the same value as the previous one, you will never know which one you have synchronized with. It might always be the previous one. So this "no-op" exchange gains you nothing and it really is a no-op. – Nate Eldredge Feb 01 '23 at 15:26
  • I find it helpful to remember that acquire-release, in and of itself, never *makes* one operation happen before another. It allows you to *find out*, based on the value loaded, *whether* one operation happened before another, and you can then make further deductions about the ordering of other operations. Here, since you can't distinguish between two stores that stored the same value, you can't learn which one happened-before your load. – Nate Eldredge Feb 01 '23 at 15:28
  • Memory model analysis is a lot easier with context. Do you have an MRE version of your list algorithm that you can share? – Nate Eldredge Feb 01 '23 at 15:35
  • @NateEldredge But I don't care whether I come before the first, or second, or third, or none. If I always synchronized with the previous store, that means the acquire load also loaded the previous store, which just means I come last, but I am still in the virtual queue of operations. I have the complete code; I can try to construct an MRE. – Chayim Friedman Feb 01 '23 at 17:35
  • From your description, I am still not sure I understand the problem to be solved. When you move an element from the first to the second list, surely that involves loading the head pointer of the second list (to store to the `next` of the element being moved)? And you say all initialization of elements in list 2 was guaranteed visible before head 2 was updated. So, if all these operations are acquire/release, it would seem to follow that anybody entering the second list via the moved element would see all elements fully initialized. – Nate Eldredge Feb 06 '23 at 07:09
  • @NateEldredge All operations in **list 1** are synchronized, not in list 2. – Chayim Friedman Feb 06 '23 at 07:58

2 Answers


You can use `fetch_add` to add 0 to the value, like

M.fetch_add(0, std::memory_order_acq_rel);

This performs an atomic read-modify-write operation, and memory is affected according to the memory ordering specified in the second parameter.
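
Applied to the function from the question, the whole compare-exchange loop collapses to a single RMW (a sketch):

#include <atomic>

void create_acq_rel(std::atomic<int>& object)
{
    // Atomic read-modify-write that leaves the value unchanged: the
    // load half is an acquire, the store half is a release, and the
    // operation takes a slot in the modification order of `object`.
    object.fetch_add(0, std::memory_order_acq_rel);
}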

NathanOliver
  • Interesting, [on x86-64 clang compiles this to a nop](https://cpp.godbolt.org/z/sn1oeWd6e) (gcc does not). Do you think it's a miscompilation? – Chayim Friedman Jan 31 '23 at 19:08
  • @ChayimFriedman I was a little worried about that. It might be that it's able to do that "optimization" since it knows there is no way the atomic object will be modified by another thread. If it is an issue, you could also try using `M.store(M.load(std::memory_order_acquire), std::memory_order_release);`. Hopefully someone else can help figure this out – NathanOliver Jan 31 '23 at 19:18
  • How can it know that? And using `M.store(M.load())` is wrong; the value can change in the meantime and this will overwrite it. – Chayim Friedman Jan 31 '23 at 19:20
  • @ChayimFriedman Sorry, didn't word that right. What I mean is that it could see that you are using the value of `0`, which will cause no change even if other threads are involved, so it can just optimize it out. I was hoping they didn't do something like that. – NathanOliver Jan 31 '23 at 19:23
  • Maybe do this in a function with optimisation disabled. Probably won't cost you much. – Paul Sanders Jan 31 '23 at 20:02
  • @ChayimFriedman: It's interesting to look at what clang does for other ISAs; it's not the same as `atomic_thread_fence(acq_rel)`. https://cpp.godbolt.org/z/czdMrrd9j I'm not sure exactly what's going on. x86's very strong memory model (and simple [mapping from C++ to asm](https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)) may mean that maintaining program order is all that's needed to get everything *ISO C++* guarantees about this statement. – Peter Cordes Jan 31 '23 at 23:49
  • @ChayimFriedman: Note that any assumptions based on there being a coherent shared cache and only local reordering of loads/stores aren't safe. But in terms of that model, an atomic RMW isn't guaranteed to be a single access to cache/memory; the store part can reorder with later loads/stores unless they're both seq_cst, as long as it doesn't let any operations from other CPUs *on the object being atomically modified* sneak into that window between load and store parts. [For purposes of ordering, is atomic read-modify-write one operation or two?](https://stackoverflow.com/q/65568185) – Peter Cordes Jan 31 '23 at 23:50
  • @NateEldredge If this is a no-op, why doesn't clang optimize it on, say, ARM? Of course this is not a proof, but if this is a machine-dependent optimization, then it is probably only sound (if it is sound) because of the strong memory model of x86. – Chayim Friedman Feb 01 '23 at 17:38
  • So, I take it back, it is not a complete no-op. Even though we do not inspect the value of the acquire load, we might still be able to deduce the value that must have been loaded, and thereby get a happens-before. So if we have `if (a.load(relaxed)) { B; (void)a.fetch_add(0, acq_rel); C; }`, it will in general prevent C from being reordered with the *first* load (but not from being reordered with B, because they can both be reordered "between" the load and the store of `fetch_add()`). – Nate Eldredge Feb 04 '23 at 07:59
  • On x86, that first reordering is already impossible, because the earlier load was actually acquire, so the `fetch_add` can be optimized out completely. – Nate Eldredge Feb 04 '23 at 07:59
  • @PeterCordes: This actually makes another interesting example of the "RMW is two operations" principle. If you think an acq_rel RMW is one "indivisible" operation then you would expect `a.fetch_add(0, acq_rel)` to effectively be a full barrier, but it isn't. So on x86 with clang, if you do `b.store(1, release); a.fetch_add(0, acq_rel); t = c.load(acquire);`, you can observe `b.store()` and `c.load()` being reordered. And the only way to explain that in terms of reordering is that they both got reordered into the middle of the `fetch_add`. – Nate Eldredge Feb 04 '23 at 08:08
  • How about `volatile int zero = 0; M.fetch_add(zero, std::memory_order_acq_rel);`? – Paul Sanders Feb 04 '23 at 18:33
  • @PaulSanders: It might happen to defeat the compiler's optimization for now, but it doesn't help with the basic conflict with the memory model. A compiler would be within its rights to optimize it into `int tmp = zero; if (tmp == 0) M.fetch_add(0,acq_rel); else M.fetch_add(tmp, acq_rel);`. Then it could optimize out the `fetch_add(0)` as done here if it has no effect in the machine's memory model, and end up with `int tmp = zero; if (tmp != 0) M.fetch_add(tmp, acq_rel);`. Which is still a no-op as `tmp != 0` will never be true. – Nate Eldredge Feb 06 '23 at 04:25
  • @NateEldredge But consider something like `a.fetch_add(0, acq_rel); a.fetch_add(0, acq_rel)`. Even if an RMW is two operations, isn't it guaranteed that a happens-before relationship forms, either the first with the second or the second with the first? Because if the second load is ordered after the first load but before the first store, the latter store will not modify the correct value, right? – Chayim Friedman Feb 06 '23 at 08:07
  • Agreed, one happens before the other, but I don't see how that is useful information if you don't know which is which. If thread 1 does `X; a.fetch_add(0, acq_rel);` and thread 2 does `a.fetch_add(0, acq_rel); Y;`, where X and Y conflict, then you cannot avoid a data race because it's always possible that the thread 2 fetch_add happens before the one in thread 1, and then you do not get to conclude X happens before Y nor vice versa. – Nate Eldredge Feb 08 '23 at 06:46
  • clang's code gen here makes more sense if you actually use the value returned by fetch_add. Then you get `mfence ; mov eax, [M]`; a barrier and a load but no store. If you write `X ; t = M.fetch_add(0, acq_rel);`, and some other thread does `M.fetch_add(1, acq_rel); Y;`, then if the final value of `M` is equal to `t+1`, you know that X happened before Y. To ensure this, it suffices that X becomes visible before we get the value out of M, hence the mfence. But then it's immaterial whether you actually do the store, because nobody can ever detect it. – Nate Eldredge Feb 08 '23 at 06:54
  • If the value is not used, then the `mov` is unnecessary, because no other thread can tell whether we did it or not. So it is optimized out. And then the `mfence` is also unnecessary because there's nothing on the other side of it, so it goes away as well, leaving no code. – Nate Eldredge Feb 08 '23 at 06:55
  • @NateEldredge I'm starting to understand; it seems I have no options other than synchronizing on the next pointers. Thanks! – Chayim Friedman Feb 08 '23 at 18:30

I can only come up with one situation where a release store of the same value has any observable effect whatsoever: namely, to head a release sequence containing relaxed RMW operations that can already be proved to be later in the modification order.

For instance, the following code:

x.store(3, release); // A
// other operations, B
x.fetch_add(0, release); // C
// more junk
x.fetch_add(1, relaxed); // D

Assume there are no other stores to x.

Now suppose some acquire load L in another thread loads the value 4 from x. This must have been stored by D. Since D happens after C (by sequencing), D's store follows C's in the modification order of x. Therefore D is part of the release sequence headed by C. So C synchronizes with L, and in consequence, we can conclude that B happens before L. We would not have been able to conclude that if C were not there.
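
In code, the thread performing L might look like this (a sketch; `acquire` here is shorthand for `std::memory_order_acquire`, as with the orderings above):

void observer(std::atomic<int>& x)
{
    int v = x.load(acquire); // L
    if (v == 4)
    {
        // 4 can only have been stored by D, which is in the release
        // sequence headed by C, so C synchronizes with L: B (and
        // everything else sequenced before C) happens-before this point.
    }
}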

Note this only works because D is in the same thread as C. If D were in some other thread, then when L loaded the value 4, we would not know whether D was part of the release sequence headed by C, or that of A. If it were A, then B would not happen before L. We therefore have no way to ever prove that B happens before L.

For the same reason, if the release sequence headed by C contains no stores of values other than 3 (e.g. if line D were not there at all), then we could never prove that B happens before L.

It also does not work if D is a simple store of 4, rather than a read-modify-write. As of C++20, later stores in the same thread, even though they are clearly later in the modification order, cannot participate in a release sequence. See "What is the effect of the change to the definition of release sequences in the C++20 memory model?".

All that said, a much simpler way to ensure that B happens before L would be to upgrade line D to release, and then line C can be removed. So it's difficult to see why something like C would ever be desirable in practice.
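
That is, in the snippet above (same shorthand):

x.store(3, release);     // A
// other operations, B
x.fetch_add(1, release); // D upgraded from relaxed; line C removed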

Nate Eldredge
  • Oh wait, maybe the idea is that B and C happened in a different thread. If that thread inspects the value returned by C and sees that it is 3, and a third thread then loads x and gets the value 4, we know that B happened before the third thread's load. That can't be accomplished by merely strengthening D. I'll have to keep thinking about this. – Nate Eldredge Feb 06 '23 at 07:31