```cpp
void push(T const& data)
{
  node* const new_node = new node(data);
  new_node->next = head.load();
  // On failure, compare_exchange_weak reloads the current head into
  // new_node->next, so each retry links against the latest head.
  while (!head.compare_exchange_weak(new_node->next, new_node,
                                     std::memory_order_release,
                                     std::memory_order_relaxed))
    ;
}
```

How does the above code guarantee that, if two threads start push at the same time and reach compare_exchange_weak at the same time, and the first thread changes the value of head, the second one sees the new head value?
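
For context, here is a minimal compilable sketch of the class this fragment presumably belongs to, plus a two-thread driver matching the scenario in the question. The `node` structure, the class name `lock_free_stack`, and the `head` member are my assumptions; only `push` appears in the original snippet:

```cpp
#include <atomic>
#include <thread>

template <typename T>
class lock_free_stack
{
    struct node
    {
        T data;
        node* next;
        node(T const& d) : data(d), next(nullptr) {}
    };

    std::atomic<node*> head{nullptr};

public:
    void push(T const& data)
    {
        node* const new_node = new node(data);
        new_node->next = head.load();
        // The CAS succeeds only if head still equals new_node->next; on
        // failure it writes the *current* head into new_node->next, so the
        // loser of a race re-reads the winner's update and retries.
        while (!head.compare_exchange_weak(new_node->next, new_node,
                                           std::memory_order_release,
                                           std::memory_order_relaxed))
            ;
    }
};

int main()
{
    lock_free_stack<int> s;
    // Two concurrent pushes: one CAS wins and installs its node; the other
    // fails, observes the new head, links behind it, and succeeds on retry.
    std::thread t1([&] { s.push(1); });
    std::thread t2([&] { s.push(2); });
    t1.join();
    t2.join();
}
```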

  • That's the purpose of atomic operations: they cannot be performed by multiple threads at the same time. Basically, they impose an order on the memory operations on a single memory location. What is unclear about it? Could you provide more details on what you don't understand? – Daniel Langr Apr 20 '21 at 06:41
  • @DanielLangr I can understand that both threads can't see this operation undone, but if the first one changes the value of head and the second thread tries to read the value of head, why should it see the new, updated value? My question relates to the cache coherence protocol. – questionmark Apr 20 '21 at 06:50
  • [Atomic operations](https://en.wikipedia.org/wiki/Linearizability) are (if supported by the hardware) implemented at the CPU level; otherwise, the C++ implementation falls back to a non-lock-free implementation (a mutex). – user1810087 Apr 20 '21 at 06:50
  • @questionmark If your question is about cache coherency, you should tag it properly. Anyway, have you tried to read something about this topic? There are lots of materials all over the internet. You should also specify the system architecture you are curious about; C++ does not care about caches. – Daniel Langr Apr 20 '21 at 06:53
  • @DanielLangr My question is about both cache coherence and C++ atomics. – questionmark Apr 20 '21 at 06:55
  • @questionmark C++ does not care about cache coherency. It's a matter of the C++ implementation, which, however, when it generates suitable instructions, also does not need to care about cache coherency, since it is usually provided by the hardware automatically. Cache coherency simply works independently of where the memory operations originated. – Daniel Langr Apr 20 '21 at 07:08
  • @DanielLangr So can thread 2 see the old value of head after thread 1 wrote to it, or not? – questionmark Apr 20 '21 at 07:22
  • @questionmark No, it can't. You basically wouldn't be able to create multithreaded programs if there wasn't a global order of visibility of memory operation effects on a single memory location. You can read more about it, for example, [here](http://eel.is/c++draft/intro.multithread), but the formal description is not easy to understand. As for your last question, it's not that easy to define what _after_ means; there are no global clocks. – Daniel Langr Apr 20 '21 at 07:48
  • @DanielLangr But atomics were created exactly for that, to create an order. "After" means that I'm assuming some operation happened after (in the time domain) another. – questionmark Apr 20 '21 at 07:57
  • @questionmark But there is no global time, therefore, there is no "time domain". The only way you can define _after_ in your case is that the second thread reads (within `compare_exchange_weak`) the value updated by the first thread. Then, you know that this read happened after an update in the first thread. – Daniel Langr Apr 20 '21 at 08:08
  • @DanielLangr If we take the x86 architecture with MESI, does it guarantee that all atomic stores are synchronized (if one thread writes to a cache line, will the other threads see this cache line changed)? – questionmark Apr 20 '21 at 08:20
  • It depends. In the case of the memory order: yes. Cache coherence requires that a read sees the most recent write before it in the memory order. But it doesn't need to respect the real-time order. E.g., a load could be served from the cache after a store has retired, because the store is still in the store buffer. – pveentjer Apr 20 '21 at 08:43
  • But unless you hook up special measuring equipment, you will not be able to detect this. – pveentjer Apr 20 '21 at 08:44
  • @pveentjer So there is such a possibility? – questionmark Apr 20 '21 at 08:44
  • @questionmark Sure, that's the purpose of cache coherency. Before a core can write anything to its cache line, all corresponding cache lines in other cores must be invalidated first. This is called cache contention, and it can have only one winner. Then, the write is propagated throughout the memory system. When the other (loser) core then wants to access the same cache line, it finds its state _invalidated_. This means that this cache line must again be synchronized with memory, which loads the value written to it by the originally winning core. – Daniel Langr Apr 20 '21 at 08:45
  • It depends on what you mean. If you mean whether you can see a stale value in the memory order: no, absolutely not; otherwise you would not have cache coherence. If you mean the real-time order: yes, certainly. But you can't tell it happened unless you hook up special measuring equipment. – pveentjer Apr 20 '21 at 08:45
  • @pveentjer That's kind of a different case. If one thread just loads the value from some memory location, then it can load the "old" one, since some other thread can have a store in its store buffer (which is not cache-coherent). But in the OP's question, there are RMW operations. With RMW operations, this cannot happen. An RMW requires the _exclusive_ cache line state even before the read. And the write must be propagated to the cache line before the exclusivity is given up. (An RMW needs exclusivity between read and write; a store needs exclusivity just when flushed from a store buffer.) – Daniel Langr Apr 20 '21 at 09:02
  • You are right. My comment was about an SC load and store. For an atomic RMW the rules are stricter, since the load/store need to appear in the memory order next to each other, without any interleaved loads/stores. – pveentjer Apr 20 '21 at 09:07
  • @DanielLangr Did I understand you correctly? If an RMW requires an exclusive cache line in one thread, other threads won't see this line unsynchronized, simply because exclusive for one thread means invalidated for the other threads? – questionmark Apr 20 '21 at 09:09
  • Correct. The CPU will even 'lock' the cacheline so that the cacheline can't be obtained by other CPUs during the RMW. – pveentjer Apr 20 '21 at 09:10
  • @pveentjer So then I can't imagine a situation where an atomic operation isn't visible to other threads right after the store operation, simply because any store operation requires the exclusive state for the cache line. – questionmark Apr 20 '21 at 09:17
  • There are different types of atomic operations: atomic operations that are just a simple load/store, and atomic operations containing an RMW. In the latter case, a CPU needs to see the latest write before it can do its own write. Otherwise the atomic RMW could suffer from problems like lost updates (a sketch of the lost-update problem appears after this comment thread). – pveentjer Apr 20 '21 at 09:20
  • In the case of an atomic store, the cacheline is only required in EXCLUSIVE/MODIFIED state once the store is going to commit to the cache. But as long as the store is in the store buffer, the cacheline can still be in SHARED state, so another CPU could also have the cacheline in SHARED state and still read from it. An atomic store and an atomic load work differently than an atomic RMW. – pveentjer Apr 20 '21 at 09:25
  • @questionmark If a core has a new to-be-stored value in its store buffer, then the other cores still see the old value. But this does not pose any problem. With these simple stores/loads there is no global agreement about when stores should be propagated (a litmus-test sketch of this store-buffering effect appears after this comment thread). – Daniel Langr Apr 20 '21 at 09:27
  • There is global agreement (well... nobody can disagree) on the order of the stores ending up in the memory order; at least on x86. – pveentjer Apr 20 '21 at 09:29
  • @DanielLangr Does it happen because an RMW "locks" the cache line, but simple stores do not? – questionmark Apr 20 '21 at 09:30
  • In theory, even for a store the cacheline is 'locked', because you get the cacheline in EXCLUSIVE/MODIFIED state after the Request For Ownership (RFO) returns. Only when the content is written to the cache is the cacheline 'released'. So in that sense there is still a short period where the cacheline can't be revoked. The primary difference is that for an atomic store this 'locking' of the cacheline happens after the store has executed/retired, but for an RMW the cacheline is locked before the RMW starts and released when the RMW operation retires. – pveentjer Apr 20 '21 at 09:36
  • @pveentjer Yes, there is a global agreement such that the order of stores is visible the same to all cores. I meant that there is no agreement on when a particular single store will happen (be propagated, flushed from a store buffer). – Daniel Langr Apr 20 '21 at 09:37
  • @DanielLangr yes :) – pveentjer Apr 20 '21 at 09:37
  • @questionmark Stores also need to "lock" a cache line when they are flushed from the store buffer. What is different is that **loads do not need to lock cache lines**. They only need the cache line not to be invalidated. – Daniel Langr Apr 20 '21 at 09:41
  • @questionmark These are, therefore, two different cases. When all cores perform an RMW, as in your question, they all "lock" the cache line. In the scenario where cores just do stores and loads, the loads do not lock the cache line. Only once it is invalidated (by another core's store committed from its store buffer) does it need to be synchronized. – Daniel Langr Apr 20 '21 at 09:44
  • @DanielLangr If stores lock the cache line, how then can a load access it? – questionmark Apr 20 '21 at 09:48
  • @questionmark It can access it only before it is "locked" or after the "lock" is released. In the first case, the load reads the original value. In the second, it reads the new stored value. But the load never "locks" the cache line itself. Read some material about cache coherency; even on Wikipedia it's explained well, for instance here: https://en.wikipedia.org/wiki/MESI_protocol. Stack Overflow is not a proper site for learning the complete details of cache coherency mechanisms. – Daniel Langr Apr 20 '21 at 09:54
  • @DanielLangr There is nothing there about cache line locks. Moreover, none of the articles/videos I've investigated mentioned cache line locks either. – questionmark Apr 20 '21 at 10:05
  • @questionmark That's why I marked them as "locks". These "locks" simply make cache lines exclusive for one core and invalidated for the others. You can think of them as locks. It doesn't matter what we call it; it matters how it works. – Daniel Langr Apr 20 '21 at 10:30
  • @DanielLangr What if T1 stores to a cache line (in shared state), puts the store into its store buffer, and invalidates the other threads' cache lines, and at that moment another thread needs this cache line and reads it while the modified data is still in T1's store buffer? Will that force the store buffer to flush, or will it read the old cache line from its cache? – questionmark Apr 20 '21 at 10:41
  • @questionmark Putting a store into the store buffer does not invalidate a cache line. Flushing the store buffer will. And yes, the other thread will read the old value until its cache line is invalidated. Simply think about this invalidation as a "barrier" that separates reading the old and the new value. (BTW, cache lines are not in a store buffer. Just stores are.) – Daniel Langr Apr 20 '21 at 10:58
  • Once the cacheline has been invalidated on the other CPUs, those CPUs need to wait till the cacheline has been released on the first CPU. It will not cause any forced flushing; the first CPU is already working as fast as it can. On Intel CPUs there is a special buffer called the Line Fill Buffer that sits between the load/store buffer and the cache, where stores/loads wait in case of a cache miss. – pveentjer Apr 20 '21 at 10:59
  • If you want to learn more about how the X86 memory model works, I would suggest checking out the following book. https://www.morganclaypool.com/doi/pdf/10.2200/S00962ED2V01Y201910CAC049 – pveentjer Apr 20 '21 at 11:12
  • @DanielLangr Wow, that's interesting. Every article I read about MESI mentioned that the first thread sends a read/invalidate request, then puts the value in its store buffer (or first puts, then sends), and only then commits the change, not that the invalidation request is sent when flushing the buffer. For example, here: [link](http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf) – questionmark Apr 20 '21 at 11:14
  • You need to keep in mind that modern CPUs are speculative, so it might be that the store in the store buffer is not going to retire at all. Also, the whole purpose of placing the item in a store buffer is to avoid waiting for the RFO to return, i.e., to hide the memory latency. It could be that CPUs make use of RFO prefetching to speed up the process a bit, but this is just a performance optimization. – pveentjer Apr 20 '21 at 11:29
  • @pveentjer So is the RFO request made before flushing the store buffer or not? It seems to me that the obvious behavior is like this: T1 puts the value into the store buffer -> sends a read-invalidate request (RFO) -> receives the read response and invalidation acknowledgement -> flushes its store buffer so the newly read cache line now contains its new data. Am I wrong? – questionmark Apr 20 '21 at 11:36
  • The stores are put in the store buffer speculatively and eventually they retire. After that, the stores are written to the cache in PO. When a store wants to enter the cache, the cacheline first needs to be in the right state, which will involve an RFO if the cacheline isn't in MODIFIED/EXCLUSIVE state. Once the ACKs for the RFO have been received, the store can be written to the cache and removed from the store buffer (don't say flushing; that's not correct). The cacheline now contains the updated bytes from the store and can be read by other CPUs. – pveentjer Apr 20 '21 at 11:45
  • @pveentjer But what is the point of putting the value in the store buffer if the cache line is in the modified or exclusive state? I thought a value goes to the store buffer ONLY when the processor has to synchronize with other threads (read/invalidate). – questionmark Apr 20 '21 at 11:48
  • Instructions can be executed speculatively, so it might be that your store needs to be rolled back. Even if there were no speculative execution, on x86 with Total Store Order (TSO) you need to make sure that stores appear in the cache in PO. Imagine you have 2 stores, first A=1 and then B=1, and B is already in modified state while A is in shared state. If you immediately wrote to B and only later to A, the stores would appear out of order. This might be fine on architectures with a weaker memory order, but not under TSO, and also not fine if you are relying on SC. – pveentjer Apr 20 '21 at 11:53
  • @pveentjer But if we are talking only about MESI, it is quite possible, isn't it? – questionmark Apr 20 '21 at 12:01
  • Sure, why not. If you look, for example, at Partial Store Order (PSO), which is a relaxation of TSO, then stores to different locations can appear out of order. So you could retire from the store buffer early or bypass it completely. And you can coalesce writes to the same cache line, which could then appear out of order. But the only ISA I know of that supports PSO is the SPARC v8. – pveentjer Apr 20 '21 at 12:04
  • @pveentjer The last and most important question: could you please show me how the 'lock' taken by an RMW affects store buffers etc., and how a simple store+load affects them (preferably with step-by-step examples such as 1 -> 2 -> ...)? – questionmark Apr 20 '21 at 12:29
  • This part I'm not 100% sure of, but AFAIK it goes like this: first the load/store units are stopped and the CPU waits with the RMW till the SB is drained. Once the cacheline is received in EXCLUSIVE state, the RMW continues. Either the store is written to the SB and the CPU waits again for the SB to be drained, or it immediately writes to the cacheline, bypassing the SB. Then the cacheline is unlocked and the load/store units are enabled. The idea is that the read/write have no interleaved operations, including other atomic operations. But my understanding is a bit shaky here. – pveentjer Apr 20 '21 at 12:39
  • @questionmark I would suggest you take a look at Peter Cordes' answers, such as those related to the 'store buffer' keyword: https://stackoverflow.com/search?q=user%3A224132+store+buffer, but also the other ones. He has some excellent and quite detailed explanations of the issues you are interested in. His answers also usually contain links to other relevant answers. It may take many, many hours of study to understand what you are after, but it will likely be worth it. (It may also be a good idea to remove this question. The discussion in the comments is definitely off-topic on SO.) – Daniel Langr Apr 20 '21 at 12:39
  • I have also found these interesting and relevant sources: https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/ and https://stackoverflow.com/q/63008857/580083 (note that common, though maybe imprecise, terminology uses the word "lock" w.r.t. cache lines; even the Intel instruction set has that `lock` prefix). – Daniel Langr Apr 20 '21 at 14:31
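
The following sketch (my illustrative addition, not code from the discussion) shows the lost-update problem pveentjer mentions: a separate atomic load and atomic store are two independent operations, so another thread's increment can slip in between them, whereas `fetch_add` is a single atomic RMW that holds the cache line exclusively from its read to its write:

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> counter{0};

// Two atomic operations: another thread can increment between the load and
// the store, and that increment is then overwritten (a lost update).
void broken_increment(int n)
{
    for (int i = 0; i < n; ++i)
    {
        int old = counter.load();
        counter.store(old + 1);
    }
}

// One atomic RMW: the cache line stays exclusive between read and write,
// so no update can be lost.
void correct_increment(int n)
{
    for (int i = 0; i < n; ++i)
        counter.fetch_add(1);
}

int main()
{
    std::thread t1(broken_increment, 100000), t2(broken_increment, 100000);
    t1.join();
    t2.join();
    std::cout << "load+store: " << counter << " (typically less than 200000)\n";

    counter = 0;
    std::thread t3(correct_increment, 100000), t4(correct_increment, 100000);
    t3.join();
    t4.join();
    std::cout << "fetch_add:  " << counter << " (always 200000)\n";
}
```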
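
And a sketch of the store-buffering effect discussed above (again my addition): each thread's store may still sit in its store buffer while the other thread's load executes, so both loads can read the old value 0. This is the classic store-buffer litmus test; making all four operations `std::memory_order_seq_cst` forbids that outcome:

```cpp
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

int main()
{
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);  // may linger in T1's store buffer...
        r1 = y.load(std::memory_order_relaxed); // ...while this load already runs
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    // r1 == 0 && r2 == 0 is a permitted outcome here (store buffering);
    // with seq_cst on all four operations it would be forbidden.
    std::cout << "r1=" << r1 << " r2=" << r2 << '\n';
}
```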

0 Answers