16

I don't understand why there would be problems without the release sequence if we have 2 threads in the example below. We have only 2 operations on the atomic variable count, and count is decremented sequentially, as shown in the output.

From C++ Concurrency in Action by Anthony Williams:

I mentioned that you could get a synchronizes-with relationship between a store to an atomic variable and a load of that atomic variable from another thread, even when there’s a sequence of read-modify-write operations between the store and the load, provided all the operations are suitably tagged. If the store is tagged with memory_order_release, memory_order_acq_rel, or memory_order_seq_cst, and the load is tagged with memory_order_consume, memory_order_acquire, or memory_order_seq_cst, and each operation in the chain loads the value written by the previous operation, then the chain of operations constitutes a release sequence and the initial store synchronizes-with (for memory_order_acquire or memory_order_seq_cst) or is dependency-ordered-before (for memory_order_consume) the final load. Any atomic read-modify-write operations in the chain can have any memory ordering (even memory_order_relaxed).

To see what this means (release sequence) and why it’s important, consider an atomic<int> being used as a count of the number of items in a shared queue, as in the following listing.

One way to handle things would be to have the thread that’s producing the data store the items in a shared buffer and then do count.store(number_of_items, memory_order_release) #1 to let the other threads know that data is available. The threads consuming the queue items might then do count.fetch_sub(1, memory_order_acquire) #2 to claim an item from the queue, prior to actually reading the shared buffer #4. Once the count becomes zero, there are no more items, and the thread must wait #3.

#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

std::vector<int> queue_data;
std::atomic<int> count{0};
std::mutex m;

void process(int i)
{
    std::lock_guard<std::mutex> lock(m);
    std::cout << "id " << std::this_thread::get_id() << ": " << i << std::endl;
}


void populate_queue()
{
    unsigned const number_of_items = 20;
    queue_data.clear();
    for (unsigned i = 0; i < number_of_items; ++i)
    {
        queue_data.push_back(i);
    }

    count.store(number_of_items, std::memory_order_release); //#1 The initial store
}

void consume_queue_items()
{
    while (true)
    {
        int item_index;
        if ((item_index = count.fetch_sub(1, std::memory_order_acquire)) <= 0) //#2 An RMW operation
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(500)); //#3
            continue;
        }
        process(queue_data[item_index - 1]); //#4 Reading queue_data is safe
    }
}

int main()
{
    std::thread a(populate_queue);
    std::thread b(consume_queue_items);
    std::thread c(consume_queue_items);
    a.join();
    b.join();
    c.join();
}

output (VS2015):

id 6836: 19
id 6836: 18
id 6836: 17
id 6836: 16
id 6836: 14
id 6836: 13
id 6836: 12
id 6836: 11
id 6836: 10
id 6836: 9
id 6836: 8
id 13740: 15
id 13740: 6
id 13740: 5
id 13740: 4
id 13740: 3
id 13740: 2
id 13740: 1
id 13740: 0
id 6836: 7

If there’s one consumer thread, this is fine; the fetch_sub() is a read, with memory_order_acquire semantics, and the store had memory_order_release semantics, so the store synchronizes-with the load and the thread can read the item from the buffer.

If there are two threads reading, the second fetch_sub() will see the value written by the first and not the value written by the store. Without the rule about the release sequence, this second thread wouldn’t have a happens-before relationship with the first thread, and it wouldn’t be safe to read the shared buffer unless the first fetch_sub() also had memory_order_release semantics, which would introduce unnecessary synchronization between the two consumer threads. Without the release sequence rule or memory_order_release on the fetch_sub operations, there would be nothing to require that the stores to the queue_data were visible to the second consumer, and you would have a data race.

What does he mean? That both threads should see the value of count as 20? But in my output, count is sequentially decremented across the threads.

Thankfully, the first fetch_sub() does participate in the release sequence, and so the store() synchronizes-with the second fetch_sub(). There’s still no synchronizes-with relationship between the two consumer threads. This is shown in figure 5.7. The dotted lines in figure 5.7 show the release sequence, and the solid lines show the happens-before relationships.

Ivan Kush
  • What is the question really? Why doesn't the std just say that an acq read syncs with all the rel stores that ever occurred? – curiousguy Apr 08 '19 at 04:05

6 Answers

11

It means that the initial store synchronizes-with the final load even if the value read by the final load isn't directly the value stored at the beginning, but a value modified by one of the atomic read-modify-write instructions that raced in between. A simpler example, assuming there are three threads racing to execute these instructions (assume x is initialized to 0 before the race):

std::atomic<int> x{0};  // shared; initialized to 0 before the race

// Thread 1:
A;
x.store(2, std::memory_order_release);

// Thread 2:
B;
int n = x.fetch_add(1, std::memory_order_relaxed);
C;

// Thread 3:
int m = x.load(std::memory_order_acquire);
D;

What are the possible values read into n and m, given the possible outcomes of the race? And what guarantees do we have on the ordering of instructions A, B, C, and D based on what we read in m and n? For n we have two cases, either 0 or 2. For m we could read 0, 1, 2, and 3. There are six valid combinations of the two. Let's look at each case:

  • m = 0, n = 0. We don't have any synchronizes-with relationship, thus we can't infer any happens-before relationship except the obvious B happens-before C.

  • m = 0, n = 2. Even though the fetch_add operation read the value written by the store, since the fetch_add has relaxed memory ordering there is no synchronizes-with relationship between the two instructions. We can't say that A happens-before C.

  • m = 1, n = 0. As before, since the fetch_add doesn't have release semantics we can't infer a synchronizes-with relationship between the fetch_add and the load operation, hence we don't know whether B happens-before D.

  • m = 2, n = 0. The value read by the acquire load was written by the release store. We are guaranteed that the store synchronizes-with the load, hence A happens-before D.

  • m = 2, n = 2. Same as above: the store synchronizes-with the load, hence A happens-before D. As usual, the fact that the value read by the fetch_add is the same as the one stored by thread 1 does not imply any synchronization relationship.

  • m = 3, n = 2. In this case the value read by the load was written by the fetch_add, and the value read by the fetch_add was written by the store. However, because the fetch_add has relaxed semantics, no synchronization can be assumed between the store and the fetch_add, or between the fetch_add and the load. On the face of it, no synchronization can be assumed in this case either, just as for m = 0, n = 0. This is where the release sequence concept comes in handy: the release store in thread 1 will synchronize-with the acquire load in thread 3 as long as the value being read was written within the release sequence, which includes

    1. all the stores performed later by the same thread that performed the release operation, and
    2. all the atomic read-modify-write operations which read a value written earlier in the same release sequence.

    In this case, since fetch_add is an atomic read-modify-write operation, we know that the store in thread 1 synchronizes-with the load in thread 3, and thus A happens-before D. We still can't say anything about how B and C are ordered relative to A and D, though. (A compilable sketch of this case follows.)
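
Here is a minimal compilable sketch of that m = 3, n = 2 case (my own code, not from the original answer), with a non-atomic payload standing in for A and D:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0};
int payload = 0;   // non-atomic data, standing in for A (the write) and D (the read)

void thread1()
{
    payload = 42;                              // A
    x.store(2, std::memory_order_release);     // heads the release sequence
}

void thread2()
{
    x.fetch_add(1, std::memory_order_relaxed); // RMW: continues the release sequence
}

void thread3()
{
    int m = x.load(std::memory_order_acquire);
    if (m == 3)                 // we read the value written by thread 2's fetch_add,
        assert(payload == 42);  // D: which itself read 2 from the store, so the store
                                // synchronizes-with our load and A happens-before D
}

int main()
{
    std::thread t1(thread1), t2(thread2), t3(thread3);
    t1.join(); t2.join(); t3.join();
}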

In your case you have this pseudocode, assuming number_of_items = 2:

// Thread 1
Item[0] = ...;
Item[1] = ...;
count.store(2, memory_order_release);

// Thread 2
int i2 = 0;
while ((i2 = count.fetch_sub(1, memory_order_acquire)) <= 0) sleep();
auto x2 = Item[i2 - 1];
process(x2);

// Thread 3
int i3 = 0;
while ((i3 = count.fetch_sub(1, memory_order_acquire)) <= 0) sleep();
auto x3 = Item[i3 - 1];
process(x3);

Let's assume that the first positive value read into i2 is 2, and thus the first positive value read into i3 is 1. Since the value read by Thread 2 was written by the store in Thread 1, the store synchronizes-with the load, and we know that Item[1] = ...; in Thread 1 happens-before auto x2 = Item[1]; in Thread 2. However, the value 1 read by Thread 3 was written by Thread 2, with a fetch_sub which has no release semantics. The fetch_sub in Thread 2 thus does not synchronize-with the fetch_sub in Thread 3. However, since the fetch_sub in Thread 2 is part of the release sequence headed by the store in Thread 1, the store in Thread 1 also synchronizes-with the fetch_sub in Thread 3, from which we know that Item[0] = ...; happens-before auto x3 = Item[0];.

pqnet
  • agreed! That comment has clarified a lot for me – mkmostafa Dec 13 '18 at 09:30
  • `except for the obvious B happens-before C` Are you sure about that? In Sutter's Atomic Weapons, he says that memory_order_relaxed does not generate any barriers, so anything can flow up and down. Therefore B would not be guaranteed to happen before C? – geranimo Dec 01 '20 at 02:57
  • @geranimo Operations in the same thread are always guaranteed to be sequentially ordered, in the same order you wrote them. This was true even in the C++98 semantic. – pqnet Dec 01 '20 at 11:04
  • @pqnet That's not true. The compiler and CPU are allowed to reorder things out as it fits them. Only instructions in the same thread that have a dependency are required to happen after their dependencies get computed. – geranimo Dec 02 '20 at 17:34
  • @geranimo from an observable memory order point of view (which is the only thing considered by `happens-before`) it does. It is `sequenced-before`, which is even stronger. – pqnet Dec 02 '20 at 18:57
  • @pqnet Yeah ok the confusion is because `happens-before` is not the same as happening before. B is not guaranteed to happen before C even if B `happens-before` C – geranimo Dec 02 '20 at 19:18
  • @pqnet and from the [rules of `sequenced-before`](https://en.cppreference.com/w/cpp/language/eval_order) where do you see that B is `sequenced-before` C? – geranimo Dec 02 '20 at 19:40
  • @geranimo rule #1. They are `full expressions` because they are split by a semicolon. Complicated rules are for subexpressions (like parameter evaluation) where you don't know which one is evaluated first except in some cases. Note that `sequenced-before` and `happens-before` do not preserve the english meaning literally, `happens-before` means that you can rely on side effects from the first expression to be observable in the second expression, has nothing to do with instruction reordering (which for `sequenced-before` instructions happens only if you don't observe these side effects) – pqnet Dec 03 '20 at 19:13
  • So what's the benefit of using `memory_order_acquire` instead of `memory_order_acq_rel` in `fetch_sub`? – LunarEclipse Mar 10 '23 at 09:13
  • @LunarEclipse by requiring a smaller amount of synchronization and ordering in the code you write (just the minimal requirement so that your program is still correct) you may enable your compiler or CPU to emit more efficient code. It is not clear to me yet whether compilers are taking advantage of this, for some architecture the only way you can guarantee that a minimal `acquire`/`release` ordering may be to emit full memory barrier, but if you write your code correctly you may take advantage of (or get screwed by, if you do it wrong) future compiler optimization patterns. – pqnet Mar 12 '23 at 22:06
6

I stumbled over the exact same question you did. I thought I had understood it correctly, and then he comes in with this example and only uses std::memory_order_acquire. It was difficult to find any good information on this, but finally I found some helpful sources. The main thing I was not aware of was the simple fact that read-modify-write operations ALWAYS work on the newest/latest value, no matter what memory order is given (even std::memory_order_relaxed). This ensures that you won't get the same index twice in the example. Still, the ordering of the operations can mix up (so you don't know which fetch_sub happens-before the other).

This is an answer from Anthony Williams himself, stating that read-modify-write operations always work on the latest value: Concurrency: Atomic and volatile in C++11 memory model

Additionally, someone asked about fetch_sub in combination with the shared_ptr reference count. Anthony Williams responded there too, and brings clarity to the situation with the reordering of the fetch_sub: https://groups.google.com/a/isocpp.org/forum/#!topic/std-discussion/OHv-oNSuJuk
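
As a self-contained illustration of that fact (a sketch of my own, not from the linked posts): even with std::memory_order_relaxed, concurrent fetch_sub calls each claim a distinct index, because every RMW reads the latest value in the modification order.

#include <atomic>
#include <cassert>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

std::atomic<int> count{10000};
std::set<int> claimed;   // bookkeeping only, to check uniqueness of claimed values
std::mutex m;

void consumer()
{
    while (true)
    {
        int v = count.fetch_sub(1, std::memory_order_relaxed);
        if (v <= 0) return;                      // queue drained
        std::lock_guard<std::mutex> lock(m);
        bool inserted = claimed.insert(v).second;
        assert(inserted);                        // no two threads ever claim the same value
    }
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(consumer);
    for (auto& t : threads) t.join();
    assert(claimed.size() == 10000);             // every value 1..10000 claimed exactly once
}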

phön
5

What does he mean? That both threads should see the value of count as 20? But in my output, count is sequentially decremented across the threads.

No, he doesn't. All modifications to count are atomic, so both reader threads would always see different values for it in the given code.

He's talking about the implications of the release sequence rule, namely that when a given thread performs a release store, multiple other threads that then perform acquire loads of the same location form a release sequence, in which each subsequent acquire load has a happens-before relationship with the storing thread (i.e. the completion of the store happens-before the load). This means that the load operation in the reader thread is a synchronisation point with the writer thread, and all memory operations in the writer prior to the store must complete and be visible in the reader when its corresponding load completes.

He's saying that without this rule, only the first thread would be thus synchronised to the writer. The second thread would therefore have a data race in accessing queue (note: not count, which is protected anyway by atomic access). Theoretically, memory operations on the data occurring before the store to count might not be visible to reader thread number 2 even after its own load operation on count. The release sequence rule assures that this will not happen.

In summary: the release sequence rule assures that multiple threads can synchronise their loads on a single store. The synchronisation in question is that of memory accesses to data other than the actual atomic variable being synchronised on (which is guaranteed to be synchronised anyway due to being atomic).

Note to add here: for the most part these kind of issues are only of concern on CPU architectures that are relaxed about reordering their memory operations. The Intel architecture is not one of them: it is strongly-ordered and has only a few very specific circumstances in which memory operations can ever be reordered. These kind of nuances are mostly only relevant when talking about other architectures, such as ARM and PowerPC.

Smeeheey
  • `He's saying that without this rule, only the first thread would be thus synchronised to the writer. The second thread would therefore have a data race in accessing queue.` Do you mean, that without `release sequence` 1st consumer thread will see that `queue` has values `{1, 2, ..., 18, 19}` and the 2nd consumer may see that `queue` has only values `{ 1, 2, 3 }`? – Ivan Kush Jul 25 '16 at 21:43
  • More generally, a data race implies behaviour is undefined. But yes, your example is hypothetically possible. You can imagine a setup where reader thread 2, running on a different physical CPU and cache, would access a stale version of `queue` because in the absence of the release sequence requirement its cache line containing `queue` was not synched to the other cache with the updated `queue`. But as I said, this couldn't happen on an Intel, only on other architectures – Smeeheey Jul 26 '16 at 08:01
  • Can you describe that other architecture? How could a compiler make this unsafe, trying as hard as possible? – curiousguy Apr 23 '19 at 00:29
  • This is not about the compiler but rather about the CPU's dynamic memory access behaviour. Suppose you have 3 physical CPUs with 3 distinct L1 caches. CPU1 is the writer, CPU2 and CPU3 are readers. When CPU1 performs the store in its local cache and due to the aquire-release sync of the fetch_sub CPU1 has its equivalent cache-line updated before the load. However, without the release sequence rule CPU1 has no such obligation to the cache of CPU3, so CPU3 ends up reading stale values of `queue`. Note I don't know if any CPU architectures actually behave this way, this is hypothetical. – Smeeheey Apr 24 '19 at 11:23
  • For any architecture that uses fences, a compiler must insert a fence on release and acquire to a shared atomic object (local atomics may be optimized out). Then it wouldn't matter whether there is even a release sequence. Even if the last modification was done by another thread (and not a RMW), the guarantee that writes of all thread that performed rel **ever** are visible by all threads that performed acq seems practically unavoidable in practice as-if there was H-B even if formal semantics says otherwise. – curiousguy May 07 '19 at 23:06
  • Any number of threads can do a pure load, like `foo.load(acquire)` and synchronize without depending on the release-sequence rule. A release-sequence is important when the threads are doing an acquire **RMW**, not just a load. Or when there's some *other* RMW operation (by any thread, including a 3rd party) on the atomic object, even if the RMW itself is `relaxed`. – Peter Cordes Sep 28 '22 at 05:29
2

fetch_sub is a read-modify-write operation. It atomically reads the value from the memory address, decrements it by the argument provided, and then writes it back to the memory address. It all happens atomically.

Now, every atomic operation reads and writes the memory location directly. The CPU does not rely on a stale value cached in a register for a performance gain; it reads and writes the location directly and prevents other CPUs from modifying it at the same time.

what "plain" (==relaxed) atomicity does not provide is reordering. both the compiler and the CPU scramble reads and writes in order to speed up the execution of the program.

look at the example below:

atomic integer i
regular integer j

Thread A:
i <- 5
//do something else
i -> j
//make some decisions regarding j value.

Thread B:
i++

If no memory ordering is supplied, the compiler and the CPU are allowed to transform the code to

Thread A:
i -> j
i <- 5
//do something else
//make some decisions regarding j value.

Thread B:
i++

which is of course not what we wanted; the decision making is wrong.

What we need is memory ordering.

memory order acquire: memory accesses after it may not be reordered before it
memory order release: memory accesses before it may not be reordered after it

going back to the question:

fetch_sub both reads a value and writes a value. By specifying memory order acquire we say "I only care about the ordering of actions with respect to the read";
by specifying memory order release we say "I only care about the ordering of actions with respect to the write."

But you do care about memory access before and after!

If you only have one consumer thread, then the write half of the fetch_sub does not affect anyone, because the producer uses a plain store anyway and the effects of the fetch_sub are only visible to the thread that invoked it. In that case, you only care about the read: the read gives you the current, updated index. What happens after you store the updated index (let's say x-1) is not that important.

But since there are two threads which read and write the counter, it is important that thread A is aware that thread B wrote a new value to the counter, and that thread B is aware that thread A is about to read the value of the counter. And vice versa: thread B must be aware that thread A wrote a new value to the counter, and thread A must be aware that thread B is about to read a value from the counter.

You need both guarantees: every thread states that it is about to both read and write the shared counter. The memory order you need is std::memory_order_acq_rel.

But the example is tricky: the producer thread simply stores a new value in the counter, regardless of the value that was there before. If the producer thread were to increment the counter each time it pushes a new item, you would have to use std::memory_order_acq_rel in both the producer and the consumer threads, even if you had only one consumer.
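
A minimal sketch of that incrementing-producer variant (my own code, assuming a single producer; the consumer here tracks its own cursor with an acquire load, rather than using fetch_sub, since once the producer increments concurrently the value returned by a fetch_sub no longer maps to a unique item):

#include <atomic>
#include <thread>
#include <vector>

std::vector<int> items(100);   // pre-sized so the producer never reallocates
std::atomic<int> published{0}; // how many items the producer has made visible

void producer()
{
    for (int i = 0; i < 100; ++i)
    {
        items[i] = i * i;                                  // write the item first...
        published.fetch_add(1, std::memory_order_acq_rel); // ...then publish it, as an
                                                           // RMW, per the answer's advice
    }
}

void consumer()
{
    for (int next = 0; next < 100; )
    {
        if (published.load(std::memory_order_acquire) > next)
        {
            int value = items[next]; // safe: the publishing fetch_add synchronizes-with our load
            (void)value;             // process(value) would go here
            ++next;
        }
        else
        {
            std::this_thread::yield(); // nothing new yet
        }
    }
}

int main()
{
    std::thread p(producer), c(consumer);
    p.join();
    c.join();
}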

David Haim
  • "you need both guarantees - every thread states that it is about to both read and write to the shared counter. the memory order you need is `std::memory_order_acquire_release` " Do you mean, that an example from the book is not correct? Williams uses only `memory_order_acquire` and `memory_order_release` – Ivan Kush Jul 25 '16 at 21:39
  • i'm fairly sure that instructions in the same thread that are ordered in program order have a `happens-before` relationship too, hence the value to which `j` is set will either be `5` or `6` (if the `i++` races between the `i<-5` and the `j<-i` instructions) – pqnet Jun 29 '17 at 13:33
2

I got confused by the same problem, and I couldn't figure out the meaning of "release sequence", so I looked up the definition in ISO/IEC 14882:2011.

The standard first defines the modification order:

All modifications to a particular atomic object M occur in some particular total order, called the modification order of M. If A and B are modifications of an atomic object M and A happens before (as defined below) B, then A shall precede B in the modification order of M.

Then it defines the release sequence:

A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation

  • is performed by the same thread that performed A, or
  • is an atomic read-modify-write operation

Additionally, the standard gives the definition of the release operation:

memory_order_release, memory_order_acq_rel, and memory_order_seq_cst: a store operation performs a release operation on the affected memory location.

Last but not least, the standard specifies the behavior of atomic read-modify-write operations:

Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.
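
That last rule is what forbids lost updates. A small sketch of the consequence (mine, not from the standard): even fully relaxed RMWs never overwrite each other, because each one reads the latest value in the modification order.

#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> counter{0};

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 100000; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads)
        th.join();
    assert(counter == 400000); // holds on every run: no increment is ever lost
}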

I hope the detailed definitions can help somebody who also gets puzzled by Chapter 5.3.4, Release sequences and synchronizes-with.

syby119
1

The release-sequence guarantee lets an acquire load (or the load side of an acquire RMW) sync-with a release store even if the value it loads isn't directly from the release-store. It can still sync if the value is later in the modification-order of the atomic variable, if (and only if) all the intervening modifications were atomic RMWs, not pure stores.

e.g. foo.store(10, seq_cst) by another thread would break a release sequence headed by foo.store(10, release) or foo.store(10, seq_cst), but foo.compare_exchange_weak(x, y, relaxed) would continue it, even if the CAS succeeds and stores a new value, and even if that CAS was done by a 3rd thread, or by the original writer, or by the one that's going to do the acquire load. Same for fetch_sub.

In this case, with the RMWs having acquire ordering, they all sync-with the original .store(number_of_items, release) to make it safe to read from queue_data[item_index - 1].
(It also has to be an atomic fetch_sub so different threads can claim a single entry for themselves. Execution of those fetch_sub operations is effectively serialized.)

In a hypothetical world without the release-sequence rule (where RMWs broke the chain), the producer might need a second variable, like a data_ready flag that it sets to true, and readers would need to do an acquire load from that to load a value coming directly from the thread that stored to the non-atomic array.
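
A sketch of that foo example (my own code, illustrating the contrast described above):

#include <atomic>
#include <thread>

std::atomic<int> foo{0};
int payload = 0;   // non-atomic shared data

void writer()
{
    payload = 123;
    foo.store(10, std::memory_order_release);  // heads a release sequence
}

void rmw_thread()        // variant A: an RMW continues the release sequence
{
    int expected = 10;
    foo.compare_exchange_weak(expected, 20, std::memory_order_relaxed);
}

void pure_store_thread() // variant B: a pure store does NOT continue it
{
    foo.store(20, std::memory_order_seq_cst);
}

void reader()
{
    if (foo.load(std::memory_order_acquire) == 20)
    {
        // Variant A: the (successful) CAS read 10 from the release store, so that
        // store synchronizes-with this load; reading payload here is safe.
        // Variant B: the pure store is not part of the writer's release
        // sequence, so reading payload here would be a data race.
    }
}

int main() // run variant A: writer + RMW + reader
{
    std::thread w(writer), r1(rmw_thread), r2(reader);
    w.join(); r1.join(); r2.join();
}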


Acquire/release basics

An acquire-load "synchronizes with" a release-store if it sees the value stored by that release-store, creating a happens-before relationship. See also https://preshing.com/20120913/acquire-and-release-semantics/

// writing thread
  payload = stuff;            // non-atomic, e.g. a largish struct something
  data_ready.store(1, std::memory_order_release);  // atomic release
// A reading thread
  if (data_ready.load(std::memory_order_acquire)) {
     // now safe to read payload
  }
  // else do something else, maybe try again later.  Or perhaps spin wait.
Peter Cordes