
Suppose there's a multithreaded application where a single thread inserts elements into a circular linked list while a number of worker threads walk through this list, carrying out the actual processing.

Say the node type is similar to this:

#include <atomic>

struct Node
{
    // ...
    std::atomic< Node * > next;
};

And in the method that performs the insertion, there's the following snippet:

auto newNode = new Node( ); // (A)

newNode->next.store( previousNode->next.load( std::memory_order_relaxed ),
    std::memory_order_relaxed ); // (B)

previousNode->next.store( newNode , std::memory_order_relaxed ); // (C)

where previousNode has already been determined to be the node that will precede newNode in the list.

The worker threads walk through the list in a manner similar to this:

// ...
while ( true )
{
    ProcessNode( * currentNode );
    currentNode = currentNode->next.load( std::memory_order_relaxed );
}

It's no problem if a node just created in line (A) is skipped by the worker threads until its previous node is updated in (C).

Is there any issue with such a design? I'm concerned that, at the assembly level, the code generated for (B) and (C) could be something like this:

LOAD( R1 , previousNode->next ) // (1) loads previousNode->next into register R1
WRITE( newNode->next , R1 ) // (2) writes R1 to newNode->next
WRITE( previousNode->next , newNode ) // (3) writes newNode to previousNode->next

And then some optimization could reorder it to:

LOAD( R1 , previousNode->next ) // (1)
WRITE( previousNode->next , newNode ) // (3)
WRITE( newNode->next , R1 ) // (2)

and that could break a worker thread, for it could then reach newNode before its next member is initialized.

Is this a legitimate concern? What does the standard say about this?

Tarc
  • I think `memory_order_relaxed` in the first `store` is fine if you add acquire/release in the rest. I mean, you can make it relaxed, as the other memory fences will ensure it is properly visible – bartop Feb 15 '18 at 21:56
  • I agree, @bartop. My actual code uses a store with release memory order in (C) and a load with acquire order in the worker loop. I asked about the relaxed order to know if it makes sense at all (which it seems it doesn't). – Tarc Feb 15 '18 at 22:00
  • @bartop I do not think so. If you do not have release on one side of the transaction, your acquire on the other side is useless. – SergeyA Feb 15 '18 at 22:01

2 Answers


Yes, it is a legitimate concern. Relaxed memory order does not enforce any ordering (fences); it just guarantees atomicity of the operation. The code could be reordered by the compiler or, to similar effect, by the CPU itself, or by the caching the CPU employs.
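
To see what "atomicity without ordering" means in practice, here is a minimal sketch (the classic message-passing example; the producer/consumer names are illustrative, not from the question). With relaxed ordering, a reader can observe the flag before the payload:

#include <atomic>

std::atomic< int > payload{ 0 };
std::atomic< bool > ready{ false };

void producer( )
{
    payload.store( 42, std::memory_order_relaxed );
    ready.store( true, std::memory_order_relaxed ); // may become visible before payload!
}

int consumer( )
{
    while ( !ready.load( std::memory_order_relaxed ) ) { }
    // Even after seeing ready == true, this load may legally return 0:
    // relaxed operations are atomic but impose no ordering between them.
    return payload.load( std::memory_order_relaxed );
}

Making the store to ready a release and the load in the loop an acquire rules that stale read out.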

Was there any practical reason for you to choose the relaxed order? I have actually yet to see a legitimate use for it.

SergeyA
  • @bartop, maybe... But I am not convinced. What if another thread tries to destruct the object because it hasn't yet seen the increased reference counter? – SergeyA Feb 15 '18 at 21:43
  • AFAIK on x86 there is no difference between relaxed and release/acquire consistency – Domso Feb 15 '18 at 21:48
  • @SergeyA You can implement double checking at destruction - relaxed when decreasing and acquire/release when detecting zero. – bartop Feb 15 '18 at 21:49
  • @Domso Assuming a single architecture (either software or hardware) is dangerous and may lead to mistakes. For example, MSVC `volatile` provides memory fences, but we should not rely on it. EDIT: and it has even caused many people to believe that volatile can be used for inter-thread communication – bartop Feb 15 '18 at 21:51
  • @bartop won't help. If one thread decreased without release, your acquire on the other thread will not do anything useful. – SergeyA Feb 15 '18 at 22:01
  • @SergeyA I found a bit of source here: https://stackoverflow.com/questions/27631173/how-can-memory-order-relaxed-work-for-incrementing-atomic-reference-counts-in-sm. In short, increasing can be done in a relaxed manner, as deleting is not a concern in this situation. Decreasing is indeed more complicated (see the sketch after these comments) – bartop Feb 15 '18 at 22:04
  • @bartop I know Sutter's position. I have also heard him add 'probably' after he says that increasing a reference count can be done in relaxed order. I also understand it is damned too complicated to get correct (btw, not only for developers, but for compiler writers and chip manufacturers as well!). My position is not to play with this for the dubious benefit. – SergeyA Feb 15 '18 at 22:08
  • @SergeyA Ok, I agree more than less. Without a mathematical proof this argument is quite futile – bartop Feb 15 '18 at 22:09
  • @SergeyA `std::memory_order_acq_rel` is not allowed for a single store operation, hence your code would be UB. See https://godbolt.org/g/XLVZKu. Of course you should not rely on a specific architecture, but it heavily reduces the usefulness of std::memory_order_relaxed when x86 does not care anyway – Domso Feb 15 '18 at 22:11
  • @Domso, you are correct. I didn't realize it is UB. – SergeyA Feb 15 '18 at 22:30
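
To make the reference-counting discussion above concrete, here is a sketch of the pattern described in the linked question (names are illustrative, not from this thread): relaxed increment, release decrement, and an acquire fence before destruction.

#include <atomic>

struct RefCounted
{
    std::atomic< int > refs{ 1 };
    // ... payload ...
};

void addRef( RefCounted * p )
{
    // Relaxed is enough: the caller already holds a reference, so the
    // object cannot be destroyed concurrently with this increment.
    p->refs.fetch_add( 1, std::memory_order_relaxed );
}

void release( RefCounted * p )
{
    // Release on the decrement, so that this thread's writes to *p
    // happen-before the destruction done by whichever thread hits zero.
    if ( p->refs.fetch_sub( 1, std::memory_order_release ) == 1 )
    {
        // The acquire pairs with the release decrements of the other threads.
        std::atomic_thread_fence( std::memory_order_acquire );
        delete p;
    }
}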

You have a legitimate concern.

Exactly as you say, a compiler could legally re-order your stores to this:

auto temp = previousNode->next.load( std::memory_order_relaxed );
previousNode->next.store( newNode, std::memory_order_relaxed ); // (C)
newNode->next.store( temp, std::memory_order_relaxed ); // (B)

You have now inserted your node before its values were initialized! Whether or not this happens is the wrong question. This is perfectly legal for the compiler to do.

Here's an example of how a weakly-ordered CPU could do the same thing to your relaxed stores:

auto temp = previousNode->next.load( std::memory_order_relaxed );
// previousNode->next is now hot in the cache

newNode->next.store( temp, std::memory_order_relaxed ); // (B)
// Suppose newNode is in the cache, but newNode->next is a cache miss

previousNode->next.store( newNode, std::memory_order_relaxed ); // (C)
// While waiting for the cache update of newNode->next, the CPU gets other work done:
// it writes newNode into previousNode->next, which was pulled into the cache by the 1st line.

This won't happen on x86, because x86 has total store order. On ARM, though... you have once again inserted your node before its values were initialized.

Better stick with acquire/release.

auto temp = previousNode->next.load( std::memory_order_acquire );
newNode->next.store( temp, std::memory_order_release ); // (B)
previousNode->next.store( newNode, std::memory_order_release ); // (C)

The relevant release is the one on line (C), because it keeps line (B) from being moved after it. Line (B) has a data dependency on line 1, so realistically it's not going to be re-ordered. But use acquire for line 1 and release for line (B) anyway: it's semantically correct, it won't hurt anything, and it might prevent some obscure system or future optimization from breaking your code.

Humphrey Winnebago
  • Isn't `std::memory_order_release` in (C) and `std::memory_order_acquire` inside the loop of the worker threads (`currentNode = currentNode->next.load( std::memory_order_acquire )`) enough? When some worker thread sees a change in some `next` pointer, it's going to be through an acquire load, so (A) and (B), which came before the release store responsible for that change, would have to be visible to such a thread too. – Tarc Feb 16 '18 at 08:51
  • I forgot that you only have 1 writing thread, so line 1 doesn't have to sync with other threads. You are correct that the other threads will see the writer's stores. But I still stand by what I said: use acquire for line 1 and release for line (B), because it's semantically correct, it won't hurt anything, and it might prevent some obscure system or future optimization from breaking your code. – Humphrey Winnebago Feb 16 '18 at 09:20
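
Putting the comment thread together, here is a minimal sketch of the design it converges on (a sketch under the question's single-writer assumption; insertAfter and walk are illustrative names, and ProcessNode is assumed to be defined elsewhere): relaxed for (A)/(B), a release store in (C), and acquire loads in the workers.

#include <atomic>

struct Node
{
    // ... payload ...
    std::atomic< Node * > next{ nullptr };
};

void ProcessNode( Node & ); // assumed defined elsewhere

// Single writer thread.
void insertAfter( Node * previousNode )
{
    auto newNode = new Node( ); // (A)

    // (B) Relaxed is fine here: no reader can reach newNode yet, and since
    // only this thread modifies the list, the load can be relaxed too.
    newNode->next.store( previousNode->next.load( std::memory_order_relaxed ),
        std::memory_order_relaxed );

    // (C) The release publishes newNode together with everything stored into it.
    previousNode->next.store( newNode, std::memory_order_release );
}

// Worker threads.
void walk( Node * currentNode )
{
    while ( true )
    {
        ProcessNode( *currentNode );
        // The acquire pairs with the release in (C): a worker that sees
        // newNode here also sees its initialized next member.
        currentNode = currentNode->next.load( std::memory_order_acquire );
    }
}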