std::atomic - behaviour of relaxed ordering

Question

Can the following call to print result in outputting stale/unintended values?

std::mutex g;
std::atomic<int> seq;
int g_s = 0;
int i = 0, j = 0, k = 0; // ignore the fact that these could easily be made atomic

// Thread 1
void do_work() // seldom called
{
    // avoid over
    std::lock_guard<std::mutex> lock{g};
    i++; 
    j++;
    k++;
    seq.fetch_add(1, std::memory_order_relaxed);
}

// Thread 2
void consume_work() // spinning
{
    const auto s = g_s;
    // avoid overhead of constantly acquiring lock
    g_s = seq.load(std::memory_order_relaxed);
    if (s != g_s)
    { 
       // no lock guard
       print(i, j, k);
    }
}

Peter Cordes · Answer 1 · 2022-01-27T23:42:42.720

TL;DR: this is super broken; use a Seq Lock instead. Or RCU if your data structure is bigger.

Yes, you have data-race UB, and in practice stale values are likely; so are inconsistent values (from different increments). ISO C++ has nothing to say about what will happen, so the outcome depends on how the code happens to compile for some real machine, and on interrupts / context switches in the reader that land in the middle of reading these variables. For example, if the reader sleeps for any reason between reading i and j, you could miss many updates, or at least get a j that doesn't match your i.

Relaxed `seq` with writer+reader using `lock_guard`

I'm assuming the writer would look the same, so the atomic RMW increment is inside the critical section.
I'm picturing the reader checking seq like it is now, and only taking a lock after that, inside the block that runs print.

Even if you did use lock_guard to make sure the reader got a consistent snapshot of all three variables (something you couldn't get from making each of them separately atomic), I'm not sure relaxed would be sufficient in theory. It might be in practice on most real implementations for real machines (where compilers have to assume there might be a reader that synchronizes a certain way, even if there isn't in practice). I'd use at least release/acquire for seq, if I was going to take a lock in the reader.

Taking a mutex is an acquire operation, same as a std::memory_order_acquire load on the mutex object. A relaxed increment inside a critical section can't become visible to other threads until after the writer has taken the lock.

But in the reader, with if( xyz != seq.load(relaxed) ) { take_lock; ... }, the load is not guaranteed to "happen before" taking the lock. In practice on many ISAs it will, especially x86 where all atomic RMWs are full memory barriers. But in ISO C++, and maybe some real implementations, it's possible for the relaxed load to reorder into the reader's critical section. Of course, ISO C++ doesn't define things in terms of "reordering", only in terms of synchronizes-with relationships and which values loads are allowed to see.

(This reordering may not be fully plausible; it would mean the read side would have to actually take the lock based on branch prediction / speculation on the load result. Maybe with lock elision like x86 did with transactional memory, except without x86's strong memory ordering?)

Anyway, it's pretty hairy to reason about, and release / acquire ops are quite cheap on most CPUs. If you expected acquire to be expensive, and the check to often be false, you could check again with an acquire load, or put an acquire fence inside the if so it doesn't run on the no-new-work path.
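To make the "acquire fence inside the if" idea concrete, here is a minimal sketch of the locked-reader variant. The `Snapshot` struct and the `last_seen` / `out` parameters are names I made up for illustration; they are not in the question. I've also used release on the writer's increment, as suggested above, and I re-read `seq` under the lock so the recorded sequence number matches the snapshot. Treat it as one plausible arrangement, not a proof that this is the minimum ordering required.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

std::mutex g;
std::atomic<int> seq{0};
int i = 0, j = 0, k = 0;

struct Snapshot { int i, j, k; };   // hypothetical helper type

// Writer: seldom called. The increment stays inside the critical section.
void do_work() {
    std::lock_guard<std::mutex> lock{g};
    ++i; ++j; ++k;
    seq.fetch_add(1, std::memory_order_release);
}

// Reader: returns true and fills `out` if seq changed since `last_seen`.
bool consume_work(int& last_seen, Snapshot& out) {
    // Cheap relaxed check on the hot (no-new-work) path.
    if (seq.load(std::memory_order_relaxed) == last_seen)
        return false;

    // Only paid when there is new work: order the load above before
    // everything that follows, including taking the lock.
    std::atomic_thread_fence(std::memory_order_acquire);

    std::lock_guard<std::mutex> lock{g};
    out = {i, j, k};                                  // consistent snapshot
    // The writer only increments seq while holding g, so this value
    // corresponds exactly to the snapshot just taken.
    last_seen = seq.load(std::memory_order_relaxed);
    return true;
}
```

The mutex is what makes the reads of i, j, k race-free here; the fence only addresses the question of whether the relaxed check is ordered before the lock acquisition.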

Use a Seq Lock

Your problem is better solved by using your sequence counter as part of a Seq Lock, so neither reader nor writer needs a mutex. Summary: in the writer, increment the counter before writing, then touch the payload, then increment again. In the reader, read the sequence number, read i, j, and k into local temporaries, then check the sequence number again to make sure it's the same as before and an even number, with appropriate memory barriers. See the Wikipedia article and/or the link below for actual details, but the real change from what you have now is that the sequence number has to increment by 2 per update. If you can't handle that, use a separate counter for the actual lock, with seq as part of the payload.
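A minimal single-writer sketch of that summary might look like the following. The class and type names (`SeqLock`, `Payload`) are mine, and I've deliberately swapped in one technique: the payload members are `std::atomic<int>` accessed with relaxed ordering, so the example itself has no data-race UB. That is a common way to write a seqlock in portable C++, not necessarily how you'd do it for a larger payload.

```cpp
#include <atomic>
#include <cassert>

struct Payload { int i, j, k; };    // hypothetical snapshot type

class SeqLock {
    std::atomic<unsigned> seq_{0};          // even: stable, odd: write in progress
    std::atomic<int> i_{0}, j_{0}, k_{0};   // payload, relaxed atomics to avoid UB

public:
    // Must only ever be called from ONE writer thread.
    void write(Payload p) {
        const unsigned s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);         // make it odd
        // A reader that sees any payload store below must also see the odd value.
        std::atomic_thread_fence(std::memory_order_release);
        i_.store(p.i, std::memory_order_relaxed);
        j_.store(p.j, std::memory_order_relaxed);
        k_.store(p.k, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);         // even again: publish
    }

    // Any number of readers; retries until it gets a consistent snapshot.
    Payload read() const {
        for (;;) {
            const unsigned s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1u) continue;                            // writer is mid-update
            Payload p{ i_.load(std::memory_order_relaxed),
                       j_.load(std::memory_order_relaxed),
                       k_.load(std::memory_order_relaxed) };
            // Keep the payload loads before the second sequence check.
            std::atomic_thread_fence(std::memory_order_acquire);
            const unsigned s2 = seq_.load(std::memory_order_relaxed);
            if (s1 == s2) return p;                           // no write overlapped
        }
    }
};
```

A reader that only wants to act on new data could additionally compare the sequence number against the last one it saw, the way g_s is used in the question.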

If you don't want to use a mutex in the reader, using one in the writer only helps in terms of implementation-detail side-effects, like making sure stores to memory actually happen, not keeping i in a register across calls if do_work inlines into some caller.

BTW, updating seq doesn't need to be an atomic RMW if there's only one writer. You can relaxed load and separately store an incremented temporary (with release semantics).
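As a small illustration of that point, assuming exactly one thread ever writes `seq` (the function name is made up for this sketch):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> seq{0};

// Only correct if a single thread ever calls this: the load and the store
// are two separate operations, so concurrent callers could lose increments.
void bump_seq_single_writer() {
    const int s = seq.load(std::memory_order_relaxed);   // we are the only writer
    seq.store(s + 1, std::memory_order_release);         // publish earlier writes
}
```

On many targets this avoids the cost of an atomic read-modify-write instruction, but whether that matters for a seldom-called writer is something to measure rather than assume.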

A Seq Lock is good for cheap reads and occasional writes that make the reader retry. The Q&A "Implementing 64 bit atomic counter with 32 bit atomics" shows appropriate fencing.

It relies on non-atomic reads that may have a data race, but not using the result if your sequence counter detects tearing. C++ doesn't define the behaviour in that case, but it works in practice on real implementations. (C++ is mostly keeping its options open in case of hardware race detection, which normal CPUs don't do.)

If you have multiple writers, you'd still use a normal lock to give mutual exclusion between them, or use the sequence counter as a spinlock: a writer acquires it by making the count odd. Otherwise you just need the sequence counter.
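Here is a rough sketch of the "sequence counter as a spinlock" option for multiple writers. Names (`MultiWriterSeqLock`, `Triple`, `increment_all`) are hypothetical, the payload is again relaxed atomics to keep the example UB-free, and the spin loop has no backoff, so take it as an outline of the idea rather than something tuned.

```cpp
#include <atomic>
#include <cassert>

struct Triple { int i, j, k; };     // hypothetical snapshot type

class MultiWriterSeqLock {
    std::atomic<unsigned> seq_{0};          // odd means "a writer holds the lock"
    std::atomic<int> i_{0}, j_{0}, k_{0};

public:
    void increment_all() {
        unsigned s = seq_.load(std::memory_order_relaxed);
        for (;;) {
            if (s & 1u) {                   // another writer is active: spin
                s = seq_.load(std::memory_order_relaxed);
                continue;
            }
            // Try even -> odd. Acquire so we see the previous writer's payload.
            // On failure, s is refreshed with the current value and we re-check.
            if (seq_.compare_exchange_weak(s, s + 1,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed))
                break;
        }
        std::atomic_thread_fence(std::memory_order_release);
        // We hold the writer lock, so plain load-then-store increments are fine.
        i_.store(i_.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
        j_.store(j_.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
        k_.store(k_.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);   // unlock and publish
    }

    Triple read() const {
        for (;;) {
            const unsigned s1 = seq_.load(std::memory_order_acquire);
            if (s1 & 1u) continue;
            Triple t{ i_.load(std::memory_order_relaxed),
                      j_.load(std::memory_order_relaxed),
                      k_.load(std::memory_order_relaxed) };
            std::atomic_thread_fence(std::memory_order_acquire);
            if (seq_.load(std::memory_order_relaxed) == s1) return t;
        }
    }
};
```

If writers can be descheduled while holding the odd count, a normal mutex between writers is probably the friendlier choice, since it lets waiting writers sleep instead of spin.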

Your global g_s is just to track the latest sequence number the reader has seen? Storing it next to the data defeats some of the purpose/benefit, since it means the reader is writing the same cache line as the writer, assuming that variables declared near each other all end up together. Consider making it static inside the function, or separate it with other stuff, or with padding, like alignas(64) or 128. (That wouldn't guarantee that a compiler doesn't put it right before the other vars, though; a struct would let you control the layout of all of them. With enough alignment, you can make sure they're not in the same aligned pair of cache lines.)
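For the layout point, a struct along these lines is one way to keep the reader's bookkeeping off the writer's cache line. The 64-byte line size and all the names (`kCacheLine`, `WriterSide`, `ReaderSide`, `Shared`) are assumptions for the sketch; use 128 if you also want to stay clear of adjacent-line prefetching, or C++17's `std::hardware_destructive_interference_size` where your standard library provides it.

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;   // assumed line size, not queried

// Everything the writer modifies lives together on its own line(s).
struct alignas(kCacheLine) WriterSide {
    std::atomic<unsigned> seq{0};
    int i = 0, j = 0, k = 0;
};

// The reader's private "last sequence number seen" gets a separate line,
// so the reader never writes a line the writer also writes.
struct alignas(kCacheLine) ReaderSide {
    unsigned last_seen = 0;
};

// Putting both in one struct fixes their relative layout, which separate
// globals would not guarantee.
struct Shared {
    WriterSide w;
    ReaderSide r;
};

static_assert(sizeof(WriterSide) % kCacheLine == 0,
              "writer data is padded out to whole cache lines");
static_assert(alignof(Shared) == kCacheLine,
              "the combined struct starts on a cache-line boundary");
```

If the reader is the only user of `last_seen`, a function-local static or a plain local in the reader's loop does the same job with less ceremony.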

score 1 · Answer 2 · answered Jan 27 '22 at 12:13

Yes, it can.

First of all, the lock guard does not have any effect on your code. A lock has to be used by at least two threads to have any effect.

Thread 2 can read at any moment. It can read an incremented i and a not-yet-incremented j and k. In theory, it can even read a weird partial value obtained by reading in between updates to the individual bytes that make up i. For example, an increment from 0xFF to 0x100 could result in reading 0x1FF or 0x0. That won't happen on x86, where these updates happen to be atomic.

score 1 · Accepted Answer · answered Jan 27 '22 at 12:17

Even ignoring the staleness, this causes a data race and UB.

Thread 2 can read i, j, k while thread 1 is modifying them, because you don't synchronize access to those variables. If thread 2 doesn't lock the mutex g, there's no point in locking it in thread 1.

answered Jan 27 '22 at 12:17

HolyBlackCat


Assuming we also guard the print statement, does the memory order of the atomic load/store need to be changed to ensure i, j and k are updated before the sequence increment? – wubzorz Jan 27 '22 at 12:41
@wubzorz It kinda looks safe, but relaxed atomics are hard to reason about, so I'm not 100% sure. – HolyBlackCat Jan 27 '22 at 12:56
I think relaxed is likely safe in practice on most machines, but it gets pretty hairy to sort out for formal ISO C++ which is really weak; taking a lock is truly just an acquire operation, and nothing has to exist if other threads aren't looking in UB-free ways. I said some about that in an answer, but didn't look *too* hard because a seqlock is a better plan. – Peter Cordes Jan 27 '22 at 23:44
