
I recently wrote sample code like the code below (the real code is much more complicated). After watching Hans Boehm's CppCon 2016 talk on atomics, I am a bit anxious about whether my code works.

produce is called by a single producer thread, and consume is called by multiple consumer threads. The producer only publishes data under even sequence numbers (2, 4, 6, 8, ...), but first sets the sequence to the preceding odd number (1, 3, 5, 7, ...) before updating the data, to indicate that the data might be dirty. Consumers try to fetch data at the same even sequence numbers (2, 4, 6, ...).

The consumer double-checks the sequence number after reading to make sure the data is good (i.e. not updated by the producer during the read).

I think my code works fine on x86_64 (my target platform) because x86_64 does not reorder stores with other stores, or loads with other loads or stores, but I suspect it's wrong on other platforms.

Am I correct that the data assignment (in produce) can be moved above the store(n-1), so a consumer reads corrupted data but the t == t2 check still succeeds?

struct S 
{
    atomic<int64_t> seq;
    // data members of primitive type int, double etc    
    ...
};

S s;

void produce(int64_t n, ...) // ... for above data members
{
    s.seq.store(n-1, std::memory_order_release); // indicates it's working on data members

    // assign data members of s
    ...

    s.seq.store(n, std::memory_order_release); // complete updating
}

bool consume(int64_t n, ...) // ... for interested fields passed as reference
{
    auto t = s.seq.load(std::memory_order_acquire);

    if (t == n)
    {
        // read fields
        ...

        auto t2 = s.seq.load(std::memory_order_acquire);
        if (t == t2)
            return true;
    }        

    return false;
}
Peter Cordes
Derek
  • A note of caution: lock-free structures like this are very hard to implement properly. There's a number of pitfalls: http://blog.memsql.com/common-pitfalls-in-writing-lock-free-algorithms/ – Violet Giraffe Oct 10 '16 at 11:02
  • http://en.cppreference.com/w/cpp/atomic/memory_order#Release-Acquire_ordering deals exactly with this question. Does it answer it for you? – Michael Foukarakis Oct 10 '16 at 15:26
  • Michael, the cppref link is very different to my question. The problem I'm trying to resolve is that the producer updates the same 'S s;' multiple times, and I want to make sure consumers read the data of the specified sequence number if producer has not updated the data to a higher seq no. – Derek Oct 10 '16 at 15:45
  • @VioletGiraffe: this synchronization pattern is already well-known: it's a single-writer variant of a [Seqlock](https://en.wikipedia.org/wiki/Seqlock). – Peter Cordes Oct 12 '16 at 05:49

1 Answer


Compile-time reordering can still bite you when targeting x86, because the compiler optimizes to preserve the behaviour of the program on the C++ abstract machine, not any stronger architecture-dependent behaviour. Since we want to avoid memory_order_seq_cst, reordering is allowed.

Yes, your stores can reorder as you suggest. Your loads can also reorder with the t2 load, since an acquire-load is only a one-way barrier. It would be legal for a compiler to optimize away the t2 check entirely. If a reordering is possible, the compiler is allowed to decide that it's what always happens and apply the as-if rule to make more efficient code. (Current compilers usually don't, but this is definitely allowed by the current standard as written. See the conclusion of a discussion about this, with links to standards proposals.)
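To make the as-if point concrete, here is a sketch using a hypothetical two-member version of S (the member names and `consume_as_if` are invented for illustration). If the compiler merges the two loads of seq — legal, since nothing forces them to observe different values — the re-check always succeeds and can be dropped:

```cpp
#include <atomic>
#include <cstdint>

struct S1 {
    std::atomic<int64_t> seq{0};
    int a = 0;                          // stands in for the real data members
};
S1 s1;

// The pattern as written in the question (with the typo fixed).
bool consume_original(int64_t n, int& out)
{
    auto t = s1.seq.load(std::memory_order_acquire);
    if (t == n) {
        out = s1.a;                     // read fields
        auto t2 = s1.seq.load(std::memory_order_acquire);
        if (t == t2)
            return true;
    }
    return false;
}

// What a compiler could legally turn it into under the as-if rule:
// both loads of seq are assumed to return the same value, so the
// t2 re-check is gone and the data read is unguarded.
bool consume_as_if(int64_t n, int& out)
{
    if (s1.seq.load(std::memory_order_acquire) != n)
        return false;
    out = s1.a;
    return true;
}
```

Single-threaded the two functions behave identically; the difference only matters when a writer races with the reader.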

Your options for preventing reordering are:

  • Make all the data-member stores/loads atomic with release and acquire semantics. (The acquire-load of the last data member would keep the t2 load from being done first.)
  • Use barriers (aka fences) to order all the non-atomic stores and non-atomic loads as a group.

    As Jeff Preshing explains, a mo_release fence isn't the same thing as a mo_release store, and is the kind of bidirectional barrier we need. std::atomic just recycles the std::mo_ names instead of giving different names for the fences.

    (BTW, the non-atomic stores/loads should really be atomic with mo_relaxed, because it's technically Undefined Behaviour to read them at all while they might be in the process of being rewritten, even if you decide not to look at what you read.)
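For reference, option 1 (everything atomic, no fences) might look like this. This is a minimal sketch assuming just two data members with invented names; your real struct has more fields:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of option 1: every data member is atomic, so no fences needed.
struct S2 {
    std::atomic<int64_t> seq{0};
    std::atomic<int>     a{0};          // hypothetical data members
    std::atomic<double>  b{0.0};
};
S2 s2;

void produce2(int64_t n, int a, double b)
{
    s2.seq.store(n - 1, std::memory_order_relaxed); // mark region dirty;
    s2.a.store(a, std::memory_order_release);       // this release keeps the
    s2.b.store(b, std::memory_order_release);       // seq store ahead of it
    s2.seq.store(n, std::memory_order_release);     // publish
}

bool consume2(int64_t n, int& a, double& b)
{
    if (s2.seq.load(std::memory_order_acquire) != n)
        return false;
    a = s2.a.load(std::memory_order_acquire);
    b = s2.b.load(std::memory_order_acquire);       // acquire on the last data
                                                    // load keeps the re-check after it
    return s2.seq.load(std::memory_order_relaxed) == n;
}
```

The downside is that every data access carries acquire/release semantics, which costs extra barrier instructions on weakly-ordered architectures; the fence-based version below pays that cost once per group instead.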


void produce(int64_t n, ...) // ... for above data members
{
    /*********** changed lines ************/
    std::atomic_signal_fence(std::memory_order_release);  // compiler-barrier to make sure the compiler does the seq store as late as possible (to give the reader more time with it valid).
    s.seq.store(n-1, std::memory_order_relaxed);          // changed from release
    std::atomic_thread_fence(std::memory_order_release);  // StoreStore barrier prevents reordering of the above store with any below stores.  (It's also a LoadStore barrier)
    /*********** end of changes ***********/

    // assign data members of s
    ...

    // release semantics prevent any preceding stores from being delayed past here
    s.seq.store(n, std::memory_order_release); // complete updating
}



bool consume(int64_t n, ...) // ... for interested fields passed as reference
{
    if (n == s.seq.load(std::memory_order_acquire))
    {
        // acquire semantics prevent any reordering with following loads

        // read fields
        ...

    /*********** changed lines ************/
        std::atomic_thread_fence(std::memory_order_acquire);  // LoadLoad barrier (and LoadStore)
        auto t2 = s.seq.load(std::memory_order_relaxed);    // relaxed: it's ordered by the fence and doesn't need anything extra
        // std::atomic_signal_fence(std::memory_order_acquire);  // compiler barrier: probably not useful on the load side.
    /*********** end of changes ***********/
        if (n == t2)
            return true;
    }

    return false;
}

Notice the extra compiler-barrier (signal_fence only affects compile-time reordering) to make sure the compiler doesn't merge the second store from one iteration with the first store from the next iteration, if this is run in a loop. Or more generally, to make sure the store that invalidates the region is done as late as possible, to reduce false positives. (Probably not necessary with real compilers, and with plenty of code between calls to this function. But signal_fence never compiles to any instructions, and seems like a better choice than keeping the first store as mo_release. On architectures where a release-store and a thread-fence each compile to a separate barrier instruction, using a relaxed store avoids emitting two barrier instructions back to back.)

I was also worried about the possibility of the first store reordering with the release-store from the previous iteration. But I don't think that can ever happen, because both stores are to the same address. (At compile-time, maybe the standard allows a hostile compiler to do it, but any sane compiler would instead just not do one of the stores at all if it thought one could pass the other.) At run-time on a weakly-ordered architecture, I'm not sure if stores to the same address can ever become globally visible out of order. This shouldn't be a problem in real life since the producer presumably isn't called back-to-back.


BTW, the synchronization technique you're using is a Seqlock, but with only a single writer. You only have the sequence part, not the lock part to synchronize separate writers. In a multi-writer version, writers would take the lock before reading/writing the sequence numbers and data. (And instead of taking the seq number as a function arg, you'd read it from the lock.)
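A multi-writer variant could look roughly like this (a sketch with invented names, not tested under contention; the payload is left non-atomic for brevity, with the same UB caveat as above):

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Sketch of a multi-writer seqlock: writers serialize on a mutex and
// derive the sequence number from the current value instead of taking
// it as an argument.
struct SeqLocked {
    std::mutex write_lock;              // serializes writers only
    std::atomic<int64_t> seq{0};
    int    a = 0;                       // plain payload for brevity
    double b = 0.0;
};
SeqLocked sl;

void write(int a, double b)
{
    std::lock_guard<std::mutex> g(sl.write_lock);
    int64_t s0 = sl.seq.load(std::memory_order_relaxed);
    sl.seq.store(s0 + 1, std::memory_order_relaxed);     // odd: in progress
    std::atomic_thread_fence(std::memory_order_release); // StoreStore barrier
    sl.a = a;
    sl.b = b;
    sl.seq.store(s0 + 2, std::memory_order_release);     // even: published
}

bool read(int& a, double& b)
{
    int64_t s0 = sl.seq.load(std::memory_order_acquire);
    if (s0 & 1)
        return false;                                    // writer in progress
    a = sl.a;
    b = sl.b;
    std::atomic_thread_fence(std::memory_order_acquire); // LoadLoad barrier
    return sl.seq.load(std::memory_order_relaxed) == s0; // unchanged: data valid
}
```

Readers never block writers; a reader that loses the race simply retries (or returns false, as here).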

C++ standards-discussion paper N4455 (about compiler-optimizations of atomics, see the second half of my answer on Can num++ be atomic for 'int num'?) uses it as an example.

Instead of a StoreStore fence, they use release-stores for the data items in the writer. (With atomic data items, which, as I mentioned, is required for this to really be correct.)

void writer(T d1, T d2) {
  unsigned seq0 = seq.load(std::memory_order_relaxed);  // note that they read the current value because it's presumably a multiple-writers implementation.
  seq.store(seq0 + 1, std::memory_order_relaxed);
  data1.store(d1, std::memory_order_release);
  data2.store(d2, std::memory_order_release);
  seq.store(seq0 + 2, std::memory_order_release);
}

They talk about letting the reader's second load of the sequence number potentially reorder with later operations, if it's profitable for the compiler to do so, and using t2 = seq.fetch_add(0, std::memory_order_release) in the reader as a potential way to get a load with release semantics. With current compilers, I would not recommend that; you're likely to get a locked operation on x86 where the way I suggested above doesn't have any (or any actual barrier instructions, because only full-barrier seq_cst fences need an instruction on x86).
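To make that fragment self-contained, here is a sketch with concrete declarations (T = int) plus a matching reader of my own — N4455 doesn't spell one out — using the acquire approach rather than the fetch_add trick, which I wouldn't recommend with current compilers:

```cpp
#include <atomic>

// Concrete globals for the N4455-style writer above; types are my choice.
std::atomic<unsigned> seq{0};
std::atomic<int> data1{0}, data2{0};

void writer(int d1, int d2) {
  unsigned seq0 = seq.load(std::memory_order_relaxed); // multi-writer style:
  seq.store(seq0 + 1, std::memory_order_relaxed);      // odd = in progress
  data1.store(d1, std::memory_order_release);
  data2.store(d2, std::memory_order_release);
  seq.store(seq0 + 2, std::memory_order_release);      // even = published
}

bool reader(int& d1, int& d2) {
  unsigned seq0 = seq.load(std::memory_order_acquire);
  if (seq0 & 1)
    return false;                                      // write in progress
  d1 = data1.load(std::memory_order_acquire);
  d2 = data2.load(std::memory_order_acquire);          // acquire keeps the
                                                       // re-check after this load
  return seq.load(std::memory_order_relaxed) == seq0;  // data consistent?
}
```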

Peter Cordes
  • @cmaster: You're thinking that a release-store is a two-way StoreStore barrier that orders all other stores. It isn't, [as Jeff Preshing explains in this article](http://preshing.com/20131125/acquire-and-release-fences-dont-work-the-way-youd-expect/), where he explains the distinction between release-store and stand-alone `atomic_thread_fence(mo_release)`, which is a barrier. It's a well-discussed fact that release-stores are only one-way barriers, and same for acquire-loads, and Jeff's article links to a few other articles / talks that mention this. – Peter Cordes Oct 13 '16 at 17:13
  • Ouch. Yes, of course, you are right. Sorry for that, I deleted my comment. Unfortunately, I cannot remove my downvote anymore :-( – cmaster - reinstate monica Oct 13 '16 at 18:31
  • @cmaster: I found an editing mistake to fix, and expanded on the first paragraph to maybe prevent future readers from starting with the same "this sounds wrong" impression you got. So now you can change your vote :) – Peter Cordes Oct 13 '16 at 18:45