3

I want to create a fast synchronization scheme to share a message queue between two threads. I'm considering a class with one vector, a mutex, and a dirty flag, like so.

#include <vector>
#include <mutex>

class SharedVector {
    std::mutex vector_lock;
    std::vector<int> vec;
    volatile bool dirty = false;

public:
    void write(int a) {
        std::unique_lock lock(vector_lock);
        vec.push_back(a);                   // Step 1
        dirty = true;                       // Step 2
    }

    void consume(std::vector<int>& container) {
        if (dirty) {
            std::unique_lock lock(vector_lock);
            container = std::move(vec);         // Invalidates original vec
            dirty = false;
        }
    } 
};

I'm under the assumption that this implementation is thread safe, even though the dirty flag is read outside of a lock, but I'm not sure. My reasoning is as follows:

  1. The dirty flag is guaranteed to only be written by one thread at a time, since all writes are inside an atomic section, behind a lock.

  2. Although the dirty flag may be set to true (Step 2 in write()) before the vector is appended to (Step 1) because of CPU instruction reordering, that should not be a problem, since the consume() thread must wait until the lock is released by the write() thread.

  3. The write() thread should invalidate other cores' cached copies of the dirty flag once it is written, according to cache coherency protocols on x86, which I believe to be either MESI or MOESI.

Would anybody be able to tell me if this implementation is actually thread safe, or let me know if there are any downfalls to this? I am avoiding setting the dirty flag as atomic, since that has caused a significant slowdown which I am trying to avoid.

Edit:

A few constraints to this problem:

  1. There will only be one reader thread and one writer thread.
  2. The consume() function must be nonblocking, since the reader thread has other work to do if there are no messages available.
  • Why not use std::deque, and have a look at std::condition_variable? It can help you block the consume call if no data is present. – Pepijn Kramer Aug 14 '23 at 17:39
  • It is UB, period. Perhaps try a relaxed atomic? – HolyBlackCat Aug 14 '23 at 17:43
  • @PepijnKramer Sorry, I should've been more clear. We are looking for a nonblocking solution. The reading thread does other work if no work is available on the vector. – user22389543 Aug 14 '23 at 17:49
  • So basically you're looking for a "lock-free queue". I have no experience with writing lock-free code, but I think Boost has one. https://www.boost.org/doc/libs/1_54_0/boost/lockfree/queue.hpp – Pepijn Kramer Aug 14 '23 at 18:26
  • The key here is that C++'s definition of "atomic" imposes requirements that the hardware's definition of "atomic" does not. Atomic in hardware is not the same as atomic in C++. – Pete Becker Aug 14 '23 at 18:51

3 Answers

4

I'm under the assumption that this implementation is thread safe, even though the dirty flag is read outside of a lock, but I'm not sure.

This assumption is false. Any time one thread writes to a memory location while another thread reads from it without synchronization, there is a data race, and the behavior is undefined. Your code reads dirty in if (dirty), and this may happen at the same time as dirty = true takes place.

volatile also doesn't fix this, though it used to be a quasi-atomic in some pre-C++11 implementations, and this might still be the case for legacy compatibility. See When to use volatile with multi threading?

Some people may tell you that at least on x86_64, this works anyway, because aligned loads and stores of up to 8 bytes are atomic there. However, you should not rely on this, and thread sanitizers will still flag it as a data race.

std::atomic_bool for inter-thread flags

For this code to be safe and portable, you must use std::atomic_bool instead, or the first read of dirty needs to be inside the critical section.

#include <vector>
#include <mutex>
#include <atomic>

class SharedVector {
    std::mutex vector_lock;
    std::vector<int> vec;
    // note: could be std::atomic<bool> (stylistic preference)
    std::atomic_bool dirty = false;

public:
    void write(int a) {
        std::unique_lock lock(vector_lock);
        vec.push_back(a);
        // note: this could probably use relaxed memory order (?), due to the
        //       lock() and unlock() of the mutex behaving like acquire/release,
        //       and preventing this operation from being pulled
        //       out the critical section.
        // note: on x86_64, this will compile to a simple mov, since x86_64 memory
        //       has acquire/release semantics by default
        dirty.store(true, std::memory_order::release);
    }

    // note: return bool to indicate success
    bool try_consume(std::vector<int>& container) {
        // note: this could probably use relaxed memory order too
        if (dirty.load(std::memory_order::acquire)) {
            std::unique_lock lock(vector_lock);
            container = std::move(vec);
            // note: clear() is unlikely to do anything, but returns the
            //       object into a known state, instead of keeping it in a
            //       moved-from state, which isn't specified in detail.
            vec.clear();
            // note: this can probably be relaxed too, due to the
            //       surrounding critical section
            dirty.store(false, std::memory_order::release);
            return true;
        }
        return false;
    } 
};

std::condition_variable for producer/consumer synchronization

In addition to (or instead of) a try_consume operation, it could make sense to use std::condition_variable to let the consuming thread block until the vector becomes dirty.

#include <vector>
#include <mutex>
#include <condition_variable>

class SharedVector {
    std::mutex vector_lock;
    std::vector<int> vec;
    std::condition_variable condition;

public:
    void produce(int a) {
        std::unique_lock lock(vector_lock);
        vec.push_back(a);
        // wake up one thread which is waiting for elements to become ready
        condition.notify_one();
    }

    void consume(std::vector<int>& container) {
        std::unique_lock lock(vector_lock);
        // wait until there are any elements available
        // waiting may temporarily unlock the mutex, but only while
        // this thread is idling
        condition.wait(lock, [&] {
            // this is thread-safe, because we hold the lock in this lambda
            return !vec.empty();
        });
        // this is still thread-safe; we hold the lock here
        container = std::move(vec);
        vec.clear(); // return vec to a known-empty state
    } 
};
Jan Schultke
  • I'll take a look at this and the other memory orderings. Thank you! – user22389543 Aug 14 '23 at 18:17
  • Your consumer still has a `dirty = false` instead of `dirty.store(false, std::memory_order_relaxed)`. On x86 that costs an extra full-barrier the OP was trying to avoid. And yes, you could be using `relaxed` in all of those to let them compile more efficiently for ISAs like ARM32 where any ordering stronger than relaxed needs a strong barrier. It's only advisory, and only modified inside the critical section, read outside. (x86 and AArch64 can do pretty efficient acq/rel, especially x86 and ARMv8.3 or whatever it is that added `ldapr`.) – Peter Cordes Aug 14 '23 at 18:21
  • @PeterCordes oops, that wasn't my intention. Fixed it now; `std::memory_order::release` (though I suspect it can be `relaxed` in every situation). – Jan Schultke Aug 14 '23 at 18:23
3

As @Pepijn Kramer points out, you might actually want to use a lock-free queue, e.g. Boost's implementation. https://www.boost.org/doc/libs/1_83_0/doc/html/lockfree.html

If working in batches works well for your use-case, though, this might be pretty reasonable in terms of total amount of contention for the amount of work that client threads receive.


Use std::atomic with relaxed to compile the same as volatile

volatile bool compiles to the same asm as std::atomic<bool> with std::memory_order_relaxed. volatile doesn't avoid data-race UB in the ISO C++ standard, so prefer std::atomic. See When to use volatile with multi threading? - Basically never.

volatile works because all real-world machines have coherent cache across the cores that C++ std::threads can run across, and because of how mainstream compilers implement what the standard says about volatile. But unless you're writing Linux kernel code, or some other project that rolls its own atomics with volatile and inline asm, just use std::atomic because you can write code that will compile as cheaply as volatile.

I am avoiding setting the dirty flag as atomic, since that has caused a significant slowdown which I am trying to avoid.

std::atomic isn't inherently slower, it's just about how you use it. (In debug builds, atomic can be slower because its template functions don't inline, but don't benchmark debug builds.)

Probably you used stronger memory_orders than necessary. Or atomic RMWs when there was only one writer so you could have done x.store(x.load(relaxed) + 1, release) or something instead of a seq_cst atomic RMW like x++.


Reusing a std::vector after std::move?

Note that multiple consumers can see the same dirty == true. The later ones to enter the critical section will std::move from an "invalid" vec, so hopefully you're prepared to detect that. (Or re-check dirty inside the critical section, promoting it from just an optimization measure to being part of maintaining correctness.)

Also, the next write will .push_back on an "invalid" vector that's been moved from.

You should probably vec.clear() after moving from it so there's a well-defined vector to push_back into. (See Reusing a moved container?). What you're doing might happen to work, but it doesn't seem to be guaranteed by the standard. vec.clear() is extremely cheap since it doesn't change the capacity, it just has to set the .end pointer to .data. (under the hood, std::vector is usually a struct of 3 pointers: start, end-of-size, end-of-capacity). Hrm, since .clear() is guaranteed to work on a vector that's been moved from, I guess that means std::move has to set .data to a null pointer, otherwise it would still be pointing at the space now owned by the move destination.

I just checked what GCC with libstdc++ and clang with libc++ do (Godbolt): move from a vector zeros all three members, so it's in a consistent state ready to use. And .clear() with libstdc++ checks if .end is already equal to .data, I guess to avoid dirtying the cache line if it was already empty? With libc++, adding .clear() after a move doesn't change the asm at all: the compiler knows the operation is redundant.

But I think it's not guaranteed by ISO C++ that it's safe to .push_back() on a vector that was the source operand of std::move(), unless you .clear() first.

Avoid move, just swap allocations of two vectors

Or probably better, vec.swap(container) and vec.clear(), so you're not deallocating / reallocating anywhere. One possible downside is that if your vector ever gets huge, that amount of memory will stay allocated, since there's no shrink-to-fit. (Unless maybe you shrink container inside the if, outside the unique_lock.)

Peter Cordes
  • Thanks for the suggestions. Fortunately, there is only one reader thread, so that was considered. – user22389543 Aug 14 '23 at 19:37
  • @user22389543: With a single consumer, a lock-free queue is probably even better, since the biggest benefit of your `dirty` flag is readers not contending with each other, with `dirty` able to stay in Shared state in each reader's L1d cache. Even reading the same cache line as the lock and `std::vector` do contend with a writer that needs exclusive ownership of those lines to write them, although out-of-order exec hides some of that. SPSC lock-free queues can be simpler than MPMC. (Although if you have multiple writers, IDK if there are any implementations specifically optimized for MPSC). – Peter Cordes Aug 14 '23 at 19:50
1

In your case, no: your read is accessing mutable shared memory in an unsynchronized manner. You'd have to make that access thread-safe by using something like std::atomic<bool>.

You could also take a look at std::condition_variable if letting the reading threads idle while there's no data is something you'd like.

Vivick
  • You forgot to mention that they can use `memory_order_relaxed` to get it to compile to about the same asm as `volatile`, so they get all of the same works-in-practice behaviour based on cache coherency, but also guaranteed by the ISO C++ standard, with no extra performance cost from `std::atomic` (assuming of course they compile with optimization enabled so it inlines). [When to use volatile with multi threading?](https://stackoverflow.com/a/58535118) – Peter Cordes Aug 14 '23 at 17:53
  • Wouldn't cache coherency guarantee that the status of the dirty flag is not invalid at a given moment? Before writing to the dirty flag, the writing thread should set all copies of that cache line as invalid to all threads that share it, thus forcing a read with the most up to date value. – user22389543 Aug 14 '23 at 17:53
  • @user22389543: The ISO C++ standard still says it's UB, despite the fact that real implementation compile it the same as a `relaxed` atomic. (Pre-C++11, `volatile` was the standard way to do lock-free atomic load/store, and existing compilers still de-facto support that. The Linux kernel even relies on GCC/Clang's behaviour with `volatile` plus inline asm.) But in this case, you're getting zero advantage from `volatile` over `std::atomic` with `relaxed` loads/stores, so just do that instead. – Peter Cordes Aug 14 '23 at 17:56
  • @PeterCordes I'll try the memory ordering solution, thank you. – user22389543 Aug 14 '23 at 18:18
  • Isn't volatile just a way to tell the compiler "do not optimize or re-order reads and writes to this variable"? – Vivick Aug 15 '23 at 06:53