
Let's assume we have a SyncQueue class with the following implementation:

#include <memory>
#include <mutex>
#include <queue>

class SyncQueue {
    std::mutex mtx;
    std::queue<std::shared_ptr<ComplexType>> m_q;
public:
    void push(const std::shared_ptr<ComplexType>& ptr) {
        std::lock_guard<std::mutex> lck(mtx);
        m_q.push(ptr);
    }
    std::shared_ptr<ComplexType> pop() {
        std::lock_guard<std::mutex> lck(mtx);
        // Note: assumes the queue is non-empty; front() on an empty queue is undefined.
        std::shared_ptr<ComplexType> rv(m_q.front());
        m_q.pop();
        return rv;
    }
};

Then we have this code that uses it:

SyncQueue q;

// Thread 1, Producer:
std::shared_ptr<ComplexType> ct(new ComplexType);
ct->foo = 3;
q.push(ct);

// Thread 2, Consumer (assume it runs after the Producer, as clarified in the comments):
std::shared_ptr<ComplexType> ct(q.pop());
std::cout << ct->foo << std::endl;

Am I guaranteed to get 3 when ct->foo is printed? mtx provides happens-before semantics for the pointer itself, but I'm not sure it says anything about the memory of the ComplexType object it points to. If the result is guaranteed, does that mean that every mutex lock (std::lock_guard<std::mutex> lck(mtx);) forces a full cache invalidation of any modified memory locations, up to the point where the memory hierarchies of the independent cores merge?

neverlastn
  • Uh, what if thread 2 acquires the mutex before thread 1? – Brian Bi Apr 13 '16 at 19:53
  • Any data that is written to memory causes that cache-line to be invalidated, but I don't know what would make you say "forces full cache-invalidation" – kmdreko Apr 13 '16 at 20:14
  • @Brian - Yes, correct - assume that the sequence is as shown above. – neverlastn Apr 13 '16 at 20:17
  • This answer to an older question suggests that yes, mutex functions issue memory barrier instructions if required by the hardware: http://stackoverflow.com/a/24143387/1401351 – Peter Apr 13 '16 at 20:25
  • @Peter - that's great. As a clarification, does that mean that any writes, by any thread or process, in that core's L1 and L2 will be marked as invalid across all other cores' caches? – neverlastn Apr 13 '16 at 20:29
  • `ct->foo = 3` happens-before `std::cout << ct->foo`, and therefore the latter is guaranteed to observe the side effect of the former. The assignment is sequenced-before `q.push(ct)`, which happens-before `q.pop()`, which is sequenced-before the read access to `ct->foo`. The "happens-before" relation is transitively closed over "happens-before" and "sequenced-before" edges. Caches and cores are irrelevant implementation details. – Igor Tandetnik Apr 13 '16 at 21:12
  • @IgorTandetnik I certainly agree, and I agree that "caches and cores are irrelevant implementation details" as far as the semantics go, but I relate them to my question because they affect performance on multicore implementations, i.e. on every recent CPU. – neverlastn Apr 13 '16 at 21:17

1 Answer


std::mutex conforms to the Mutex requirements (http://en.cppreference.com/w/cpp/concept/Mutex):

Prior m.unlock() operations on the same mutex synchronize-with this lock operation (equivalent to release-acquire std::memory_order)

Release-acquire is explained here (http://en.cppreference.com/w/cpp/atomic/memory_order):

Release-Acquire ordering

If an atomic store in thread A is tagged memory_order_release and an atomic load in thread B from the same variable is tagged memory_order_acquire, all memory writes (non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B, that is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory.

The synchronization is established only between the threads releasing and acquiring the same atomic variable. Other threads can see different order of memory accesses than either or both of the synchronized threads.

The code example in that section is very similar to yours. So it is guaranteed that all the writes in thread 1 happen-before the mutex unlock in push(), and the matching lock in pop() makes them visible to thread 2.

Of course, this assumes that "ct->foo = 3" has no special tricky meaning under which the actual assignment happens in another thread :)
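
For illustration, here is a minimal sketch of the same guarantee expressed directly with release-acquire atomics instead of a mutex, in the spirit of the cppreference example (the producer/consumer functions and the main driver are mine, added only to make the demo self-contained):

#include <atomic>
#include <cassert>
#include <thread>

struct ComplexType { int foo = 0; };

std::atomic<ComplexType*> ptr{nullptr};

void producer() {
    ComplexType* ct = new ComplexType;
    ct->foo = 3;                               // plain, non-atomic write
    ptr.store(ct, std::memory_order_release);  // release store publishes it
}

void consumer() {
    ComplexType* ct;
    while (!(ct = ptr.load(std::memory_order_acquire)))
        ;                                      // acquire load synchronizes-with the release store
    assert(ct->foo == 3);                      // guaranteed to observe the producer's write
    delete ct;
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}

The mutex version gives you the same edge: the unlock at the end of push() plays the role of the release store, and the lock at the start of pop() plays the role of the acquire load.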

Regarding cache invalidation, from cppreference:

On strongly-ordered systems (x86, SPARC TSO, IBM mainframe), release-acquire ordering is automatic for the majority of operations. No additional CPU instructions are issued for this synchronization mode, only certain compiler optimizations are affected (e.g. the compiler is prohibited from moving non-atomic stores past the atomic store-release or perform non-atomic loads earlier than the atomic load-acquire). On weakly-ordered systems (ARM, Itanium, PowerPC), special CPU load or memory fence instructions have to be used.

So it really depends on the architecture.
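
As a sketch of what those special instructions correspond to at the source level, the same pairing can be written with explicit std::atomic_thread_fence calls; on x86 the fences cost no extra CPU instructions, while on weakly-ordered CPUs the compiler emits barrier instructions (again, the names and the driver here are illustrative, not from the question):

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready{false};

void producer() {
    data = 3;                                            // plain write
    std::atomic_thread_fence(std::memory_order_release); // fence instead of a tagged store
    ready.store(true, std::memory_order_relaxed);        // publish the flag
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed))
        ;                                                // spin until the flag is seen
    std::atomic_thread_fence(std::memory_order_acquire); // pairs with the release fence
    assert(data == 3);                                   // the write to data is now visible
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}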

Vadim Key