11

I am currently trying to learn the C++11 threading API, and I am finding that the various resources don't provide an essential piece of information: how the CPU cache is handled. Modern CPUs have a cache for each core (meaning different threads may use different caches). This means that it is possible for one thread to write a value to memory, and for another thread not to see it, even if it sees other changes the first thread also made.
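
A minimal sketch of the kind of situation I mean (the names are just mine for illustration); as written this is a data race, and the reader may never observe the writer's stores:

#include <thread>

bool ready = false;  // plain, non-atomic flag: concurrent access is a data race
int payload = 0;

void writer() {
    payload = 42;  // write the data...
    ready = true;  // ...then raise the flag, with no synchronization at all
}

void reader() {
    while (!ready) {}  // may spin forever: the write may never become visible here
    int x = payload;   // and even if it does, payload is not guaranteed to be 42
    (void)x;
}

int main() {
    std::thread t1(writer);
    std::thread t2(reader);
    t1.join();
    t2.join();
}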

Of course, any good threading API provides some way to solve this. In C++'s threading API, however, it is not clear how this works. I know that a std::mutex, for example, protects memory somehow, but it isn't clear what it does: does it clear the entire CPU cache, does it clear just the objects accessed inside the mutex from the current thread's cache, or something else?
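
For concreteness, this is the kind of usage I have in mind (a sketch with my own names):

#include <mutex>
#include <thread>

std::mutex m;
int shared_value = 0;

void thread_a() {
    std::lock_guard<std::mutex> lock(m);
    shared_value = 42;  // does unlocking flush this from A's cache? the whole cache?
}

void thread_b() {
    std::lock_guard<std::mutex> lock(m);
    int x = shared_value;  // is this guaranteed to see 42 if thread_a ran first?
    (void)x;
}

int main() {
    std::thread a(thread_a);
    std::thread b(thread_b);
    a.join();
    b.join();
}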

Also, apparently, read-only access does not require a mutex, but if thread 1, and only thread 1, is continually writing to memory to modify an object, won't other threads potentially see an outdated version of that object, thus making some sort of cache clearing necessary?

Do the atomic types simply bypass the cache and read the value from main memory using a single CPU instruction? Do they make any guarantees about the other places in memory being accessed?
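
For example, rewriting the first sketch with an atomic flag (again, just an illustration):

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;  // ordinary, non-atomic data

void writer() {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // atomic write to the flag
}

int reader() {
    while (!ready.load(std::memory_order_acquire)) {}
    return payload;  // guaranteed to read 42, or is only the flag itself guaranteed?
}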

How does memory access in C++11's threading api work, in the context of CPU caches?

Some questions, such as this one, talk about memory fences and a memory model, but no source seems to explain these in the context of CPU caches, which is what this question asks about.

john01dav
  • Possible duplicate of [Does std::mutex create a fence?](https://stackoverflow.com/questions/11172922/does-stdmutex-create-a-fence) – Preet Kukreti May 26 '18 at 02:26
  • read about the C++11 memory model, it'll become clear – C.M. May 26 '18 at 02:31
  • @john01dav Please can you answer this question after 2 years? I am still struggling to find the answer. Will values written under one thread's mutex lock be visible under another thread's mutex lock? Please explain and answer. – Diljeet Apr 26 '21 at 14:33

2 Answers

6

std::mutex has release-acquire memory-ordering semantics, so everything in thread A that happened-before thread A's unlock of the mutex (the release) must be visible to thread B before entering the critical section in thread B (the acquire).
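
In concrete terms, a minimal sketch (the unlock is the release, the lock is the acquire):

#include <mutex>

std::mutex m;
int data = 0;  // ordinary, non-atomic variable

void thread_a() {
    std::lock_guard<std::mutex> lock(m);
    data = 42;  // happens-before A's unlock (the release)
}

void thread_b() {
    std::lock_guard<std::mutex> lock(m);  // the acquire
    // If A's unlock happened before this lock, data == 42 is guaranteed to
    // be visible here; the programmer never manages the cache explicitly.
    int x = data;
    (void)x;
}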

Have a read of http://en.cppreference.com/w/cpp/atomic/memory_order to get started. Another good resource is the book C++ Concurrency in Action. Having said this, when using the high-level synchronization primitives you should be able to get away with ignoring most of these details, unless you are curious or want to get your hands dirty.

Preet Kukreti
  • From the reference you linked: "everything that took place in the critical section (before the release) in the context of thread A has to be visible to thread B (after the acquire) which is executing the same critical section." You say _before_ the critical section; this says _in_. Any idea which is more correct? – Lockyer Sep 20 '19 at 22:16
  • @Lockyer It's a frame of reference thing. The salient point is that A's critical section will happen before B's critical section executes – Preet Kukreti Oct 01 '19 at 12:19
5

I think I understand what you are getting at. There are three things at play here.

  • The C++11 standard describes what happens at the language level... locking a std::mutex is a synchronization operation. The C++ standard does not describe how this works. CPU caches do not exist as far as the C++ standard is concerned.

  • The C++ implementation, at some point, puts some machine code in your application that implements a mutex lock. The engineers creating this implementation must take into account both the C++11 spec and the architecture spec.

  • The CPU itself manages the cache in such a way as to provide the semantics necessary for the C++ implementation to work.

This may be easier to understand if you look at atomics, which translate to much smaller snippets of assembly code but still provide synchronization. For example, try this one on GodBolt:

#include <atomic>

std::atomic<int> value;

int acquire() {
    // acquire load: later memory operations cannot be reordered before it
    return value.load(std::memory_order_acquire);
}

void release() {
    // release store: earlier memory operations cannot be reordered after it
    value.store(0, std::memory_order_release);
}

You can see the assembly:

acquire():
  mov eax, DWORD PTR value[rip]
  ret
release():
  mov DWORD PTR value[rip], 0
  ret
value:
  .zero 4

So on x86, nothing extra is necessary; the CPU already provides the required memory-ordering semantics for these operations (an explicit mfence exists, but acquire/release ordering is already implied by ordinary loads and stores). This is definitely not how it works on all processors; see the Power output:

acquire():
.LCF0:
0: addis 2,12,.TOC.-.LCF0@ha
  addi 2,2,.TOC.-.LCF0@l
  addis 3,2,.LANCHOR0@toc@ha # gpr load fusion, type int
  lwz 3,.LANCHOR0@toc@l(3)
  cmpw 7,3,3
  bne- 7,$+4
  isync
  extsw 3,3
  blr
  .long 0
  .byte 0,9,0,0,0,0,0,0
release():
.LCF1:
0: addis 2,12,.TOC.-.LCF1@ha
  addi 2,2,.TOC.-.LCF1@l
  lwsync
  li 9,0
  addis 10,2,.LANCHOR0@toc@ha
  stw 9,.LANCHOR0@toc@l(10)
  blr
  .long 0
  .byte 0,9,0,0,0,0,0,0
value:
  .zero 4

Here there are explicit lwsync and isync instructions, because the Power memory model provides fewer ordering guarantees without them.
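
For contrast, back on x86 a sequentially consistent store is one case where the compiler does have to emit something extra; a sketch (the exact output varies by compiler):

#include <atomic>

std::atomic<int> value;

void release_seq_cst() {
    // Unlike the plain mov generated for the release store above, this
    // typically compiles to xchg (or mov + mfence) on x86-64.
    value.store(0, std::memory_order_seq_cst);
}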

This is just punting things down to a lower level, however. The CPU itself keeps the per-core caches coherent using a protocol such as MESI.

In the MESI protocol, when a core modifies a cache line, every other core's copy of that line must be invalidated, and a line in the Modified state is written back to memory (or transferred) before another core can read it. This is inefficient, but necessary. For this reason you don't want to shove a bunch of commonly used mutexes or atomic variables into a small region of memory: multiple cores end up fighting over ownership of the same cache line, which is known as false sharing. The Wikipedia article is fairly comprehensive and has more detail than I'm writing here.
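
A common mitigation is to pad or align hot variables onto separate cache lines. A sketch assuming 64-byte lines (C++17's std::hardware_destructive_interference_size in <new> gives a portable value, where implemented):

#include <atomic>

// Two counters updated by different threads. Without the alignment they
// could share a cache line and the cores would ping-pong its ownership.
struct Counters {
    alignas(64) std::atomic<long> a{0};  // 64 is an assumed cache-line size
    alignas(64) std::atomic<long> b{0};  // forces b onto a different line
};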

Something I'm omitting is the fact that mutexes typically require some kind of kernel-level support in order for threads to go to sleep or wake up.
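
To illustrate the distinction, a toy spinlock (a sketch, not a production lock) needs no kernel support at all, but a blocked thread burns CPU instead of sleeping the way a std::mutex waiter can:

#include <atomic>

class Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // busy-wait entirely in user space; no kernel involvement
        while (flag.test_and_set(std::memory_order_acquire)) {}
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};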

Dietrich Epp
  • `std::memory_order_acquire` isn't meaningful for `.store()` in the same way that `std::memory_order_release` isn't meaningful for `.load()`. Using it results in undefined behavior (apparently gcc maps it to something like `seq_cst` in this case rather than barfing). `clang` in this case doesn't add any barrier, and `icc` fails, telling you the order is invalid. The answer is still generally applicable though! Try `seq_cst` instead, example should work then. – BeeOnRope May 26 '18 at 03:29
  • @BeeOnRope: Agh, that was a stupid bit of copy-paste there. Thanks. – Dietrich Epp May 26 '18 at 11:22