
The other week, I wrote a little thread class and a one-way message pipe to allow communication between threads (two pipes per thread, obviously, for bidirectional communication). Everything worked fine on my Athlon 64 X2, but I was wondering if I'd run into any problems if both threads were looking at the same variable and the local cached value for this variable on each core was out of sync.

I know the volatile keyword will force a variable to refresh from memory, but is there a way on multicore x86 processors to force the caches of all cores to synchronize? Is this something I need to worry about, or will volatile and proper use of lightweight locking mechanisms (I was using _InterlockedExchange to set my volatile pipe variables) handle all cases where I want to write "lock free" code for multicore x86 CPUs?

I'm already aware of and have used Critical Sections, Mutexes, Events, and so on. I'm mostly wondering whether there are x86 intrinsics I'm not aware of that force, or can be used to enforce, cache coherency.
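For reference, the signaling pattern I'm using looks roughly like this (variable and function names invented for this post):

```cpp
#include <intrin.h>

// One-way "message ready" signal between the two threads.
volatile long g_msg_ready = 0;  // hypothetical pipe variable

void signal_message() {
    _InterlockedExchange(&g_msg_ready, 1);  // atomic store, full barrier on x86
}

bool poll_message() {
    // Atomically consume the flag if it is set.
    return _InterlockedCompareExchange(&g_msg_ready, 0, 1) == 1;
}
```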

asked by Furious Coder; edited by Ciro Santilli OurBigBook.com
  • Are you wanting cross-platform stuff or are you on Windows or Linux? – Eclipse Feb 17 '09 at 21:50
  • Probably just Windows for now. The code base may extend to MacOS, WinMobile, and whatever the iPhone uses at some point, but initial development is under Win32/64. – Furious Coder Feb 17 '09 at 23:33
  • It's a common misconception; volatile does not mean "refresh from memory". Check the talk about lock-free programming from Fedor Pikus, where he covers "volatile" as well. https://youtu.be/lVBvHbJsg5Y?t=16m17s – avp Mar 17 '18 at 07:21

9 Answers


volatile only forces your code to re-read the value; it cannot control where the value is read from. If the value was recently read by your code, then it will probably still be in cache, in which case volatile will force it to be re-read from cache, NOT from memory.

There are not a lot of cache coherency instructions in x86. There are prefetch instructions like prefetchnta, but that doesn't affect the memory-ordering semantics. It used to be implemented by bringing the value to L1 cache without polluting L2, but things are more complicated for modern Intel designs with a large shared inclusive L3 cache.

x86 CPUs use a variation on the MESI protocol (MESIF for Intel, MOESI for AMD) to keep their caches coherent with each other (including the private L1 caches of different cores). A core that wants to write a cache line has to force other cores to invalidate their copy of it before it can change its own copy from Shared to Modified state.


You don't need any fence instructions (like MFENCE) to produce data in one thread and consume it in another on x86, because x86 loads/stores have acquire/release semantics built-in. You do need MFENCE (full barrier) to get sequential consistency. (A previous version of this answer suggested that clflush was needed, which is incorrect).

You do need to prevent compile-time reordering, because C++'s memory model is weakly-ordered. volatile is an old, bad way to do this; C++11 std::atomic is a much better way to write lock-free code.
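For illustration, a minimal sketch of that producer/consumer handoff using C++11 std::atomic (the variable names here are invented). On x86, the release store and the acquire load each compile to a plain mov with no fence instruction; the atomics mainly stop the compiler from reordering:

```cpp
#include <atomic>
#include <thread>

int payload = 0;                   // ordinary data produced by one thread
std::atomic<bool> ready{false};    // flag that publishes the payload

void producer() {
    payload = 42;                                  // plain store
    ready.store(true, std::memory_order_release);  // earlier stores can't sink below this
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // later loads can't hoist above this
        ;                                          // (a real pipe would block or yield)
    // payload is guaranteed to be 42 here: the acquire pairs with the release
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```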

answered by SoapBox; edited by Peter Cordes
  • What's the right order here then? _InterlockedExchange(); // atomic write _clflush() // sync caches _mfence() // cause a wait until caches synced Or do I need another _mfence() above the _clflush()? Thanks. – Furious Coder Feb 20 '09 at 22:37
  • AtomicWrite, Memory fence to wait for the AtomicWrite to hit the cache, CacheFlush, Memory Fence to make sure the next thing you write isn't visible until after the flush. This last fence may not be needed, I'm not sure. – SoapBox Feb 20 '09 at 22:47
  • Okay, cool, I'll try that. Of course I have to wrap the whole thing in a conditional to determine whether _cflush exists, and since the whole thing should be packed tightly, I'm guessing I should just have an inline function that decides what to do based on a runtime system info class. Thanks! – Furious Coder Feb 21 '09 at 01:20
  • -1 the whole point of 'volatile' is to force the CPU to ignore cached values. Maybe your version of 'volatile' is broken. – cmcginty Sep 21 '09 at 22:55
  • The answer is right. @SoapBox probably means the cpu cache - but what you talk about is caching a result into a register. In essence, volatile is for declaring "device register" variables - which tells the compiler "this doesn't read from memory, but from an external source" - and so the compiler will re-read it any time since it can't be sure the read value will equal the value last written. If "read" for your implementation is defined to issue a "loadw", then surely it will sometimes read from the CPU cache - but that's fine from C's point of view. – Johannes Schaub - litb Sep 22 '09 at 04:41
  • The clflush part of this answer was totally wrong. Invalidation of copies of the line in other caches happens before a line can be modified, not at write-back. If it waited until write-back, different cores could have conflicting copies of the same cache line, violating cache coherency. (And yes, then you would need `clflush` to get coherency between cores, but that's not how CPUs work. Even weakly-ordered architectures like ARM have coherent data caches). I started out making an edit just to fix the 3rd paragraph, then saw the whole rest of the answer followed from that premise... – Peter Cordes Sep 03 '16 at 05:23
  • For producer/consumer, nothing guarantees that the consumer won't check the value before the producer produces it; so in practice, if you need a fence (for the "producer produced before consumer consumed" case) then you will also need something much stronger than a fence (for the "consumer consumed before the producer produced" case), e.g. a `while(producer_hasn't_produced_yet) {` loop, possibly containing a `pthread_cond_wait()`. – Brendan Jul 05 '19 at 02:16

Cache coherence is guaranteed between cores by the MESI protocol employed by x86 processors. You only need to worry about memory coherence when dealing with external hardware that may access memory while data is still sitting in the cores' caches. That doesn't look like your case here, though, since the text suggests you're programming in userland.

  • What about multi-processor systems? – SoapBox Feb 17 '09 at 22:22
  • MESI protocol is not used in x86, but MESIF and MOESI are. – osgx Feb 17 '10 at 12:22
  • x86 does handle coherence. But read up on memory *consistency*: it's not guaranteed that all writes (such as writing the data and releasing the lock, to name two) will be visible to all CPUs in the same order! That's what the memory fences are for. – Wim Feb 27 '10 at 19:08
  • @Wim On x86/x64 Memory writes ARE guaranteed visible in the same order hence memory fences unnecessary on this platform, the only possible issue is compiler re-ordering. Read the intel developers manual or here for a short version http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf – camelccc Jan 20 '13 at 15:55
  • @camelccc: StoreStore reordering isn't allowed on x86, but stores can become globally visible after following loads. x86 loads/stores have acquire/release semantics, not sequential consistency. You can observe StoreLoad reordering in practice on real hardware: http://preshing.com/20120515/memory-reordering-caught-in-the-act/. So you're wrong that memory fences aren't needed on x86, but you're right that they're not needed *for this*. Still, you need C++ code like `var.store(newval, std::memory_order_release)` to avoid compile-time reordering even when compiling for x86. – Peter Cordes Sep 03 '16 at 04:48

You don't need to worry about cache coherency. The hardware will take care of that. What you may need to worry about are the performance costs caused by that cache coherency.

If core#1 writes to a variable, that invalidates all other copies of the cache line in other cores (because it has to get exclusive ownership of the cache line before committing the store). When core#2 reads that same variable, it will miss in cache (unless core#1 has already written it back as far as a shared level of cache).

Since an entire cache line (64 bytes) has to be read from memory (or written back to shared cache and then read by core#2), it will have some performance cost. In this case, it's unavoidable. This is the desired behavior.


The problem is that when you have multiple variables in the same cache line, the processor might spend extra time keeping the caches in sync even if the cores are reading/writing different variables within the same cache line.

That cost can be avoided by making sure those variables are not in the same cache line. This effect is known as False Sharing since you are forcing the processors to synchronize the values of objects which are not actually shared between threads.
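For illustration, a minimal sketch of keeping two hot variables on separate lines, assuming C++11 alignas and a 64-byte cache line (the struct and member names are invented):

```cpp
#include <atomic>

// Two counters updated by different threads. Without the alignment they could
// land in the same 64-byte cache line, and every increment on one core would
// invalidate the other core's copy of the line (false sharing).
struct Counters {
    alignas(64) std::atomic<long> produced{0};  // gets its own cache line
    alignas(64) std::atomic<long> consumed{0};  // gets its own cache line
};
```

C++17 later standardized a portable guess at the line size as std::hardware_destructive_interference_size in <new>.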

answered by Ferruccio; edited by Peter Cordes
  • The "has to be read from memory" bit is misleading, as the data might be snooped from another cache. – ArtemGr Jun 25 '10 at 05:53
  • I hadn't thought of that. I assume there would still be a performance cost, but not of the same magnitude as a read from RAM. – Ferruccio Jun 25 '10 at 13:32
  • I think a mention of *False Sharing* is justified here? – WiSaGaN Apr 16 '14 at 03:19
  • @WiSaGaN - isn't that what the last paragraph of my answer is describing? or am I missing something? – Ferruccio Apr 16 '14 at 13:55
  • Yeah, that's exactly what you mentioned here. Since there is already an established name for it, we can add the name here. – WiSaGaN Apr 16 '14 at 13:57

Volatile won't do it. In C++, volatile only affects compiler optimizations such as keeping a variable in a register instead of memory, or removing accesses to it entirely.
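For illustration, a tiny sketch of the kind of optimization at stake (the flag name is invented):

```cpp
bool flag = false;  // plain (non-volatile) flag, set by another thread

void spin_wait() {
    while (!flag)   // without volatile, the compiler may read `flag` once and
        ;           // compile this as `if (!flag) for (;;);`
}
// Declaring it `volatile bool flag` forces a fresh read on every iteration,
// but still provides no atomicity and no inter-thread ordering guarantees.
```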

answered by dsimcha

You didn't specify which compiler you are using, but if you're on Windows, take a look at this article here. Also take a look at the available synchronization functions here. You might want to note that in general volatile is not enough to do what you want it to do, but under VC 2005 and 2008, there are non-standard semantics added to it that add implied memory barriers around reads and writes.
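For illustration, a sketch of the publish pattern those VC-specific semantics allow (names invented; this relies on MSVC's extended volatile behavior, the default in VC 2005/2008 and /volatile:ms on later compilers, and is not portable C++):

```cpp
int data = 0;
volatile bool ready = false;

void producer() {
    data = 42;
    ready = true;   // volatile write: behaves as a release store under MSVC
}

void consumer() {
    while (!ready)  // volatile read: behaves as an acquire load under MSVC
        ;
    // data is guaranteed to be 42 here, but only by MSVC's extension;
    // standard C++ makes no such promise for volatile.
}
```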

If you want things to be portable, you're going to have a much harder road ahead of you.

answered by Eclipse

There's a series of articles explaining modern memory architectures here, including Intel Core 2 caches and many more modern architecture topics.

The articles are very readable and well illustrated. Enjoy!

answered by davidnr

There are several sub-questions in your question so I'll answer them to the best of my knowledge.

  1. There is currently no portable way of implementing lock-free interactions in C++. The C++0x proposal solves this by introducing the atomics library.
  2. Volatile is not guaranteed to provide atomicity on a multicore, and its implementation is vendor-specific.
  3. On x86, you don't need to do anything special, except declare shared variables as volatile to prevent some compiler optimizations that may break multithreaded code. Volatile tells the compiler not to cache values.
  4. There are some algorithms (Dekker's, for instance) that won't work even on an x86 with volatile variables; see the sketch after this list.
  5. Unless you know for sure that passing data between threads is a major performance bottleneck in your program, stay away from lock-free solutions. Pass data by value, or use locks.
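To illustrate point 4, a minimal sketch of the failing core of Dekker's algorithm (names invented). Each thread sets its own flag and then reads the other's; unless both accesses are sequentially consistent (or separated by a full fence such as MFENCE), x86's StoreLoad reordering lets each thread's load pass its store, so both can read 0 and enter the critical section together:

```cpp
#include <atomic>

std::atomic<int> me{0}, other{0};  // intent flags; the second thread swaps roles

void enter_critical_section() {
    me.store(1, std::memory_order_seq_cst);        // announce intent first...
    while (other.load(std::memory_order_seq_cst))  // ...then check the other thread
        ;  // (full Dekker also uses a turn variable to break ties)
    // critical section
}
// With plain volatile (or relaxed/acquire/release atomics), x86 may reorder the
// load of `other` before the store to `me`, and mutual exclusion is lost.
```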
answered by Bartosz Milewski
  • making the variable volatile is just one part of the puzzle. That does not solve the concurrency issue. Memory fencing would be necessary to make sure that the variable access is synchronized across all the processor cores. – Jay D Dec 30 '10 at 00:44
  • update: C11 and C++11 introduced std::atomic for [lock-free programming](http://preshing.com/20120612/an-introduction-to-lock-free-programming/). – Peter Cordes Sep 03 '16 at 04:52

The following is a good article on using volatile with threaded programs.

Volatile: Almost Useless for Multi-Threaded Programming.

answered by cmcginty

Herb Sutter seemed to simply suggest that any two variables should reside on separate cache lines. He does this in his concurrent queue with padding between his locks and node pointers.

Edit: If you're using the Intel compiler or GCC, you can use the atomic builtins, which seem to do their best to preempt the cache when possible.
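For reference, a minimal sketch with GCC's legacy __sync builtins, the atomic builtins available at the time (names invented; modern GCC and ICC also provide the __atomic family):

```cpp
// One-word flag published from a producer thread to a consumer thread.
static long ready = 0;  // hypothetical flag

void publish() {
    __sync_synchronize();                 // full barrier: prior writes ordered first
    __sync_lock_test_and_set(&ready, 1);  // atomic exchange (acquire barrier)
}

void consume() {
    while (__sync_val_compare_and_swap(&ready, 1, 0) != 1)
        ;  // spin until the flag is set, then atomically clear it
}
```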

answered by greyfade