
I found an Intel document which states that memory barriers are required when string instructions (not std::string, but assembly string instructions such as `rep movs`) are used, to prevent them from being reordered by the CPU.

However, are memory barriers also required when two threads (on two different cores) access the same memory? The scenario I had in mind: one of the cores, which doesn't "own" the cache line, writes to this memory, and the write goes into its store buffer (rather than into its cache). Is a memory barrier then required to flush the value from the store buffer into the cache, so that the other core can obtain this value?

I am unsure whether, on Intel, the MESI protocol handles this automatically.

(What I have tried to explain above is described much better in the paper below, pages 6-12:)

http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf

The above paper is very general and I am unsure how Intel CPUs practically handle the problem.

user997112
  • I think you're talking about the IvyBridge enhanced `rep movs` feature (ERMSB). It uses weakly-ordered writes, but you don't need barriers if you didn't for copying manually. (see my answer) – Peter Cordes Nov 29 '15 at 13:10

2 Answers


MESI protocols apply to caches; store buffering is essentially pre-cache, meaning the store has not yet been "released" to the outside world, and its synchronization point has not yet been determined.

You also need to keep in mind that cache coherency only guarantees that writes are not performed on stale copies of a cache line and lost along the way. The purpose of such protocols is to hide the fact that you have caches holding copies of values (itself a performance optimization) and to present the programmer/OS with the illusion of a single flat physical memory.

That, by itself, gives you no guarantee about the ordering of writes and reads across multiple cores; for that you need to manage your code with additional constructs the ISA provides, such as locks, fences, and the memory-ordering rules.

The situation you describe is not possible, because it breaks the first guarantee: a core that does not own a line cannot write to memory, since it would miss the updated data held by the core that does own the line (if one exists). What happens under a MESI protocol is that the write is buffered for a while, and when its turn comes to be issued, the core sends a request for ownership that invalidates all copies of that line in other cores (triggering a writeback if there is a modified copy) and fetches the updated data. Only then may the writing core modify the line and mark it as Modified.

However, if 2 cores write to the same line simultaneously, the MESI protocol only guarantees that these writes will occur in some order, not in the specific order you might want. Worse, if each core writes several lines and you want atomicity around those writes, MESI doesn't guarantee that either. You'll need to actively add a mutex or a barrier of some sort to force the HW to perform the writes the way you want.

Leeor
  • Could you take a look at the paper I posted? I am basically trying to understand if that is just some academic nonsense or whether the problem described does occur in the real world on Intel CPUs? – user997112 Dec 17 '14 at 23:13
  • Just a factoid: x86 has a fairly strong memory consistency model (but not sequential consistency), so in some cases no memory barrier is needed on x86 where one would be needed on ARM or Power. –  Dec 18 '14 at 03:14
  • @PaulA.Clayton, true, but that's not due to MESI (which both could be using, and is actually a micro-architectural aspect) – Leeor Dec 18 '14 at 06:49
  • @user997112, if you mean the scenario on page 8, it isn't possible on x86 with the WB memory type, since stores are not allowed to reorder with stores. Therefore step #3 cannot occur - the 2nd write has to wait for the first one to complete the ownership transfer and modification before using the cacheline, even if it's already there. Yes, that means the line may be lost and have to be requested again. This isn't related to cache protocols, but rather to store buffer and memory unit design – Leeor Dec 18 '14 at 06:59
  • @Leeor, thanks. So my question is- on Intel x86 what are memory barriers used for? (Besides preventing ordering of string instructions). Is there any additional purpose for them? – user997112 Dec 18 '14 at 09:20
  • @user997112 If you mean fence instructions, these posts may help - http://stackoverflow.com/questions/20316124/does-it-make-any-sense-instruction-lfence-in-processors-x86-x86-64 and here - http://stackoverflow.com/questions/20326280/what-is-the-impact-sfence-and-lfence-to-caches-of-neighboring-cores/20329574#20329574 – Leeor Dec 18 '14 at 11:33

I think you're talking about ERMSB (Enhanced REP MOVSB, "fast strings") in Intel IvyBridge and later, which makes `rep movs` use weakly-ordered writes.

My conclusion from Intel's docs is that you still don't need SFENCE to order those stores relative to other stores, and of course you can't run SFENCE in the middle of a `rep movsb`. See that answer for more about memory barriers on x86 in general.

AFAICT, all you need to do is avoid using the same `rep movs` to write both a buffer and the flag that readers will check to see whether the buffer is ready. A reader could see the flag before all of the stores to the buffer are visible to it. This is the only way the new ERMSB feature affects correctness for programs that were already correct (i.e. that didn't depend on flukes of timing). It has a positive effect on performance for memcpy / memset.

Peter Cordes