
What is the purpose of the x86 LOCK prefix, if the MESI protocol prevents other cores from writing to "exclusive"-ly owned data anyway?

I am getting a little confused about what LOCK provides versus what MESI provides.

I understand the MESI protocol is about ensuring the cores all see a consistent state of memory, but as I understand it, it also prevents cores from writing to memory that another core is already writing to?

Peter Cordes
user997112
    Consider the difference between a locked and an unlocked increment: if two cores simultaneously increment a value of 0 in memory, they both decide to write 1 to the memory. Cache coherency doesn't prevent the loss of one of the updates. – Kerrek SB Apr 26 '15 at 17:00
  • @KerrekSB sorry, I didn't quite understand what you meant. Wouldn't cache coherency ensure only one of the cores owned the cacheline being written to? Therefore if the second core went to modify the cacheline whilst the first was incrementing, MESI would prevent the write until the first core had completed the increment? – user997112 Apr 26 '15 at 17:56
    Incrementing isn't an atomic operation. First both cores read the same value (0) off a clean cache line. Then they both compute the new value (1). Then they both write the value back. At that writing stage, cache coherency doesn't help because writing to memory is not immediately affected by coherency (only *reading* from dirtied pages is). – Kerrek SB Apr 27 '15 at 07:39

2 Answers


The MESI protocol makes the memory caches effectively invisible. This means that multithreaded programs don't have to worry about a core reading stale data from them or two cores writing to different parts of a cache line and getting half of one write and half of the other sent to main memory.

However, this doesn't help with read-modify-write operations such as increment, compare and swap, and so on. The MESI protocol won't stop two cores from each reading the same chunk of memory, each adding one to it, and then each writing the same value back, turning two increments into one.

On modern CPUs, the LOCK prefix locks the cache line so that the read-modify-write operation is logically atomic. The following sequences are oversimplified, but hopefully they'll give you the idea.

Unlocked increment:

  1. Acquire cache line, shareable is fine. Read the value.
  2. Add one to the read value.
  3. Acquire cache line exclusive (if not already E or M) and lock it.
  4. Write the new value to the cache line.
  5. Change the cache line to modified and unlock it.

Locked increment:

  1. Acquire cache line exclusive (if not already E or M) and lock it.
  2. Read value.
  3. Add one to it.
  4. Write the new value to the cache line.
  5. Change the cache line to modified and unlock it.

Notice the difference? In the unlocked increment, the cache line is only locked during the write memory operation, just like all writes. In the locked increment, the cache line is held across the entire instruction, all the way from the read operation to the write operation and including during the increment itself.

Also, some CPUs have things other than memory caches that can affect memory visibility. For example, some CPUs have a read prefetcher or a posted write buffer that can result in memory operations executing out of order. Where needed, a LOCK prefix (or equivalent functionality on other CPUs) will also do whatever needs to be done to handle memory operation ordering issues.

David Schwartz
  • "The MESI protocol won't stop two cores from each reading the same chunk of memory ... turning two increments into one." I'm confused by this statement, as it seems to be in contradiction to what Paul McKenney writes [in this article](http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf). For example, MESI transitions (e) and (f) on page 5 of the article are caused by RMW operations. Moreover, if two CPUs attempt to invalidate the same cache line simultaneously, then according to the article, one will succeed in putting the cache line in the M state before the other. – void-pointer Apr 11 '16 at 08:32
  • @void-pointer I'm not sure what to tell you other than to read my answer again. Suppose two CPUs are doing an increment. They both read, they both increment, then they both try to write. How does the fact that one write will occur before the other prevent a lost increment? Clearly it doesn't. You need to lock the cache line during the increment, and that's what the LOCK prefix does. Transition (e) is an atomic read-modify-write operation, and that's precisely what the question was asking why we need. And the answer is that ordering the writes isn't enough. – David Schwartz Apr 11 '16 at 08:47
  • I apologize for the stupid questions -- I'm trying to learn more about cache coherency. To my understanding, a CPU must have a cache line in the M state in order to modify it, and exactly one CPU can have a given cache line in the M state at a time. So I don't see how both CPUs could simultaneously increment the same cache line. If two CPUs attempt the RMW simultaneously, then (to my understanding) one will "win" and force the other to invalidate its copy of the cache line. The loser would then have to issue a read-invalidate in order to gain exclusivity before performing its RMW. – void-pointer Apr 11 '16 at 09:44
  • In particular, I don't see how step (2) of the "unlocked increment" procedure in your answer is compatible with MESI (at least as described in McKenney's article). Isn't a CPU prohibited from incrementing a cache line in the shared state? – void-pointer Apr 11 '16 at 09:50
    @void-pointer It doesn't increment the cache line, it increments the value it read. Each CPU does a read and reads a value, the cache line is now shared. Then each CPU increments the value it read. Then each CPU does a write (getting the cache line exclusively in turn). This is what a normal increment is -- a read, a modify, and then a write. By contrast, a locked increment is an atomic read-modify-write operation that locks the cache line during the increment. – David Schwartz Apr 11 '16 at 09:51
    Thanks! Now I see what you mean -- a CPU is free to increment the value of the variable in its register without modifying the cache line. Then each can write the result in succession. – void-pointer Apr 11 '16 at 09:52
  • Thanks @DavidSchwartz for the answer! From your answer, I understand that Lock prefix is used to lock the cache line, but on the other hand when reading the top answer in this link https://stackoverflow.com/questions/3339141/x86-lock-question-on-multi-core-cpus, it seems "cache locking" is used to realize Lock prefix (in P6 and more recent processors). So my question is, which one comes first? Which is used to do what? – jack Jul 29 '20 at 00:30
  • @jack Modern CPUs have heavily focused on inter-core optimizations because multicore systems are now the norm rather than the exception. On modern CPUs, only the particular cache line will be locked. – David Schwartz Jul 29 '20 at 01:11
  • @DavidSchwartz Thanks for the answer. So what's confusing me is, from the developer manual of intel it says in modern CPU, Lock prefix relies on "cache coherency mechanism to insure that the operation is carried out atomically", however, when we explain cache coherency, we say "cache coherency relies on Lock prefix when writing back to main memory" (see https://stackoverflow.com/questions/3339141/x86-lock-question-on-multi-core-cpus), so it's like a loop here. `Should I say cache coherency uses Lock prefix, or Lock prefix uses cache coherency`? – jack Jul 29 '20 at 07:51
  • @jack When the lock prefix is used, the cache coherency system locks the cache line during the operation so that it is effectively atomic. Cache coherency doesn't really rely on the lock prefix -- it doesn't care whether the result of an operation is correct or incorrect, it just does what it's told to do. – David Schwartz Jul 29 '20 at 08:08
  • @DavidSchwartz Thanks! Perhaps I mixed up the concepts, `Lock prefix` and `Lock# signal`, maybe it would be correct to say "`Lock prefix uses, either Lock# signal to lock bus, or cache coherence without Lock# signal to lock cache line (or say, cache).`", or say, in some particular cases (when the area of memory being locked is completely contained in a cache line) of modern CPU, Lock prefix == cache coherence, Is this correct? – jack Jul 29 '20 at 15:15
    I wouldn't say "lock prefix == cache coherence". A cache can be perfectly coherent but an increment can be implemented as a non-atomic read/modify/write sequence. The lock prefix causes the cache coherency mechanism to be ordered to lock the cache line during that entire sequence, something not required for the cache to merely be coherent. – David Schwartz Jul 29 '20 at 15:39
    "Should I say cache coherency uses Lock prefix, or Lock prefix uses cache coherency?" My belt is supposed to hold up my pants. But my pants have loops to hold my belt. Who is the real hero? (RIP, Mitch Hedberg. I used to miss him. (I still do, but I used to, too.)) – David Schwartz Jan 05 '21 at 10:06
    @jack: Intel's manuals still talk about bus locks and an external Lock# signal, but that's ancient history for atomic RMWs **on WB cacheable memory**, aligned or at least not splitting between two cache lines. A cache-line-split `lock`ed instruction *does* have to do something disastrously slow that impacts memory access for all cores. There's a performance counter just for that, and recent CPUs let the OS make that fault to help them detect accidental cases. But in the normal case an atomic RMW just has to delay MESI response, no extra off-core traffic. – Peter Cordes Sep 22 '22 at 02:15

Yep, you're confusing two different things. The MESI protocol is a cache-coherence protocol that ensures each core/processor gets the most up-to-date data from other processors' caches (or from memory) when requested. If a cache line is in the 'E' state, that tells the requesting processor that exactly one other processor has a copy of this line. That's all it does; the 'E' state does not in any way prevent the requesting processor from accessing the data; it just states the fact that only one processor has a copy of the data (and that copy is also consistent with what is in memory). So, if a core requests data that is in the 'E' state, the core will get a copy of it. The other copy, which was in 'E', will be changed based on whether the core is requesting the copy for writing or for reading: if it is being requested for writing, the old copy will be invalidated ('I' state), and if it is for reading, the old copy will be put into the shared ('S') state.

waleed
  • You got the functionality of the LOCK prefix wrong. There's no harm to two simultaneous writes -- with or without LOCK, you'll see one of them and it's arbitrary which. – David Schwartz Jan 08 '16 at 21:22
  • @DavidSchwartz Thank you David for the comment, I removed that part of the answer. By the way, my comment about the LOCK prefix being used in implementing mutual exclusion was correct. Also, more importantly, the order of writes can NOT be arbitrary as you mentioned and has to follow the memory consistency model of the architecture; otherwise memory won't be consistent (from the program's and programmer's viewpoint). – waleed Jan 12 '16 at 18:00