I was wondering what benefits MOESI has over the MESI cache coherency protocol, and which protocol is currently favored for modern architectures. Oftentimes benefits don't translate into implementation if the costs don't allow it. Quantitative performance results of MOESI over MESI would be nice to see as well.

MOESI and MESI just specify the (stable) states and transitions between them. But there are many ways to implement them (invalidation vs. update, directory vs. snooping vs. hybrid, design of transactions, design of the cache hierarchy). The only way to fairly compare the performance of these two protocols is by using a single real processor that implements both. There is no such processor AFAIK. Even then, the comparison would be just between two of the many different implementations of the protocols. – Hadi Brais Apr 24 '18 at 01:49
2 Answers
AMD uses MOESI, Intel uses MESIF. (I don't know about non-x86 cache details.)
MOESI allows sending dirty cache lines directly between caches instead of writing back to a shared outer cache and then reading from there. The linked wiki article has a bit more detail, but it's basically about sharing dirty data. The Owned state keeps track of which cache is responsible for writing back the dirty data.
MESIF allows caches to Forward a copy of a clean cache line to another cache, instead of other caches having to re-read it from memory to get another Shared copy. (Intel since Nehalem has used a single large shared L3 cache for all cores, so all requests are ultimately backstopped by one L3 cache before checking memory anyway, but that's for all cores on one socket. Forwarding applies between sockets in a multi-socket system. Until Skylake-AVX512, the large shared L3 cache was inclusive. See also: Which cache mapping technique is used in intel core i7 processor?)
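To make the difference concrete, here's a minimal, hypothetical sketch in C of how each protocol might react when a remote cache requests a read of a line. It models only the stable states and the write-back decision, not any vendor's actual transaction design:

```c
#include <stdio.h>

/* Stable states across the three protocols.
 * OWNED is MOESI-only; FORWARD is MESIF-only. */
enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED, OWNED, FORWARD };

/* MESI: a remote read of a Modified line forces a write-back
 * (to memory or a shared outer cache); both copies end up Shared. */
enum line_state mesi_remote_read(enum line_state s, int *writeback)
{
    *writeback = 0;
    if (s == MODIFIED) { *writeback = 1; return SHARED; }
    return s;
}

/* MOESI: the dirty line is sent cache-to-cache with no write-back.
 * This cache moves to Owned and remains responsible for eventually
 * writing the dirty data back. */
enum line_state moesi_remote_read(enum line_state s, int *writeback)
{
    *writeback = 0;
    if (s == MODIFIED) return OWNED;
    return s;
}

/* MESIF: dirty lines behave as in MESI, but a clean line held in
 * Exclusive or Forward state is handed to the requester (which takes
 * it in Forward state) without re-reading memory. */
enum line_state mesif_remote_read(enum line_state s, int *writeback)
{
    *writeback = 0;
    if (s == MODIFIED) { *writeback = 1; return SHARED; }
    if (s == EXCLUSIVE || s == FORWARD) return SHARED;
    return s;
}

int main(void)
{
    int wb;
    enum line_state s;

    s = mesi_remote_read(MODIFIED, &wb);
    printf("MESI:  M -> %d, writeback=%d\n", (int)s, wb);
    s = moesi_remote_read(MODIFIED, &wb);
    printf("MOESI: M -> %d, writeback=%d\n", (int)s, wb);
    s = mesif_remote_read(FORWARD, &wb);
    printf("MESIF: F -> %d, writeback=%d\n", (int)s, wb);
    return 0;
}
```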
Wikipedia's MESIF article (linked above) has some comparison between MOESI and MESIF.
AMD in some cases has lower latency for sharing the same cache line between 2 cores. For example, see this graph of inter-core latency for Ryzen vs. quad-core Intel vs. many-core Intel (ring bus: Broadwell) vs. Skylake-X (worst).
Obviously there are many other differences between Intel and AMD designs that affect inter-core latency, like Intel using a ring bus or mesh, and AMD using a crossbar / all-to-all design with small clusters. (e.g. Ryzen has clusters of 4 cores that share an L3. That's why the inter-core latency for Ryzen has another step from core #3 to core #4.)
BTW, notice that the latency between two logical cores on the same physical core is much lower for both Intel and AMD. See: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?
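Numbers like those in the linked graphs come from ping-pong microbenchmarks. As a rough illustration of the technique (not the exact code behind any of the linked results), here's a minimal C sketch that bounces one cache line between two threads; pin the threads to different core pairs (e.g. with `taskset` or `pthread_setaffinity_np`) to compare hyper-siblings against cross-core or cross-cluster pairs:

```c
/* Compile: gcc -O2 -pthread pingpong.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 1000000
static _Atomic int flag = 0;   /* the shared cache line being bounced */

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
        atomic_store_explicit(&flag, 0, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;

    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&flag, 1, memory_order_release);
        while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("avg round trip: %.1f ns\n", ns / ROUNDS);
    return 0;
}
```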
I didn't look for any academic papers that simulated MESI vs. MOESI on an otherwise-similar model.
Choice of MESIF vs. MOESI can be influenced by other design factors; Intel's use of a large tag-inclusive shared L3 cache as a backstop for coherency traffic is their solution to the same problem that MOESI solves: when a core has the line in Modified state in a private L2 or L1d, traffic between cores is handled efficiently by writing back to L3 and then sending the data from L3 to the requesting core.
IIRC, some AMD designs (like some versions of Bulldozer-family) didn't have a last-level cache shared by all cores, and instead had larger L2 caches shared by pairs of cores. Higher-performance BD-family CPUs did also have a shared cache, though, so at least clean data could hit in L3.

re: AMD uses MOESI, Intel uses MESIF: Intel used SI for the L1 instruction cache and MESI for the L1 data and L2 caches until the Pentium 4, where all caches use MESI. At least since Nehalem and up to Haswell, [MESIF](https://patents.google.com/patent/US6922756B2/en) was used. In Haswell-EP, sophisticated variants of MESIF were implemented. I'm not sure about more recent processors. Most if not all AMD processors from the K8 up to Bulldozer use MOESI. However, starting with Bulldozer, a more sophisticated [protocol](https://patents.google.com/patent/US8732410B2/en) has been used. – Hadi Brais Apr 24 '18 at 01:49
AMD does not own a patent for MOESI. Apparently no one does. Many [Oracle processors](https://en.wikipedia.org/wiki/SPARC) use MOESI (sometimes with MESI at different cache levels). The fact that Intel owns MESIF and is still using it indicates that Intel is very happy with it and is making variants of it, while other vendors are stuck with MOESI. – Hadi Brais Apr 24 '18 at 02:56
You mentioned L3 on Skylake isn't inclusive. I am curious under what circumstances L3 will include what L2 has, and when it doesn't. Thanks – HCSF Jan 26 '20 at 03:59
@HCSF: It's not exclusive either ([NINE](https://en.wikipedia.org/wiki/Cache_inclusion_policy#NINE_Policy)), so after an L1/L2/L3 miss, they will all contain a line. But a line can be evicted from L3 (e.g. to make room for a load-miss from another core) without evicting it from inner caches. Note that's *only* for Skylake-server (SKX with AVX512), not for Skylake-client (SKL). – Peter Cordes Jan 26 '20 at 15:43
I think the F state is actually only useful with multiple sockets when you have an inclusive L3 like you mention, because within a socket there doesn't need to be a dedicated core for shared lines; they can just be read out of the L3. – Lewis Kelsey May 10 '20 at 17:50
How does a cache line become 'Owned' in the first place? When a PE modifies it first among all the PEs sharing the line (in MOESI)? So the first PE that has written the line in its cache has the obligation to update main memory? And how come other PEs cannot write to a line which is owned by another PE? This is a little hard to understand. – Chan Kim Jul 20 '22 at 09:36
@ChanKim: The MOESI Owned state happens when one core has a line in Modified state and another core wants to read it. Instead of doing a write-back and moving to Shared state (MESI), MOESI can share the dirty data. According to the wiki page, a core can still write to an Owned line, which seems weird to me. That allows caches themselves to temporarily be out of sync, and if you wanted to block StoreLoad reordering (delay a load until after a store was globally visible), `mfence` or something might have to wait for the changes to be broadcast to other cores, not just for the store to commit to L1d? – Peter Cordes Jul 20 '22 at 14:05
@ChanKim: *how come other PEs cannot write the line which is owned by another PE?* - That would not maintain *coherency*, defeating the entire purpose of the protocol! MESI is designed around the idea that a core has to get exclusive ownership of a cache line before it can update it. Otherwise you could have two conflicting versions of the same line in different caches! And no way to know how to merge them to get any sane final state. Consider an atomic increment, and what would happen if two separate cores both did the same increment on the same location (see the sketch after these comments). – Peter Cordes Jul 20 '22 at 14:08
@PeterCordes I see, now I understand when the line becomes 'Owned' by a core, and that by the rule keeping coherency, other cores can't write to the line (the state of the 'Owned' line in the owner core is seen as 'Shared' in the other cores, I guess?). Thanks a lot! – Chan Kim Jul 21 '22 at 07:16
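To see the point of the increment example from software, here's a minimal C demo (a sketch only, not a protocol implementation): two threads increment the same counter. Each individual store is still coherent because the hardware makes the writer gain exclusive ownership first, but the plain load-increment-store sequence can interleave across threads and lose updates (and is a data race, hence undefined behavior in C). `atomic_fetch_add` makes the whole read-modify-write happen while the line is held exclusively (a `lock xadd` on x86):

```c
/* Compile: gcc -O2 -pthread increment.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1000000
static long plain = 0;               /* racy counter: increments can be lost */
static _Atomic long atomic_ctr = 0;  /* coherent counter: atomic RMW */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < N; i++) {
        plain++;                          /* data race: load/inc/store interleave */
        atomic_fetch_add(&atomic_ctr, 1); /* RMW on an exclusively-held line */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("plain:  %ld (expected %d)\n", plain, 2 * N);
    printf("atomic: %ld\n", (long)atomic_ctr);
    return 0;
}
```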
MOESI is almost always superior to MESI in terms of absolute performance. However, MESI only requires 2 bits per cache line to hold the state, while MOESI requires 3 bits per cache line. Therefore, for smaller cache lines, the relative area overhead of MOESI increases. This may not be justified when the applications in the target domain exhibit very few writes to shared cache lines. Even the additional power or static energy overhead may not be tolerable in certain domains. For these reasons, MOESI might be too expensive for low-energy/low-performance/small processors. That is, MOESI would be less efficient in terms of performance-per-watt or performance-per-joule.

ARM11 uses MESI. ARM Cortex-A57 uses MESI at L1 and MOESI at L2.

Note that the decision to use a particular coherence protocol is not made independently of decisions regarding other aspects of the cache hierarchy, the interconnect, and the number of cores. These parameters influence each other.
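To put rough numbers on the area argument, here's a back-of-the-envelope sketch in C. The 32 KiB cache size and the line sizes are assumptions for illustration, not any particular design, and real implementations may encode state differently:

```c
#include <stdio.h>

int main(void)
{
    const int cache_bytes   = 32 * 1024;        /* assumed L1 data cache size */
    const int line_sizes[]  = { 32, 64, 128 };  /* assumed line sizes to compare */

    for (int i = 0; i < 3; i++) {
        int line  = line_sizes[i];
        int lines = cache_bytes / line;
        /* state bits as a fraction of the data bits they describe:
         * smaller lines -> more lines -> higher relative overhead */
        double mesi_pct  = 2.0 / (line * 8) * 100.0;
        double moesi_pct = 3.0 / (line * 8) * 100.0;
        printf("%3dB lines: %4d lines, MESI %4d bits (%.3f%%), MOESI %4d bits (%.3f%%)\n",
               line, lines, 2 * lines, mesi_pct, 3 * lines, moesi_pct);
    }
    return 0;
}
```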

When active power dominates, for at least some multithreaded workloads, avoiding cache-line ping-pong (read for ownership) when the writer/producer is fixed would be more important than saving some static power from one bit per cache tag. (A one-bit tag size increase also increases area modestly.) Workloads on non-server multiprocessor ARM systems are probably more frequently multiprogrammed rather than multithreaded (i.e., in general less communication between cores). I like that you brought out the interaction of tradeoffs. – Aug 01 '22 at 10:54