
As per my current understanding from the ARM Cortex-A57 and Cortex-A78 TRMs, micro-ops can be issued out of order to one of several execution pipelines.

As far as I understand, this is instruction reordering for independent instructions.

Memory-access reordering means that observers and slaves in a system may observe memory accesses in a different sequence than program order. This could mean one of the following:

1 - The CPU reordered the memory-access micro-ops when issuing them to the load and store pipelines; the interconnect (ACE/CHI) did not do any reordering.

2 - The CPU issued the micro-ops in program order, but the interconnect (ACE/CHI) reordered them.

Is my understanding correct? If yes, does a barrier instruction halt the CPU pipeline by stopping further instruction issue, or does the interconnect throttle the CPU master interface until the barrier transaction's response is received?
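
To make the kind of reordering I am asking about concrete, here is a minimal C++ message-passing sketch (my own illustrative example, not taken from any TRM; the variable and function names are mine). With only relaxed atomics, a second observer may see the flag update before the data update, whichever component actually did the reordering:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void producer() {
    data.store(42, std::memory_order_relaxed); // store 1
    flag.store(1, std::memory_order_relaxed);  // store 2: may become visible
                                               // to other observers first
}

void consumer() {
    while (flag.load(std::memory_order_relaxed) == 0) {} // spin until flag set
    // On ARM this assert CAN fire: the consumer observed the two stores
    // in a different order than the producer's program order.
    assert(data.load(std::memory_order_relaxed) == 42);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```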

I asked on the ARM community forum but have had no response so far:

https://community.arm.com/support-forums/f/architectures-and-processors-forum/54529/who-actually-does-the-out-of-ordering-of-the-memory-accesses-in-mpcore

EDIT 1

As per Peter's suggestion, I want to mention the following preconditions for my question:

1 - A multi-cluster ARM SoC along with other ACE masters such as a DMA engine, iGPU, etc.

2 - The question covers Inner Shareable as well as Outer Shareable memory (e.g. memory accessed by threads running in different CPU clusters).

3 - The question covers Cacheable memory (this has been clarified by Peter to a great extent) as well as Non-cacheable Normal memory, because I want to understand how memory-access observation by other observers relates to ordering in an out-of-order pipeline architecture such as the ARM Cortex-A78.

Shaibal

1 Answer


Memory reordering (of access to globally-visible cache state) happens inside the CPU core, not the interconnect. A barrier instruction doesn't send any messages to other cores.

(At least not dmb ish. I don't know about outer-shareable / non-cache coherent stuff, but those barriers might just order things wrt. cache-control instructions that you also need in those cases. The A32/T32 and A64 docs sound to me like even for stronger orders, it's still just about waiting for completion of things that were already going to happen because of other instructions, including loads or stores. There are probably more detailed docs somewhere, but maybe an ARM expert can shed some more light on this with another answer if this answer is missing anything important.)


Issuing a load micro-op to an execution unit attempts to read from cache right then. But issuing a store just copies the data+address to the store buffer. Memory reordering (of their accesses to coherent shared cache) happens inside each core, by various mechanisms including the store buffer and hit-under-miss non-blocking caches.
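
As an illustration (my own sketch, not ARM reference code), the classic store-buffer litmus test shows reordering produced entirely inside the cores: each thread's store is still sitting in its own core's store buffer when the other thread's load reads from coherent cache, so both loads can return 0:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1() {
    x.store(1, std::memory_order_relaxed);  // enters core 0's store buffer
    r1 = y.load(std::memory_order_relaxed); // may read cache before x commits
}

void t2() {
    y.store(1, std::memory_order_relaxed);  // enters core 1's store buffer
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t1), b(t2);
    a.join();
    b.join();
    // r1 == 0 && r2 == 0 is an allowed outcome (StoreLoad reordering),
    // even though the caches themselves stayed coherent throughout.
}
```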

Out-of-order execution is one significant mechanism for LoadLoad reordering (if load addresses are ready in a different order), but all major kinds of memory reordering can happen on an in-order pipeline, due to cache-miss loads and a store buffer. (And StoreStore reordering can happen if the store buffer allows out-of-order commit of stores, which ARM implementations normally would, since the memory model doesn't guarantee StoreStore ordering.)

My understanding is that interconnects generally don't introduce reordering themselves. So memory barriers just have to make things inside this core wait until earlier loads have completed and/or the store buffer drains.
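
For example (a sketch based on my reading of the usual compiler mappings, not authoritative), fixing the message-passing pattern only requires ordering within each core. On AArch64, compilers typically emit `stlr` for a release store, `ldar` for an acquire load, and `dmb ish` for `std::atomic_thread_fence(std::memory_order_seq_cst)`; none of these sends a message to other cores:

```cpp
#include <atomic>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void producer() {
    data.store(42, std::memory_order_relaxed);
    // Release store: earlier stores must commit from this core's store
    // buffer before this one becomes visible. Typically compiles to stlr.
    flag.store(1, std::memory_order_release);
}

int consumer() {
    // Acquire load: later loads can't complete before this one.
    // Typically compiles to ldar.
    while (flag.load(std::memory_order_acquire) == 0) {}
    return data.load(std::memory_order_relaxed); // guaranteed to be 42
}
```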


Peter Cordes
  • Thanks for the detailed reply. About the interconnect part, "So memory barriers just have to make things inside this core wait until earlier loads have completed and/or the store buffer drains": what is the point of `DMB OSH` then? The ARM manual consistently talks about observers, which could be observers inside the core (I-side, D-side, MMU) or CPU clusters and DMA engines. ACE has barrier-transaction support for blocking transactions until the transactions before the barrier complete. Also, what happens when independent memory loads are from non-cacheable memory? – Shaibal May 04 '23 at 16:36
  • @Shaibal: Ok, I was only thinking of the kinds of memory barriers that are used for stuff like `std::atomic`, where all threads are on cores in the same inner-shareable domain, thus cache coherent. I don't know the ARM details about outer-shareable and what more `dmb sy` or `dmb osh` has to do. This is kind of a generic answer that I think is true for all systems with coherent cache between all cores, but might not be accurate outside that. If there's more to say, hopefully an ARM expert will fill in the details. (But a big.LITTLE CPU with A57 and A78 cores will have them all coherent.) – Peter Cordes May 04 '23 at 16:56
  • @Shaibal: Cache-miss stores don't commit to L1d (and become visible to other cores) until after the RFO (read-for-ownership) completes (giving the old value of the cache line, whether that came from DRAM or dirty last-level cache). The ordering of DRAM accesses for stores isn't directly part of the memory order observed by other cores; it's the coherency protocol and the order of commit from store buffer to L1d that determines that. – Peter Cordes May 05 '23 at 04:29
  • @Shaibal: Also note that ARMv8 requires systems to be multi-copy atomic, so all cores agree on the order of stores done by two independent cores. (Unlike PowerPC, which can do [IRIW reordering](https://stackoverflow.com/questions/27807118/will-two-atomic-writes-to-different) in practice, vs. ARMv7 only on paper; see the IRIW sketch after these comments.) Also, https://developer.arm.com/documentation/ka002179/latest discusses barriers outside CPU cores, and says ARMv8 systems don't need them (because they're required to already have the Multi_Copy_Atomicity property). That might be the kind of ARM-specific answer you're looking for. – Peter Cordes May 05 '23 at 04:33
  • Thanks @Peter. Sorry, I deleted my comment thinking it may not make sense; I was trying to come up with a different comment and then saw your two replies. :) I will go through the link you shared. Also, sorry that I haven't accepted your answer yet, as it did not clarify some of the things in the ARM architecture. Maybe through discussion in the comments I can clarify all the points with you :) – Shaibal May 05 '23 at 04:53
  • @Shaibal: No worries, as I said I'm not sure it 100% covers everything that might be relevant on ARM considering stuff other than memory-ordering between CPU cores for normal cacheable memory (i.e. what matters for C++ std::atomic with std::thread). You should probably edit your question to highlight that you're asking about things other than inner-shareable cacheable memory. – Peter Cordes May 05 '23 at 04:58
  • Hi @Peter, in case you have bandwidth, I have created a chat room so that we can continue there and avoid adding too many comments. Room: "Discussion on ARMv8 CPU architecture". I have sent you an invitation too. – Shaibal May 05 '23 at 05:57
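
To illustrate the multi-copy-atomicity point from the comments above, here is a hypothetical IRIW (Independent Reads of Independent Writes) litmus test in C++; the names are mine, and this is a sketch rather than a definitive test harness:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer1() { x.store(1, std::memory_order_relaxed); }
void writer2() { y.store(1, std::memory_order_relaxed); }

void reader1() {
    r1 = x.load(std::memory_order_acquire); // ldar on AArch64
    r2 = y.load(std::memory_order_acquire);
}

void reader2() {
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread a(writer1), b(writer2), c(reader1), d(reader2);
    a.join(); b.join(); c.join(); d.join();
    // The "split" outcome r1==1, r2==0, r3==1, r4==0 would mean the two
    // readers saw the two independent writes in opposite orders. ARMv8's
    // multi-copy atomicity forbids it at the hardware level with acquire
    // loads (C++ itself only guarantees this with seq_cst); on PowerPC it
    // can happen in practice without stronger barriers.
}
```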