
I have a very fundamental question: how, physically (in RTL), are caches (e.g. L1, L2) connected to cores (e.g. the Arm Cortex-A53)? How many read/write ports/buses are there, and how wide are they? Is it a 32-bit bus? How do I calculate the theoretical maximum bandwidth/throughput of the L1 cache connected to an Arm Cortex-A53 running at 1400 MHz?

There is a lot of information on the web about how caches work, but I couldn't find how they are connected.

Nee

1 Answer


You can get the information in the ARM documentation (which is pretty complete compared to others):

L1 data cache:

(configurable) sizes of 8KB, 16KB, 32KB, or 64KB.
Data side cache line length of 64 bytes.
256-bit write interface to the L2 memory system.
128-bit read interface to the L2 memory system.
64-bit read path from the data L1 memory system to the datapath.
128-bit write path from the datapath to the L1 memory system.

Note that there is a single datapath: the documentation explicitly says so when there are multiple of them, so there is almost certainly one port, unless two ports share the same datapath, which would be surprising.

L2 cache:

All bus interfaces are 128-bits wide.
Configurable L2 cache size of 128KB, 256KB, 512KB, 1MB and 2MB.
Fixed line length of 64 bytes.

General information:

One to four cores, each with an L1 memory system and a single shared L2 cache.
In-order pipeline with symmetric dual-issue of most instructions.
Harvard Level 1 (L1) memory system with a Memory Management Unit (MMU).
Level 2 (L2) memory system providing cluster memory coherency, optionally including an L2 cache.
The Level 1 (L1) data cache controller, that generates the control signals for the associated embedded tag, data, and dirty RAMs, and arbitrates between the different sources requesting access to the memory resources. The data cache is 4-way set associative and uses a Physically Indexed, Physically Tagged (PIPT) scheme for lookup that enables unambiguous address management in the system.
The Store Buffer (STB) holds store operations when they have left the load/store pipeline and have been committed by the DPU. The STB can request access to the cache RAMs in the DCU, request the BIU to initiate linefills, or request the BIU to write out the data on the external write channel. External data writes are through the SCU.
The STB can merge several store transactions into a single transaction if they are to the same 128-bit aligned address.
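As a rough illustration of that merge rule, here is a minimal C sketch (not the actual RTL): it only checks whether two store addresses fall into the same 128-bit (16-byte) aligned chunk, which is the condition quoted above.

```c
/* Minimal sketch (not the actual RTL): two stores are merge candidates
 * for the STB here if they target the same 128-bit (16-byte) aligned
 * chunk, which is the condition quoted from the documentation above. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool same_128bit_chunk(uint64_t addr_a, uint64_t addr_b)
{
    /* 128 bits = 16 bytes, so ignore the low 4 address bits. */
    return (addr_a >> 4) == (addr_b >> 4);
}

int main(void)
{
    printf("%d\n", same_128bit_chunk(0x1000, 0x1008)); /* 1: same chunk */
    printf("%d\n", same_128bit_chunk(0x1008, 0x1010)); /* 0: different  */
    return 0;
}
```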

An upper bound for the L1 bandwidth is frequency * interface_width * number_of_paths, so 1400 MHz * 64 bit * 1 ≈ 10.43 GiB/s from the L1 (reads) and 1400 MHz * 128 bit * 1 ≈ 20.86 GiB/s to the L1 (writes). In practice, concurrency can be a problem, and it is hard to know which part of the chip will be the limiting factor.
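For reference, here is a tiny C sketch of that arithmetic. It assumes one transfer per cycle on each path and uses the frequency and path widths quoted above, so it gives an upper bound, not a measured value:

```c
/* Sketch of the upper-bound calculation, assuming one transfer per
 * cycle on each path; the widths and frequency are the figures quoted
 * above for the Cortex-A53. */
#include <stdio.h>

int main(void)
{
    const double freq_hz    = 1400e6; /* core clock: 1400 MHz      */
    const double read_bits  = 64.0;   /* L1 -> datapath read path  */
    const double write_bits = 128.0;  /* datapath -> L1 write path */
    const double GiB        = 1024.0 * 1024.0 * 1024.0;

    printf("L1 read  upper bound: %.2f GiB/s\n", freq_hz * read_bits  / 8.0 / GiB); /* ~10.43 */
    printf("L1 write upper bound: %.2f GiB/s\n", freq_hz * write_bits / 8.0 / GiB); /* ~20.86 */
    return 0;
}
```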

Note that there are many other documents available, but this one is the most interesting. I am not sure you can get the physical (RTL-level) information about the cache, since I expect it to be confidential and therefore not publicly available (I guess competitors could otherwise benefit from it).

Jérôme Richard
  • Thanks @Jerome! So, if I run the lmbench3 fwr bw_mem workload, where the accesses are non-coherent and terminate at the L1 cache level, should I expect to get a throughput closer to 20 GB/s? – Nee Jul 24 '22 at 15:14
  • 2
    @Nee This is the maximum throughput you can get per core in write mode (assuming everything is perfect, including the benchmark/hardware/OS: no paging overhead, no write allocate, no task interruption, proper alignment, etc.). In practice, it will certainly a bit less but yeah it should be something like 16-20 GiB/s (note the `i` in GiB, GiB!=GB). – Jérôme Richard Jul 24 '22 at 15:49
  • Yeah, makes sense. Oh! I ignored the IEC standard :) – Nee Jul 24 '22 at 15:56
  • Any idea why the write paths are wider than the read paths? I'd have guessed it would be the other way around, with wider read paths to get cache lines in sooner. And it's not rare to read data without ever modifying it. These CPUs do normally use write-back caches, right? – Peter Cordes Jul 24 '22 at 18:07
  • @Peter, yes, it generally uses write-back, read/write-allocate. For Normal memory, Arm cores can merge the accesses while writing into main memory. Maybe this is the reason behind the wider write data path. – Nee Jul 24 '22 at 21:10
  • @Nee: AArch64 `ldp` of 2x 64-bit registers can already do a 128-bit load on Cortex-A53. Of course it's not *atomic* on CPUs before ARMv8.4a, which is why A53 can implement it as two 64-bit halves. And sure, not all loads are 16-byte pairs or SIMD loads, so probably a 64-bit load path isn't a big slowdown a lot of the time. But none of that explains the narrower path between L1d and L2. That's always transferring whole cache lines upon eviction, except for MMIO accesses (like writing to video RAM). – Peter Cordes Jul 24 '22 at 22:01
  • @Nee: Maybe part of the critical path for a load is evicting a (potentially dirty) cache line before L1d can even start receiving the clean line? If the old line has to be all gone before it can even start receiving, a wide path can help there, and then critical-word-first load can get the necessary part of the cache line ready for early restart of the pipeline before the whole line comes in. Without out-of-order exec, memory-level parallelism is presumably more limited than the modern x86 CPUs I'm used to thinking about. – Peter Cordes Jul 24 '22 at 22:03
  • This is quite surprising indeed. AFAIK, a wide write path is useful for write-through caches, but the Cortex-A53 indeed uses a WB write-allocate L1 (see section 6.2.5). My guess is that such processors are clearly not optimized for SIMD loads/stores, and I think it is reasonable to say that random reads are quite frequent compared to random writes. As a result, it makes sense to prefer write merges over read merges (ideally both, but this is a low-power processor). – Jérôme Richard Jul 25 '22 at 01:45
  • I also guess merging scalar reads would certainly introduce a non-negligible additional latency (e.g. >=1 cycle) due to the in-order execution (mainly for random reads). The STB should also add such a latency for writes, but the latency of writes is generally not an issue because it is quite rare to have instructions waiting on them. – Jérôme Richard Jul 25 '22 at 01:58
  • The Arm Cortex-A53 supports speculative instruction reads, but not speculative execution. – Nee Jul 25 '22 at 08:51
  • @JérômeRichard: I don't think you'd ever try to coalesce loads (except for a load-pair instruction of course); you want the minimum amount of buffering between sending out the address and getting the result, for latency reasons. Unlike for stores, where you have a store buffer that keeps stores sitting around anyway. But loads read directly from L1d or store-forwarding, not from a buffer. (Cache-miss loads can wait for the same line.) So yes, that can explain the load/store execution unit widths, but not the L1d <-> L2 widths. Scattered read-only accesses do lots of reads but no writes. – Peter Cordes Jul 26 '22 at 18:59
  • Store-forwarding solves the store latency problem, except for loads that only *partially* overlap with a previous store. I mean, store-to-load forwarding does have latency, but it's constant, just probing the store buffer as a CAM for every load. You're right that latency from store execution until commit to L1d is usually a non-issue. And I think store-buffer merging of stores is usually just at the commit end, like looking at the oldest 2 stores, and doing them both if they're to the same chunk. Actually yeah, enabling coalescing could explain a wider store path. – Peter Cordes Jul 26 '22 at 19:01
  • [Are there any modern CPUs where a cached byte store is actually slower than a word store?](https://stackoverflow.com/q/54217528) quotes some CPU manuals about store merging. The Alpha 21264 manual it links is very detailed about the rules for merging, implying a lot about the machinery. Also a Cortex-A15 MPCore manual. But the ECC granularity is probably still only 4 bytes on a modern ARM, to avoid an RMW cycle for word stores, so merging doesn't *need* to be wider than that. But it does help drain the SB faster to free up more space if it got full. – Peter Cordes Jul 26 '22 at 19:05