
I can't figure out the width of the bus between the CPU and the CPU cache in modern PCs. I didn't find anything reliable on the internet. All I have is a block diagram of the Zen (AMD) microarchitecture, which says that the L1 and L2 caches can transfer 32 B (256 b) per cycle. I'm guessing the bus width is 256 lines (assuming a single data rate), but there are also double-data-rate transfers, like the one between the memory controller and DDR memory.

Summarizing:

  1. Is the bus width between the CPU and the CPU cache 256 lines?
  2. If yes, does that mean that reading an entire cache line from L1 requires two CPU cycles?
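
A quick arithmetic sketch of the reasoning above, assuming the 32 B/cycle figure from the diagram and a 64-byte cache line (whether the link is single- or double-pumped is exactly what I'm unsure about):

```c
#include <stdio.h>

int main(void) {
    int bytes_per_cycle = 32;                     /* 32 B/cycle from the diagram */
    int bits_per_cycle  = bytes_per_cycle * 8;    /* = 256 bits per cycle        */

    printf("Lines needed at single data rate: %d\n", bits_per_cycle);          /* 256 */
    printf("Lines needed at double data rate: %d\n", bits_per_cycle / 2);      /* 128 */
    printf("Cycles to move a 64-byte cache line: %d\n", 64 / bytes_per_cycle); /* 2   */
    return 0;
}
```
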
Jakub Biały
  • The path between L2 and L1d is between two levels of CPU cache, not the load/store execution units. (Which are 128-bit wide in Zen, so it has to split 256-bit AVX loads/stores into 2 uops, somewhat like Intel Sandybridge did. But note that it can do 2x 128-bit loads per clock.) – Peter Cordes Aug 03 '20 at 18:43
  • The diagram clearly says 32B/cycle. (That would be burst throughput, not latency, and not necessarily sustained throughput). Whether that's 128-bit double-pumped or 256-bit normal is not specified here, although 256-bit normal is likely for this internal connection inside a single CPU core. – Peter Cordes Aug 03 '20 at 18:46
  • So, if I understand correctly, whether it is 128 or 256 lines, I can assume that the effective bus width is 256 bits? I mean the bus width to the Integer/FP units (ALU and FPU?), which isn't described. – Jakub Biały Aug 03 '20 at 18:58
  • Note the light-blue / turquoise "2x 128-bit" path from L1d -> FP/SIMD load unit (labeled LDCVT). This is Zen1, not Zen2, so the load execution units are only 128 bits wide. I'm surprised by the 32B/cycle label on the store path; IDK what that's about. `vmovaps [rdi], ymm0` is 2 uops on Zen1. https://uops.info/. – Peter Cordes Aug 03 '20 at 19:11
  • If you're looking for info about other x86 microarchitectures, https://www.realworldtech.com/haswell-cpu/ has very good stuff about Haswell, and https://www.realworldtech.com/sandy-bridge/ is similarly excellent for Sandybridge. David Kanter hasn't publicly published a similar deep dive for Zen or Skylake, unfortunately, just for Bulldozer and older. His Zen deep dive is behind a paywall: https://www.linleygroup.com/newsletters/newsletter_detail.php?num=5577. I think it was public at one point; https://www.realworldtech.com/forum/?threadid=164709&curpostid=164709, maybe in archive.org – Peter Cordes Aug 03 '20 at 19:17

1 Answer


This kind of information can be found in the optimization manuals from Intel and AMD, but it is usually given in terms of port bandwidth rather than exact bus width, because that's what most people care about.

The L1D cache in the Zen microarchitecture has 16 banks and three 128-bit ports, two of which can handle load requests and one of which can handle store requests. So the maximum core-to-L1D bandwidth is 128*3 bits per cycle. In Zen 2, the ports were widened to 256 bits each and the number of banks was cut in half, so the maximum core-to-L1D bandwidth in Zen 2 is 256*3 bits per cycle, but the chance of achieving that maximum is lower.
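
As a rough sanity check of those figures, the peak-bandwidth arithmetic can be spelled out as below; the 4.0 GHz clock is only an illustrative assumption, not a Zen specification, and `peak` is just a helper for this sketch:

```c
#include <stdio.h>

/* Peak per-cycle bandwidth implied by the port counts discussed above. */
static void peak(const char *name, int loads, int stores, int port_bits, double ghz) {
    int load_B  = loads  * port_bits / 8;   /* peak load bytes per cycle  */
    int store_B = stores * port_bits / 8;   /* peak store bytes per cycle */
    printf("%s: %d B/c load + %d B/c store = %d B/c total (~%.0f GB/s load at %.1f GHz)\n",
           name, load_B, store_B, load_B + store_B, load_B * ghz, ghz);
}

int main(void) {
    peak("Zen 1", 2, 1, 128, 4.0);   /* 2x 128-bit load ports, 1x 128-bit store port */
    peak("Zen 2", 2, 1, 256, 4.0);   /* same port counts, widened to 256 bits        */
    return 0;
}
```
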

Consider Ice Lake as an example of an Intel processor. Its L1D cache has 4 ports: two 512-bit load ports and two 256-bit store ports. The store ports can handle either a single 512-bit store request every two cycles or two 256-bit store requests per cycle, but only if the two stores are fully contained within the same cache line and have the same memory type. It appears to me that these two store ports are actually implemented as a single 256-bit-wide store port with the ability to merge two stores, so the total number of true ports on the core side seems to be three.
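
Whichever microarchitecture you're on, the per-cycle load figures above can be roughly sanity-checked with a simple loop over an L1-resident buffer. The following is a minimal sketch, assuming x86-64 with gcc or clang (compile with something like `gcc -O2 -msse2`); the buffer size, repetition count, and use of `__rdtsc()` are my own choices, and the TSC counts reference cycles rather than core cycles, so treat the printed number only as an approximation of sustained 128-bit load bandwidth:

```c
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_add_epi8 */
#include <x86intrin.h>   /* __rdtsc (gcc/clang)                */
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BUF_BYTES 16384          /* 16 KiB: comfortably inside a 32 KiB L1D */
#define REPS      200000

static alignas(64) uint8_t buf[BUF_BYTES];

int main(int argc, char **argv) {
    (void)argv;
    /* Fill with a value the compiler cannot know at compile time,
     * and warm the buffer into L1D. */
    memset(buf, argc, sizeof buf);

    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    uint64_t t0 = __rdtsc();
    for (int r = 0; r < REPS; r++) {
        /* Two independent 16-byte loads per iteration, feeding two separate
         * 1-cycle add chains, so the loop should be limited by load
         * throughput rather than ALU latency. */
        for (size_t i = 0; i < BUF_BYTES; i += 32) {
            acc0 = _mm_add_epi8(acc0, _mm_load_si128((const __m128i *)(buf + i)));
            acc1 = _mm_add_epi8(acc1, _mm_load_si128((const __m128i *)(buf + i + 16)));
        }
    }
    uint64_t t1 = __rdtsc();

    double bytes  = (double)REPS * BUF_BYTES;
    double cycles = (double)(t1 - t0);
    printf("~%.1f bytes loaded per reference cycle\n", bytes / cycles);

    /* Use the accumulators so the compiler cannot discard the loads. */
    alignas(16) uint8_t sink[16];
    _mm_store_si128((__m128i *)sink, _mm_add_epi8(acc0, acc1));
    printf("checksum byte: %u\n", (unsigned)sink[0]);
    return 0;
}
```
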

Hadi Brais
  • In the context of the Zen L1D cache, does that mean the maximum bandwidth for loads from the cache to the core is 256 bits/c? – Jakub Biały Aug 03 '20 at 19:49
  • @user12292000 128 multiplied by 3 is 384 bits/c. – Hadi Brais Aug 03 '20 at 19:51
  • Yes, but you said that there are two 128-bit ports which can handle load-type requests. – Jakub Biały Aug 03 '20 at 20:06
  • @user12292000 Oh yeah I just noticed you're talking about loads only. Right. – Hadi Brais Aug 03 '20 at 20:08
  • For Haswell/Broadwell, the only model that I could invent that explained my measurements hypothesized that the L1D cache has two full-cache-line (64-byte) read ports. Each of these ports has access to full cache lines and can deliver up to 32 contiguous bytes (with any alignment) from that line to one of the two core cache read ports. There are probably more complex hypothetical implementations that could match the data, but this matches all my observations (for read-only behavior -- writes are more confusing). – John D McCalpin Aug 05 '20 at 17:20
  • @JohnDMcCalpin It appears to me that there is some kind of contention between the load ports and the store port. You can find more details about my experiments [here](https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop). That post is mostly about Ivy Bridge, but the same observation seems to apply to Haswell as well, even though Intel gives the impression in the optimization manual that all three ports are independent of each other, which I don't think is completely true. – Hadi Brais Aug 05 '20 at 17:50