
DDR5's architecture replaces each of DDR4's 64-bit-wide data buses with two independent 32-bit-wide data buses.

Based on this, one would hope that DDR5 can perform twice as many random accesses per unit time when working with large tables (larger than L3) whose individual entries are <= 32 bits.

However, I've written a benchmark that performs a fixed number of random accesses to a table of fixed total byte size, once with uint32 entries and once with uint64 entries. The results suggest that my assumption is not valid:

[benchmark results plot]

benchmark code: https://gist.github.com/L0laapk3/049215bc02e5434b55528955e3e29f11

(Tested on a 13900KF with 4x 5000 MHz CL34-38-38 DDR5 RAM running gear 2, CPU affinity set to only the performance cores)
(compiled using clang 15.0.5 with -Ofast -march=alderlake)
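
For reference, a minimal sketch of the kind of benchmark described above (the actual code is in the linked gist; the table size, access count, and xorshift index generation here are illustrative stand-ins rather than copies of it):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <chrono>
#include <vector>

// Time a fixed number of random reads from a table of fixed total byte size.
// Only the element type T changes between the two runs.
template <typename T>
double randomAccessSeconds(std::size_t tableBytes, std::size_t accesses) {
    std::vector<T> table(tableBytes / sizeof(T), T{1});

    std::uint64_t x = 88172645463325252ull;  // xorshift64 state; index cost is identical for both types
    T sum{};

    auto start = std::chrono::steady_clock::now();
    for (std::size_t n = 0; n < accesses; ++n) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;  // cheap pseudo-random index
        sum += table[x % table.size()];           // the random access being measured
    }
    auto stop = std::chrono::steady_clock::now();

    volatile T sink = sum;                        // keep the loads from being optimized away
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    const std::size_t tableBytes = std::size_t{1} << 30;  // 1 GiB, far larger than L3
    const std::size_t accesses   = 100'000'000;
    std::printf("uint32: %.3f s\n", randomAccessSeconds<std::uint32_t>(tableBytes, accesses));
    std::printf("uint64: %.3f s\n", randomAccessSeconds<std::uint64_t>(tableBytes, accesses));
}
```

The index generator is deliberately identical for both element types, so only the element width differs between the two timed runs.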

Is there a flaw in my testing methodology, or is my understanding of RAM lacking?

L0laapk3
  • The CPU will still fetch whole cache lines of 64 bytes, using 16-transfer bursts instead of 8. Technically it can now fetch 2 cache lines simultaneously. Or rather, 4 instead of 2 in case of dual channel, I suppose. – ElderBug Nov 26 '22 at 20:25
  • @ElderBug I suspected something like that might have been the case, but if this is true, then I don't understand the point of DDR5's change to 32-bit data lanes? – L0laapk3 Nov 26 '22 at 20:31
  • I think you still got the rough underlying idea right. It helps reduce latency in some cases, because changing addresses is costly. Now you have twice the ports with independent addresses, so possibly you can have two independent sequential streams at no cost. I'm not sure of the minute details. Maybe it also helps exploit chip-level dual port to avoid bank conflicts. It's just that you can't test any of that by just comparing 4-byte and 8-byte accesses: most accesses on the bus are 64 bytes. – ElderBug Nov 26 '22 at 20:42
  • You'd have to allocate pages of uncacheable memory to even test this. Reading from normal cacheable memory (like `new` will use, thus `std::vector`) will fill the whole 64-byte line. But you won't find a difference; the minimum burst length is 32 bytes, I think. https://en.wikipedia.org/wiki/DDR5_SDRAM shows write commands having only one bit for a `BL` signal to specify a burst length other than the default 16 cycles (64 bytes). And memory-level parallelism for a single core is probably more limited by the core (LFBs and superqueue entries) than by the memory controller itself. – Peter Cordes Nov 26 '22 at 22:11
  • I'd guess part of the reason for the change is to narrow the number of bits that need to be exactly in sync with each other; as clock speeds keep climbing, the timing tolerances get tighter for parallel buses where all bits are clocked from the same clock. (Unlike PCIe, where you have multiple self-clocked serial links.) But maybe not, it's still a parallel bus with probably only one clock, not independent clocks for the two halves(?). More likely ElderBug's point is the more important one: hiding command latency by creating parallelism. Latency in clocks gets rather high at high frequency. – Peter Cordes Nov 26 '22 at 22:18
  • @PeterCordes Regarding the saturation of the RAM by 1 core, Raptor Lake CPUs are based on Raptor Cove, which is very close to Golden Cove and has pretty large buffers. For example, the load queue contains 192 entries while Skylake had only 72 entries. I also expect a similar jump for the LFB, but it seems this information is not provided by Intel (and I cannot find it). The superqueue entries seem undocumented too. Thus, it might be enough to saturate the 2 channels @ 5000 MHz providing a bandwidth of 74.5 GiB/s. – Jérôme Richard Nov 27 '22 at 01:06
  • @JérômeRichard: Saturating bandwidth for sequential accesses has help from the prefetcher. I was thinking that saturating whatever throughput limit the DRAM controllers have for random access might be worse, but it might only be the difference between LFB vs. superqueue (given that the main prefetcher is in L2). And with extra time for switching between DRAM "pages", that might actually slow things down vs. back-to-back burst transfers. So now I'm not too sure. Client CPUs can normally come pretty close to saturating sequential read/write memory bandwidth, so might for this, too. – Peter Cordes Nov 27 '22 at 01:27
  • @PeterCordes Very interested in learning this level of detail about RAM. Is it still worth memorising "What Every Programmer Should Know About Memory", or have things changed too much since it was written / is there a better reference? – intrigued_66 Jan 16 '23 at 13:04
  • @intrigued_66 Consider reading [How much of ‘What Every Programmer Should Know About Memory’ is still valid?](https://stackoverflow.com/a/47714514/12939557) (where the first answer was written by Peter ;) ). – Jérôme Richard Jan 16 '23 at 13:32
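
To make the point from the comments above concrete: with normal cacheable memory, every random lookup that misses the caches pulls in one whole 64-byte line, whether the element is 4 or 8 bytes wide. A rough back-of-envelope model under that assumption (the bandwidth figure is the 74.5 GiB/s for dual-channel DDR5-5000 mentioned above; the access count is illustrative):

```cpp
#include <cstdio>

int main() {
    // Assumptions (taken from the comments above, not measured):
    // - each cache-missing random access transfers one full 64-byte cache line from DRAM,
    //   whether the table element is uint32_t or uint64_t;
    // - ~74.5 GiB/s peak bandwidth for dual-channel DDR5-5000
    //   (5000 MT/s * 16 bytes of total bus width = 80 GB/s ~= 74.5 GiB/s).
    const double lineBytes       = 64.0;
    const double peakBytesPerSec = 74.5 * 1024.0 * 1024.0 * 1024.0;
    const double accesses        = 100e6;  // illustrative access count

    const double dramTraffic = accesses * lineBytes;  // identical for 4-byte and 8-byte elements
    std::printf("DRAM traffic: %.2f GiB\n", dramTraffic / (1024.0 * 1024.0 * 1024.0));
    std::printf("Bandwidth-only lower bound on runtime: %.3f s\n", dramTraffic / peakBytesPerSec);
}
```

Under these assumptions, uint32 and uint64 tables of the same total byte size generate exactly the same number of 64-byte line transfers, which would explain why the two element sizes perform essentially the same in the benchmark.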

0 Answers