
Hello Forum – I have a few related questions about SIMD intrinsics. I searched online, including Stack Overflow, but didn't find good answers, so I'm asking for your help here.

Basically I am trying to understand how a 64-bit CPU fetches all 128 bits in a single read, and what the requirements for such an operation are (a minimal example of the kind of load I mean follows the list below).

  1. Does the CPU fetch all 128 bits from memory in a single memory operation, or does it do two 64-bit reads?
  2. Do CPU manufacturers require a certain memory-bus width? For example, for a 64-bit CPU, would Intel require a 128-bit bus for memory-bound SSE operations?
  3. Do these operations depend on the memory-bus width, the number of channels, and the number of memory modules?
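
For concreteness, here is a minimal sketch of the kind of 128-bit access I have in mind (SSE2 intrinsics; the function and variable names are just for illustration):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum two arrays of four 32-bit ints, using one 128-bit load per array.
   My question: does each _mm_load_si128 become a single 128-bit memory read? */
void add4(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m128i va = _mm_load_si128((const __m128i *)a);  /* 16-byte-aligned load */
    __m128i vb = _mm_load_si128((const __m128i *)b);
    _mm_store_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```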

1 Answer


Loads/stores don't go directly to memory (unless you use them on an uncacheable memory region). Even NT stores go into a write-combining fill buffer first.
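
As a minimal sketch of an NT store (not from any particular codebase; the function name and loop are illustrative):

```c
#include <emmintrin.h>
#include <stddef.h>

/* Fill a 16-byte-aligned buffer with NT stores. Each _mm_stream_si128
   goes into a write-combining fill buffer; once a full 64-byte line is
   assembled it is written to memory as one burst, bypassing the caches. */
void fill_nt(__m128i *dst, __m128i value, size_t n_vectors)
{
    for (size_t i = 0; i < n_vectors; i++)
        _mm_stream_si128(&dst[i], value);
    _mm_sfence();  /* make the NT stores globally visible before returning */
}
```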

Loads/stores move data between execution units and the L1D cache. CPUs internally have wide data paths from cache to execution units, and from L1D to the outer caches. See How can cache be that fast? on electronics.SE for details about Intel IvyBridge.

e.g. IvB has 128b data paths between execution units and L1D. Haswell widened that to 256 bits. Unaligned loads/stores have full performance as long as they don't cross a cache-line boundary. Skylake-AVX512 widened the paths to 512 bits, so it can do two 64-byte loads and one 64-byte store in a single clock cycle (as long as the data is hot in L1D cache).
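
A sketch of the aligned vs. unaligned distinction (assuming AVX is available, e.g. compile with `-mavx`; function names are illustrative):

```c
#include <immintrin.h>

/* Performance differs only when the unaligned load straddles a
   64-byte cache-line boundary; within one line both run at full speed. */

__m256 load_aligned(const float *p)    /* p must be 32-byte aligned,
                                          otherwise this load faults */
{
    return _mm256_load_ps(p);
}

__m256 load_unaligned(const float *p)  /* any alignment is allowed; a
                                          line-split costs a little extra */
{
    return _mm256_loadu_ps(p);
}
```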

AMD CPUs, including Ryzen, handle 256b vectors in 128b halves (even in the execution units, unlike Intel CPUs after Pentium M, which are full width internally). Even older CPUs (e.g. Pentium III and Pentium M) split 128b loads/stores (and vector ALU operations) into two 64-bit halves, because their load/store execution units were only 64 bits wide.

The memory controllers are DDR2/3/4. The bus is 64 bits wide, but uses a burst mode with a burst size of 64 bytes (not coincidentally, the size of a cache line).
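
If you want to check the line size your own system reports, one quick route (this uses a glibc-specific `sysconf` parameter; CPUID or `/sys/devices/system/cpu/` are alternative ways to get it):

```c
#include <stdio.h>
#include <unistd.h>

/* Query the L1D line size at runtime. Typically prints 64 on x86,
   matching the 64-byte DDR burst size. */
int main(void)
{
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* glibc extension */
    printf("L1D line size: %ld bytes\n", line);
    return 0;
}
```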

Being a "64-bit" CPU is unrelated to the width of any internal or external data buses. That terminology did get used for other CPUs in the past, but even P5 Pentium had a 64-bit data bus. (aligned 8-byte load/store is guaranteed atomic as far back as P5, e.g. x87 or MMX.) 64-bit in this case refers to the width of pointers, and of integer registers.


Further reading:

  • David Kanter's Haswell deep dive compares data-path widths in Haswell vs. SnB cores.

  • What Every Programmer Should Know About Memory (but note that much of the software-prefetch advice is obsolete; modern CPUs have better HW prefetchers than the Pentium 4 did). Still essential reading, especially if you want to understand how CPUs are connected to DDR2/3/4 memory.

  • Other performance links in the x86 tag wiki.

  • Enhanced REP MOVSB for memcpy for more about x86 memory bandwidth. Note especially that single-threaded bandwidth can be limited by max_concurrency / latency rather than by the DRAM controller, especially on a many-core Xeon (higher latency to L3 / memory); see the back-of-envelope sketch after this list.
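
A back-of-envelope sketch of that concurrency limit, with illustrative (not measured) numbers: 10 line-fill buffers, 64-byte lines, 80 ns load latency:

```c
#include <stdio.h>

/* Single-core bandwidth ceiling ~= outstanding line fills * line size
   / memory latency. With only ~10 fills in flight, the core can't
   saturate the DRAM controller even if DRAM itself could go faster. */
int main(void)
{
    double buffers = 10, line_bytes = 64, latency_ns = 80;
    double gb_per_s = buffers * line_bytes / latency_ns;  /* bytes/ns == GB/s */
    printf("max ~%.1f GB/s per core\n", gb_per_s);
    return 0;
}
```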

    Ice Lake is supposed to add a "Fast Short REP MOV" - whatever that's supposed to mean. – Mysticial Nov 27 '17 at 16:14
  • @Mysticial: Nice! Presumably the break-even threshold where a vector loop is better than `rep movsb` will be lower than on Skylake (where it's maybe 128 or 256 bytes for aligned pointers). – Peter Cordes Nov 27 '17 at 16:17
  • @PeterCordes - Thanks for the detailed answer and pointers. I have a follow-up question: if the bus is 64 bits wide, then why should the data be aligned to a 16-byte boundary rather than 8 bytes? – Forum Member Nov 27 '17 at 18:07
  • @ForumMember - because there is no single "bus", as Peter mentions. At least the early parts of the path to memory are 256 or 128 bits wide on modern CPUs. Beyond that there are many alignment concerns that go beyond bus width. @Peter - regarding your comment, is there any size threshold above or below which `rep movsb` is faster than a vector loop? My impression was that explicit code was faster at all sizes, certainly on Skylake and the last few generations (your code needs NT stores for large sizes though). Your comment seems to imply that `rep movsb` can be faster for larger loops? – BeeOnRope Nov 27 '17 at 18:48
  • @BeeOnRope: I thought `rep movsb` was at least worth using once you take into account I-cache effects on the rest of the program. I think glibc uses it for large enough copies on some CPUs. It certainly has a code-path for that, but I forget if it's actually set up to use it. (BTW, on Haswell/Skylake Pentium/Celeron, AVX isn't available but `rep movsb` is probably still 32-byte, so it's a big win beyond the smallest sizes.) – Peter Cordes Nov 28 '17 at 03:00
  • Right, well it's hard to take "i-cache effects" into account in a rigorous way, so usually you just talk about speed and then note as a side point whether some implementations have a large cache footprint. That said, the performance of `rep movsb` is far enough below vectorized implementations that it's hard to see the icache effect overwhelming the slower speed for almost any application unless the sizes were quite small (since the icache effect is a "one off" per call, but the better throughput has an increasing advantage with size). – BeeOnRope Nov 28 '17 at 03:11
  • I'm not really aware of high performance copy implementations really using `rep movsb` except with special requirements (e.g., the Linux kernel uses it due to the prohibition against vector instructions in the kernel, so it's the next best way to get close). – BeeOnRope Nov 28 '17 at 03:12