
When a cache miss occurs, the CPU fetches a whole cache line (typically 64 bytes on x86_64) from main memory into the cache hierarchy.
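
For reference, a minimal sketch (assuming Linux/glibc, where the _SC_LEVEL1_DCACHE_LINESIZE extension is available) that queries the cache line size on the current machine:

```c
/* Minimal sketch: ask the OS for the L1 data cache line size.
 * _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension; it typically reports 64 on x86_64. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1D cache line size: %ld bytes\n", line);
    return 0;
}
```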

This is done via a data bus, which is only 8 bytes wide on modern 64-bit systems (matching the 8-byte word size).

EDIT: In this context, "data bus" means the bus between the CPU die and the DRAM modules. Its width does not necessarily correlate with the word size.

Depending on the strategy, the actually requested address is fetched first, and then the rest of the cache line is fetched sequentially.

It would seem much faster if there were a bus 64 bytes wide, which would allow a whole cache line to be fetched at once. (This would be eight times the word size.)

Perhaps there could be two different data bus widths: one for standard cache line fetching, and one for external hardware (DMA) that works only with word-sized memory accesses.

What are the factors that limit the width of the data bus?

Mike76
  • There isn't such a thing as "the data bus" any more. Data moves over many buses in modern CPUs and they can have different widths. – David Schwartz Aug 28 '16 at 08:12
  • With the term "data bus" I mean the bus between CPU and RAM. I am aware that there are many other buses, but I did not know any other term to describe this bus. – Mike76 Aug 28 '16 at 08:21
  • Even that term is ambiguous. The term "CPU" can mean the physical CPU die or just the parts of that die that perform the CPU function. So you could be referring to either the bus between the CPU and the RAM controller or the bus between the RAM controller and the RAM. Also, the bus between CPUs is also sometimes between the CPU and RAM (when one CPU accesses RAM connected to another CPU). There really isn't one data bus any more. – David Schwartz Aug 28 '16 at 08:31
  • For DDR4 DRAM, the data bus is 64-bits wide for each module, and the CPU can talk to more than one module at a time. – David Schwartz Aug 28 '16 at 08:43
  • Really this should be asked on an electronics-related forum. The trade-offs between narrower and wider buses are complex. You might think that wider always allows bigger bandwidth, but things like skew and cross-talk between wires make that true only up to a point, and numerous factors influence the position of that point. – AProgrammer Aug 28 '16 at 11:39

2 Answers


I think DRAM bus width expanded to the current 64 bits before AMD64. It's a coincidence that it matches the word size. (P5 Pentium already guaranteed atomicity of 64-bit aligned transfers, because it could do so easily with its 64-bit data bus. Of course that only applied to x87 (and later MMX) loads/stores on that 32-bit microarchitecture.)

See below: High Bandwidth Memory does use wider busses, because there's a limit to how high you can clock things, and at some point it does become advantageous to just make it massively parallel.

It would seem much faster if there were a bus 64 bytes wide, which would allow a whole cache line to be fetched at once.

Burst transfer size doesn't have to be correlated with bus width. Transfers to/from DRAM do happen in cache-line-sized bursts. The CPU doesn't have to send a separate command for each 64 bits, just one to set up the burst transfer of a whole cache line (read or write). If it wants less, it actually has to send an abort-burst command; there is no "single byte" or "single word" transfer command. (And yes, that SDRAM wiki article still applies to DDR3/DDR4.)
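
As a back-of-the-envelope sketch of that arithmetic (the 64-byte line and 64-bit bus are the typical values from the question, not universal constants):

```c
/* How many bus transfers ("beats") one cache-line burst needs on a 64-bit
 * DDR data bus: 64 bytes / 8 bytes per beat = 8, which matches the fixed
 * burst length (BL8) of DDR3/DDR4 -- a single command sets up all 8 beats. */
#include <stdio.h>

int main(void) {
    int cache_line_bytes = 64;  /* typical x86_64 cache line */
    int bus_width_bytes  = 8;   /* one 64-bit DDR channel */
    printf("beats per burst: %d\n", cache_line_bytes / bus_width_bytes);  /* 8 */
    return 0;
}
```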

Were you thinking that wider busses were necessary to reduce command overhead? They're not. (SDRAM commands are sent over separate pins from the data, so commands can be pipelined, setting up the next burst during the transfer of the current burst. Or starting earlier on opening a new row (DRAM page) on another bank or chip. The DDR4 wiki page has a nice chart of commands, showing how the address pins have other meanings for some commands.)


High speed parallel busses are hard to design. All the traces on the motherboard between the CPU socket and each DRAM socket must have the same propagation delay within less than 1 clock cycle. This means having them nearly the same length, and controlling inductance and capacitance to other traces because transmission-line effects are critical at frequencies high enough to be useful.

An extremely wide bus would stop you from clocking it as high, because you couldn't achieve the same tolerances. SATA and PCIe both replaced parallel busses (IDE and PCI) with high-speed serial busses. (PCIe uses multiple lanes in parallel, but each lane is its own independent link, not just part of a parallel bus).

It would just be completely impractical to use 512 data lines from the CPU socket to each channel of DRAM sockets. Typical desktop / laptop CPUs use dual-channel memory controllers (so two DIMMs can be doing different things at the same time), so this would be 1024 traces on the motherboard, and pins on the CPU socket. (This is on top of a fixed number of control lines, like RAS, CAS, and so on.)
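
A quick sketch of how those numbers add up (data lines only; the real total would be higher once address, control, clock, and power pins are counted):

```c
/* Rough trace count for the hypothetical 512-bit-per-channel data bus. */
#include <stdio.h>

int main(void) {
    int data_lines_per_channel = 64 * 8;  /* one 64-byte cache line per transfer = 512 lines */
    int channels               = 2;       /* typical desktop dual-channel controller */
    printf("data traces: %d\n", data_lines_per_channel * channels);  /* 1024 */
    return 0;
}
```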

Running an external bus at really high clock speeds does get problematic, so there's a tradeoff between width and clock speed.


For more about DRAM, see Ulrich Drepper's What Every Programmer Should Know About Memory. It gets surprisingly technical about the hardware design of DRAM modules, address lines, and mux/demuxers.

Note that RDRAM (RAMBUS) used a high speed 16-bit bus, and had higher bandwidth than PC-133 SDRAM (1600MB/s vs. 1066MB/s). (It had worse latency and ran hotter, and failed in the market for some technical and some non-technical reasons).
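
A sketch of where those peak figures come from (bus width in bytes times transfer rate; the 800 MT/s number assumes PC800 RDRAM):

```c
/* Peak bandwidth = bytes per transfer * transfers per second. */
#include <stdio.h>

int main(void) {
    double rdram_pc800 = (16 / 8.0) * 800e6;     /* 16-bit bus at 800 MT/s */
    double sdram_pc133 = (64 / 8.0) * 133.33e6;  /* 64-bit bus at 133.33 MT/s */
    printf("PC800 RDRAM:   %.0f MB/s\n", rdram_pc800 / 1e6);  /* 1600 */
    printf("PC-133 SDRAM: ~%.0f MB/s\n", sdram_pc133 / 1e6);  /* ~1067 */
    return 0;
}
```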


I guess that it helps to use a wider bus up to the width of what you can read from the physical DRAM chips in a single cycle, so you don't need as much buffering (lower latency).

Ulrich Drepper's paper (linked above) confirms this:

Based on the address lines a2 and a3 the content of one column is then made available to the data pin of the DRAM chip. This happens many times in parallel on a number of DRAM chips to produce a total number of bits corresponding to the width of the data bus.
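
As a sketch of that parallelism (the x8 organization is assumed for illustration; DRAM devices also come in x4 and x16 widths):

```c
/* Several DRAM chips read out in lockstep to fill the module's data bus:
 * e.g. eight x8 devices in one rank give a 64-bit bus (an ECC DIMM adds a
 * ninth chip for 72 bits). */
#include <stdio.h>

int main(void) {
    int bits_per_chip  = 8;  /* an x8 DRAM device */
    int chips_per_rank = 8;  /* one rank on a non-ECC DIMM */
    printf("module data bus: %d bits\n", bits_per_chip * chips_per_rank);  /* 64 */
    return 0;
}
```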


Inside the CPU, busses are much wider. Core2 to IvyBridge used 128-bit data paths between different levels of cache, and from execution units to L1. Haswell widened that to 256b (32B), with a 64B path between L1 and L2.
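
To put those widths in perspective, a rough sketch (the 3 GHz clock is an assumed figure, not from the answer):

```c
/* Peak on-die bandwidth = path width per cycle * core clock. */
#include <stdio.h>

int main(void) {
    double clock_hz     = 3e9;  /* assumed core clock */
    int    l1_l2_path_B = 64;   /* bytes per cycle on Haswell's L1<->L2 path */
    printf("peak L1<->L2 bandwidth: %.0f GB/s\n", l1_l2_path_B * clock_hz / 1e9);  /* 192 */
    return 0;
}
```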


High Bandwidth Memory is designed to be more tightly coupled to whatever is controlling it, and uses a 128-bit bus for each channel, with 8 channels. (for a total bandwidth of 128GB/s). HBM2 goes twice as fast, with the same width.
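
A sketch of where the 128 GB/s figure comes from (first-gen HBM transfers at 1 GT/s, i.e. 500 MHz double data rate):

```c
/* 8 channels * 128 bits = 1024 bits = 128 bytes moved per transfer. */
#include <stdio.h>

int main(void) {
    int    channels       = 8;
    int    bits_per_chan  = 128;
    double transfer_rate  = 1e9;  /* 1 GT/s for first-gen HBM */
    double bytes_per_xfer = channels * bits_per_chan / 8.0;
    printf("HBM1 peak: %.0f GB/s\n", bytes_per_xfer * transfer_rate / 1e9);  /* 128 */
    return 0;
}
```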

Instead of one 1024b bus, 8 channels of 128b is a tradeoff between having one extremely wide bus that's hard to keep in sync, vs. too much overhead from having each bit on a separate channel (like PCIe). Each bit on a separate channel is good if you need robust signals and connectors, but when you can control things better (e.g. when the memory isn't socketed), you can use wide fast busses.


Perhaps there could be two different data bus widths: one for standard cache line fetching, and one for external hardware (DMA) that works only with word-sized memory accesses.

This is already the case. DRAM controllers are integrated into the CPU, so communication from system devices like SATA controllers and network cards has to go from them to the CPU over one bus (PCIe), then to RAM (DDR3/DDR4).

The bridge from the CPU internal memory architecture to the rest of the system is called the System Agent (this basically replaces what used to be a separate Northbridge chip on the motherboard in systems without an integrated memory controller). The chipset Southbridge communicates with it over some of the PCIe lanes it provides.

(Image: Skylake system agent diagram, from IDF via Ars Technica)

On a multi-socket system, cache-coherency traffic and non-local memory access also has to happen between sockets. AMD may still use HyperTransport (a 64-bit bus). Intel hardware has an extra stop on the ring bus that connects the cores inside a Xeon, and this extra connection is where data for other sockets goes in or out. IDK the width of the physical bus.

Peter Cordes
  • "I guess that it helps to use a wider bus up to the width of what you can read from the physical DRAM chips in a single cycle, so you don't need as much buffering (lower latency)." - I agree with this point, so what can a DRAM chip read in one cycle? Only 64 Bits? – Mike76 Aug 28 '16 at 07:53
  • Well, a stick of memory has 8 or 16 DRAM chips. I *think* each DRAM chip might only be 1 bit, unless it has multiple arrays on the same chip that all use the same row and column address. (It's literally a matrix, and reads the data from the element where the row and column lines cross.) It's totally plausible that each DRAM chip might read out 8 bits in parallel, which would match perfectly with a 64-bit bus. – Peter Cordes Aug 28 '16 at 07:56
  • And since the CPU internal busses can have a much wider bandwidth without any problems, maybe the solution is to integrate CPU and DRAM more closely together – Mike76 Aug 28 '16 at 07:57
  • @Mike76: Well, that does let you clock it faster and with lower latency. You don't usually want to build DRAM on the same piece of silicon as a CPU. Apparently there are process tweaks that are good for a CPU but bad for DRAM, and vice versa. Intel does put 128MB or 256MB of eDRAM (embedded DRAM) inside the same package as their CPUs, in some models of Haswell / Broadwell / Skylake. It boosts integrated graphics performance, and also a few benchmarks. It's also called CrystalWell, and goes with Iris Pro graphics. – Peter Cordes Aug 28 '16 at 08:00
  • See http://arstechnica.com/information-technology/2015/08/the-many-tricks-intel-skylake-uses-to-go-faster-and-use-less-power/, and http://www.anandtech.com/show/9582/intel-skylake-mobile-desktop-launch-architecture-analysis/5 for more details. I doubt they use a really wide bus internally. The more controlled environment probably just means they can run the clock faster. It does have significantly lower latency. – Peter Cordes Aug 28 '16 at 08:00
  • @Mike76: see my last edit; I just remembered that HBM existed. I checked, and it does use multiple channels of very wide busses. – Peter Cordes Aug 28 '16 at 08:07
  • @Mike76: I checked the *everything you should know about memory* paper, and it does say that a DDR SDRAM memory module reads a bus width of data in parallel. It gets *really* technical about memory, but still aimed at programmers, not electrical engineers; there's a reason I linked it. – Peter Cordes Aug 28 '16 at 08:24
  • Thanks for the updates. Actually I would consider a data bus with 64-bit width a parallel connection and not a serial one. A real serial connection like PCI Express has only 4 data wires, two for sending and two for receiving. So it seems that almost everything works serially except for the data bus between RAM and CPU – Mike76 Aug 28 '16 at 09:16
  • @Mike76: Did I accidentally say a 64-bit bus was a serial bus anywhere? But yes, high-speed serial busses are very popular these days. I expect the inter-socket interconnects are also parallel. There was some time after memory controllers were integrated with CPUs that they still used parallel busses to talk to chipset northbridges, I think. (e.g. early Core2 days). I might be totally wrong. But anyway, yes I think now the CPU's connection to system devices is almost exclusively via PCIe. – Peter Cordes Aug 28 '16 at 09:22
  • You did not say that, but the interesting thing is that the usual argument for why serial buses are faster than parallel ones does not apply in this case. And there is hardly anything with more throughput than the CPU-RAM connection. If serial buses were faster in all cases, we would use a bus similar to PCIe between northbridge and DRAM. However, this is not the case, and it seems that the 64-bit parallel bus width is some kind of tradeoff – Mike76 Aug 28 '16 at 09:39
  • @Mike76: Yes, there's a tradeoff. My answer is in serious need of some editing; it's currently a jumble of facts. I think somewhere in there I said there's a limit to how fast you can run a bus. If you need more throughput than you can get from a serial bus in the low single-digit GHz range, with whatever encoding tricks you want to use, then you need to bite the bullet and accept the tight tolerances required for a parallel bus to work at such high speed. The wider it is, the tighter the tolerances need to be, so yes it's absolutely a tradeoff (also with trace layout / pin count). – Peter Cordes Aug 28 '16 at 09:55
  • Thank you for your help. You could refactor the answer, but it contains all the facts that I wanted to know, therefore I marked it as accepted – Mike76 Aug 29 '16 at 14:09

I think the limits are physical/cost trouble. In addition to the data lines (64), a channel has address lines (15+) and bank-select lines (3), plus other lines (CS, CAS, RAS...). For example, see the 6th Generation Intel® Core™ Processor Family Datasheet. In total that is roughly 90 lines for just one bus, and 180 for two. And there are still other lines (PCIe, display...). The next aspect is burst reading. With bank select we can select one of 8 banks. In burst mode, with a single address write applied to all banks, we read data from all the banks, one bank per tick.
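
A rough tally of that per-channel line count (the breakdown of the "other" control lines is an assumption; exact counts vary by DDR generation):

```c
/* Approximate signal count for one DDR channel, excluding power/ground. */
#include <stdio.h>

int main(void) {
    int data    = 64;  /* data lines */
    int address = 16;  /* "15+" address lines */
    int bank    = 3;   /* bank-select lines */
    int other   = 7;   /* CS, CAS, RAS, WE, CKE, clock, ... (assumed count) */
    int per_channel = data + address + bank + other;
    printf("one channel: ~%d lines, two channels: ~%d\n", per_channel, 2 * per_channel);
    return 0;
}
```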

  • So you think that the total number of lines required would be too expensive to implement? I would assume that the number of control lines would stay roughly the same when the data bus width gets increased to 512 bits. – Mike76 Aug 27 '16 at 17:16
  • Well yes. There is a parameter `delta_performance / delta_cost`. If we increase performance two times, we should pay twice as much or less; if we pay more than two times, then we don't need this system. Imagine we had a 512-bit-wide data bus where there used to be a 64-bit one. We would not increase memory bandwidth eight times, but less, because there are latencies. How many memory chips do you need now? Where do you place them, and how do you route the PCB? Cost increases. Also, there are sockets with 3 and 4 memory interfaces (e.g. Socket B2, R), so the trend is to increase the number of interfaces, not the data bus width. – Stephen Plyaskin Aug 27 '16 at 20:00
  • I am aware of the DRAM latencies, but this is the reason why a cache is needed in the first place. It takes longer to fetch data, so why not transfer more data at once? The same principle is used for disk I/O, where whole file system blocks are fetched into the page cache at once (possibly using a read-ahead strategy). The correlation of caches, required bus widths and latency is a standard concept in computer science. – Mike76 Aug 28 '16 at 07:19
  • But of course, if it is too expensive it won't make sense to implement – Mike76 Aug 28 '16 at 07:26
  • @Mike76: burst transfer size doesn't have to be correlated with bus width. Transfers to/from DRAM do happen in cache-line-sized bursts. (I put this comment into my answer). – Peter Cordes Aug 28 '16 at 09:27