I know this may sound like a silly question considering the speeds at which computers work, but say a certain address in RAM is physically closer to the CPU on the motherboard than a memory address located as far from the CPU as possible. Will this have an effect on how quickly the closer memory address can be accessed compared to the farthest one?
Will memory that is physically closer to the CPU perform faster than memory physically farther away?
- @KenWhite: Distance is a factor in modern computing performance; the distances between components are on a scale comparable to the distance light travels in one CPU cycle. So it is entirely sensible to wonder whether proximity could have an effect. Questions such as yours are off-putting; they disapprove of asking questions, which is a bad thing to teach students, or anybody. Stop dissuading people from wondering and inquiring; those lead to learning and to new discovery and invention. – Eric Postpischil Sep 24 '20 at 18:55
1 Answer
If you're talking about NUMA, where accessing RAM connected to the local socket is faster than going over the interconnect to access RAM connected to another socket, then yes, this is a well-known effect (example). Otherwise, no.
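If you want to measure that effect yourself, a pointer-chasing microbenchmark works well because every load depends on the previous one, so prefetching can't hide the latency. Below is a minimal sketch, assuming Linux with libnuma installed and at least two NUMA nodes; the node numbers and buffer size are illustrative, not a definitive benchmark:

```c
// Minimal sketch, assuming Linux + libnuma and >= 2 NUMA nodes.
// Build: gcc -O2 numa_chase.c -lnuma
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N ((size_t)(64 * 1024 * 1024) / sizeof(size_t))  // 64 MiB > typical L3

static volatile size_t sink;  // keeps the chase loop from being optimized out

static double chase_ns_per_load(int node) {
    size_t *buf = numa_alloc_onnode(N * sizeof(size_t), node);
    if (!buf) { perror("numa_alloc_onnode"); exit(1); }
    // Sattolo's algorithm: one random cycle through all N slots, so
    // following buf[idx] visits every element with no short cycles.
    for (size_t i = 0; i < N; i++) buf[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;              // j in [0, i)
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (size_t i = 0; i < N; i++) idx = buf[idx];  // serially dependent loads
    clock_gettime(CLOCK_MONOTONIC, &t1);
    sink = idx;
    numa_free(buf, N * sizeof(size_t));
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / N;
}

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }
    numa_run_on_node(0);  // pin execution to node 0, then vary where RAM lives
    printf("local  (node 0): %.1f ns/load\n", chase_ns_per_load(0));
    printf("remote (node 1): %.1f ns/load\n", chase_ns_per_load(1));
    return 0;
}
```

On a two-socket machine you'd expect the remote number to be noticeably worse; on a single-node desktop both calls measure the same memory.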
Also note that signal travel time over the external memory bus is only a tiny fraction of the total cache-miss latency cost for a CPU core. Queuing inside the CPU, time to check L3 cache, and the internal bus between cores and memory controllers all add up. Tightening DDR4 CAS latency by 1 whole memory cycle makes only a small (but measurable) difference to overall memory performance (see hardware review sites benchmarking memory overclocking); other timings matter even less.
No, DDR4 (and earlier) memory buses are synced to a clock and expect a response at a specific number of memory-clock cycles¹ after a command (so the controller can pipeline requests without causing overlap). See What Every Programmer Should Know About Memory? for more about DDR memory commands and memory timings (and CAS latency vs. other timings).
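To illustrate why a fixed cycle count enables that pipelining, here's a toy timeline of back-to-back reads. The CL and burst-length values are realistic for DDR4-3200, but the model is a deliberate simplification (it assumes reads to already-open rows and ignores tRCD/tRP/refresh):

```c
// Toy timeline: with a fixed CL, the controller knows exactly which clocks
// each burst will occupy, so it can issue a new READ every tCCD = 4 clocks
// with no data-bus overlap. Simplified sketch, not a full DDR4 model.
#include <stdio.h>

int main(void) {
    const int CL = 22;           // CAS latency in memory clocks (DDR4-3200-ish)
    const int burst_clocks = 4;  // BL8 burst = 4 clocks at 2 transfers/clock
    for (int r = 0; r < 3; r++) {
        int cmd = r * burst_clocks;
        printf("READ #%d issued at clock %2d -> data on bus clocks %2d..%2d\n",
               r, cmd, cmd + CL, cmd + CL + burst_clocks - 1);
    }
    return 0;
}
```

The bursts land on clocks 22..25, 26..29, 30..33: back to back, with the data bus fully utilized and no conflicts.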
(Wikipedia's introduction to SDRAM mentions that earlier DRAM standards were asynchronous, so yes, they perhaps could just reply as soon as they had the data ready. If that happened to be a whole clock cycle early, a speedup was perhaps possible.)
So memory latency is discrete, not continuous, and being 1 mm closer can't make it fractions of a nanosecond faster. The only plausible effect is if you socket all the memory into DIMM slots in a way that enables you to run tighter timings and/or a faster memory clock than with some other arrangement. Go read about memory overclocking if you want real-world experience with people who try to push systems to the limits of stability. What's best may depend on the motherboard; physical length of traces isn't the only consideration.
AFAIK, all real-world motherboard firmware insists on using the same timings for all DIMMs on all memory channels².
So even if one DIMM could theoretically support tighter timings than another (e.g. because of shorter or less noisy traces, or less signal reflection because it's at the end of a trace instead of the middle), you couldn't actually configure a system to take advantage of that. Physical proximity isn't the only thing that could help.
(This is probably a good thing; interleaving physical address space across multiple DRAM channels allows sequential reads/writes to benefit from the aggregate bandwidth of all channels. But if they ran at different speeds, you might have more contention for shared busses between controllers and cores, and more time left unused.)
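As a toy model of that interleaving, suppose consecutive 64-byte cache lines alternate between two channels, so a sequential stream keeps both channels' buses busy at once. (This bit-6 mapping is a simplified assumption for illustration; real controllers often hash several address bits, and the exact mapping is implementation-specific.)

```c
// Toy channel-interleave model: bit 6 of the physical address selects the
// channel, so consecutive 64 B cache lines alternate between channels.
// Simplified assumption, not any specific memory controller's scheme.
#include <stdint.h>
#include <stdio.h>

static int channel_of(uint64_t paddr) {
    return (int)((paddr >> 6) & 1);
}

int main(void) {
    for (uint64_t a = 0; a < 4 * 64; a += 64)
        printf("paddr 0x%03llx -> channel %d\n",
               (unsigned long long)a, channel_of(a));
    return 0;
}
```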
Memory frequency and timings are usually chosen by the firmware after reading the SPD ROM on each DIMM (memory module) to find out what memory is installed and what timings each DIMM is rated for at what frequencies.
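For example, the SPD rates the DIMM in absolute time (such as tAAmin, the minimum CAS-latency time), and the firmware picks the smallest whole-cycle CL that satisfies it at the chosen clock. A minimal sketch with illustrative numbers (tAAmin = 13.75 ns is typical of JEDEC DDR4-3200; real firmware also checks the SPD's bitmap of CL values the DIMM actually supports):

```c
// Sketch of deriving CL from SPD-rated absolute times. Build with -lm.
#include <math.h>
#include <stdio.h>

int main(void) {
    double tAA_min_ns = 13.75;  // minimum CAS-latency time rated in the SPD
    double tCK_ns     = 0.625;  // chosen memory-clock period (DDR4-3200)
    // Smallest whole-cycle CL whose real-time latency still meets tAAmin:
    int cl = (int)ceil(tAA_min_ns / tCK_ns);
    printf("CL = %d cycles -> actual tAA = %.3f ns\n", cl, cl * tCK_ns);
    return 0;
}
```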
Footnote 1: I'm not sure how transmission-line propagation delays over memory traces are accounted for when the memory controller and DIMM agree on how many cycles there should be after a read command before the DIMM starts putting data on the bus.
The CAS latency is a timing number that the memory controller programs into the "mode register" of each DIMM.
Presumably the number the DIMM sees is the actual number it uses, and the memory controller has to account for the round-trip propagation delay to know when to really expect a read burst to start arriving. Other command latencies are just times between sending different commands so propagation delay doesn't matter: the gap at the sending side equals the gap at the receiving side.
But the CAS latency seen by the memory controller includes the round-trip propagation delay for signals to go over the wires to the DIMM and back. Modern systems with DDR4-4000 have a clock that runs at 2GHz, a cycle time of half a nanosecond (transferring data on both the rising and falling edges).
At light speed, 0.5ns is "only" about 15 cm, half of one of Grace Hopper's nanoseconds, and with transmission-line effects the distance a signal covers in that time could be somewhat shorter (maybe 2/3 of that). On a big server motherboard it's certainly plausible that some DIMMs are far enough away from the CPU for traces to be that long.
The rated speeds on memory DIMMs are somewhat conservative, so they're still supposed to work at that speed even when placed as far from the controller as the DDR4 standards allow. I don't know the details, but I assume JEDEC considers this when developing DDR SDRAM standards.
If there's a "data valid" pin the DIMM asserts at the start of the read burst, that would solve the problem, but I haven't seen a mention of that on Wikipedia.
Timings are those numbers like 9-9-9-24, with the first one being CAS latency, CL. https://www.hardwaresecrets.com/understanding-ram-timings/ was an early google hit if you want to read more from a perf-tuning PoV. They're also described in Ulrich Drepper's "What Every Programmer Should Know About Memory", linked earlier, from a how-it-works PoV. Note that the higher the memory clock speed, the less real time (in nanoseconds) a given number of cycles is. So CAS latency and other timings have stayed nearly constant in nanoseconds, or have even dropped, as clock frequencies have increased. https://www.crucial.com/articles/about-memory/difference-between-speed-and-latency shows a table.
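To make that concrete: real-time CAS latency is CL cycles times the memory-clock period, and the clock runs at half the MT/s data rate. The speed grades below are just illustrative examples of the trend:

```c
// Worked numbers for "timings stay roughly constant in nanoseconds".
#include <stdio.h>

int main(void) {
    struct { const char *name; int mts; int cl; } d[] = {
        { "DDR3-1600 CL9 ", 1600,  9 },
        { "DDR4-2400 CL17", 2400, 17 },
        { "DDR4-3200 CL22", 3200, 22 },
    };
    for (int i = 0; i < 3; i++) {
        double tCK_ns = 2000.0 / d[i].mts;  // 2 transfers per clock cycle
        printf("%s: tCK = %.3f ns, CAS latency = %.2f ns\n",
               d[i].name, tCK_ns, d[i].cl * tCK_ns);
    }
    return 0;
}
```

That prints 11.25 ns, 14.17 ns, and 13.75 ns respectively: wildly different cycle counts, but nearly the same real time.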
Footnote 2: Unless we're talking about special faster memory for use as a scratchpad or cache for the larger main memory, but still off-chip. e.g. the 16GB of MCDRAM on Xeon Phi cards, separate from the 384 GB of regular DDR4. But faster memories are usually soldered down so timings are fixed, not socketed DIMMs. So I think it's fair to say that all DIMMs in a system will run with the same timings.
Other random notes:
https://www.overclock.net/threads/ram-4x-sr-or-2x-dr-for-ryzen-3000.1729606/ contained some discussion of motherboards with a "T-topology" vs. "daisy chain" layout for their DIMM sockets. This seems like pretty self-explanatory terminology: a "T" would be when the 2 DIMMs on a channel are on opposite sides of the CPU, about equidistant from the pins, vs. "daisy chain" when both DIMMs for the same channel are on the same side of the CPU, one farther away than the other.
I'm not sure what the recommended practice is for using the closer or farther socket. Signal reflection could be more of a concern with the near socket because it's not the end of the trace.
If you have multiple DIMMs on the same memory channel, the DDR4 protocol may require them all to run at the same timings. (Such DIMMs see each other's commands; there's a "chip-select" pin that the memory controller can drive independently for each DIMM to control which one a command is for.)
But in theory a CPU could be designed to run its different memory channels at different frequencies, or at least different timings at the same frequency if the memory controllers all share a clock. And of course in a multi-socket system, you'd expect no physical / electrical obstacle to programming different timings for the different sockets.
(I haven't played around in the BIOS on a multi-socket system for years, not since I was a cluster sysadmin in AMD K8 / K10 days.) So IDK; it's possible that some BIOS might have options to control different timings for different sockets, or simply allow different auto-detect results if you use slower RAM in one socket than in the others. But given the price of servers and how few people run them as hobby machines, it's unlikely that vendors would bother to support or validate such a config.

- There have been attempts at asynchronous processors, where things are not gated by a master clock but fire when their inputs are complete (or may compute continuously and pass along a signal when their inputs are complete, or whatever). I do not know of any in commercial use, but I vaguely recall seeing something about that recently, "recently" being within a decade or two. – Eric Postpischil Sep 24 '20 at 02:32
- [Wikipedia on asynchronous circuits.](https://en.wikipedia.org/wiki/Asynchronous_circuit) – Eric Postpischil Sep 24 '20 at 02:38
- @EricPostpischil: Interesting; that provides a nice contrast with the point in my answer about synchronous logic. Not sure async would be viable for DRAM at all, though; at least periodic refresh is necessary, and we definitely want to pipeline burst transfers over the bus (unless everything was designed very differently...). So there's a reason why many parts of computers still use synchronous logic. – Peter Cordes Sep 24 '20 at 02:47
- The on-chip network distance can have a minor effect on latency. Typically, one network hop is one cycle of latency (usually less than half a nanosecond); with latency greater than 50ns, a hop is less than 1% of the latency. Queuing delays in the memory controller and on-chip network complicate latency further. (Checking caches will generally add more delay than the on-chip network.) Reality is usually more complex than simple models, but in this case the simple model seems rather accurate. – Paul A. Clayton Sep 24 '20 at 16:44
- @PaulA.Clayton: Oh certainly, it would be pretty negligible as part of the overall miss latency. I didn't mention any of that because it stays constant, but maybe I should have added a note for beginner audiences. But I thought it was fun to explore the possibility of physical proximity allowing you to tighten one or more of the DDR4 timings (CL, tRC, etc.) by one or 1/2 memory clock cycle over the external DDR4 bus. Those timings, especially CL, do have a marginal but measurable effect on performance, as shown by benchmarks on hardware review sites when overclocking memory. – Peter Cordes Sep 24 '20 at 17:15
- About "I'm not 100% sure if the timings are from the POV of the memory controller or from the DRAM": I'm not sure it makes sense to say it is one or the other. The timings are "relative" to the clock as seen by each component, and define relative delays: i.e., do this, then do that X clocks later. In that scenario, I don't think it makes sense to talk about the timings being from the point of view of only one party or the other. – BeeOnRope Sep 24 '20 at 18:22
- Even if the signal propagation were slow enough that in some kind of "absolute time" (itself a tricky concept when considering relativity) the signal could be delayed by more than a whole clock period, both sides would still be fine interpreting the timings as the relative delays between various signals. Of course, this means that when a signal "turns around" (e.g., who is driving vs. reading a particular line), you'll need enough clocks of delay that the propagation time is accounted for, but I don't think it means timing is from one PoV or the other. – BeeOnRope Sep 24 '20 at 18:24
- @BeeOnRope: So the traces have to be short enough that the propagation delay is less than half a clock? (DDR transfers on rising and falling edge.) If the DRAM starts transferring say 15 cycles after the command arrives, but the round-trip signal-propagation latency added up to a cycle, wouldn't the controller see the burst arriving 16 cycles after it sent the read command? That can be fine, but presumably the memory controller has to account for that extra latency somehow when it tells the DIMM how soon to respond to commands. – Peter Cordes Sep 24 '20 at 18:25
- @BeeOnRope: (I've been assuming that the controller sends a config setting to the DIMM somehow that tells it what CAS latency it expects, so the memory controller can schedule / pipeline transfers. If that's not the case, then only the controller knows a cycle count and the DIMM just starts spewing data on the first clock when it's able to, or after some other pin signal? I should re-read the DDR wiki page...) – Peter Cordes Sep 24 '20 at 18:29
- No, I think my point was that I don't think the traces have to be that short: some type of relative delay between the devices would be fine. However, maybe that's nonsense. In any case, if you assume there is no more than a cycle of skew, the comments hold? The timings are relative, i.e., the delay between two events, and this delay will be seen the same way for two events originating from the same component. For delays that involve a turnaround, e.g., CAS latency, the delay has to be long enough to accommodate the relative skew between the components. – BeeOnRope Sep 24 '20 at 18:55
- Note that I am really beyond the limit of my knowledge here and I could be spouting nonsense. On a semi-related note, signal propagation over small traces can be _much_ slower than the speed of light, 100s of times slower for CPU traces, e.g., as mentioned [here](https://twitter.com/victorxstewart/status/1293719293365047298/photo/1). So speed of propagation is a real concern: on a CPU it takes a few cycles just to get across the die and back. – BeeOnRope Sep 24 '20 at 18:57
- "... sends a config setting to the DIMM somehow that tells it what CAS latency it expects": I thought RAM was dumber, that the timings are mostly all in the controller and the controller just has to stay within the limits of the RAM (the RAM just responds to inputs, but it can't do so correctly if the inputs are too fast), but I guess this is wrong: it does seem like CAS does have to be programmed into the RAM, as well as some others like command rate. Various other timings may be controller-only. – BeeOnRope Sep 24 '20 at 18:58
- @BeeOnRope: [Wikipedia says](https://en.wikipedia.org/wiki/Synchronous_dynamic_random-access_memory#Timing) "*In operation, CAS latency is a specific number of clock cycles **programmed into the SDRAM's mode register** and expected by the DRAM controller.*" My earlier comment was a guess (because I'm similarly out of my depth and making stuff up), but it seems in this case I guessed right :P Many of the other timings are just how long the controller must wait between commands like open a bank and read from that bank, and yes, propagation delay doesn't matter there. – Peter Cordes Sep 24 '20 at 19:02
- @BeeOnRope: Thanks for that link about on-chip traces being so slow. A coax BNC cable's transmission speed is something like 2/3 c. A random google hit (https://electronic-products-design.com/geek-area/electronics/pcb-design/high-speed/propagation-delay-and-pcb-layout) says a mobo trace might be 150mm per ns. But on-chip traces (metal layers, I assume they mean?) are much smaller diameter, and packed close to other conductors. So it makes sense that transmission-line effects are a huge deal there, with higher capacitance to ground / stuff, and resistance. (IDK about inductance.) – Peter Cordes Sep 24 '20 at 19:09
- @BeeOnRope: https://practicalee.com/transmission-lines/ points out that if you ignore series resistance, the propagation delay depends only on the dielectric coefficient of the insulator (between the conductors in a coax cable, for example), because the propagation velocity is 1/sqrt(LC) and the L·C product per unit length depends only on the dielectric. And it shows how to model a transmission line as a ladder of L/C components. But it simplifies to a lossless model, where I think RC delays in on-chip traces probably *are* significant. The tiny diameter (and high resistance) of metal in modern CPUs is a known problem as transistors shrink. So anyway, mobo traces are much faster than 4mm/ns. – Peter Cordes Sep 24 '20 at 19:20