
I read somewhere that to find the real latency of the RAM you can use the following rule:

1/((RAMspeed/2)/1000) x CL = True Latency in nanoseconds

I.e. for DDR1 with a 400MHz clock speed, it seems logical to me to divide by 2 to get the FSB speed, or the real bus speed, which is 200MHz in this case. So the rule above seems to be correct for DDR1.

On the other hand, DDR2 also doubles the frequency of the bus relative to the previous DDR1 generation (i.e. 4 bits per clock cycle), according to the article "What Every Programmer Should Know About Memory".

So, in the case of DDR2 with an 800MHz clock speed, to find the "True Latency" the above rule should accordingly be changed to

1/((RAMspeed/4)/1000) x CL = True Latency in nanoseconds

Is that correct? Because in every case, I've read that the correct way is to take RAMspeed/2, no matter whether it's DDR, DDR2, DDR3 or DDR4.

Which is the correct way to get the true latency?

Robert Houghton
Maverick
  • This is better addressed to [Quora](http://quora.com). – tadman Mar 22 '18 at 19:03
  • 2
    The question is inspired from the article "What Every Programmer Should Know About Memory" and surely it concerns programmers – Maverick Mar 22 '18 at 19:17
  • Be that as it may, it's not directly related to code you're writing so it's off topic. DDR memory is no longer as simple as it was back in the 1990s. DDR4 and DDR5 combined with NUMA make for a very complicated formula. The way you get the true latency is to benchmark specific hardware. – tadman Mar 22 '18 at 19:34
  • @tadman I am not interested in benchmarks; what I am really asking is whether the general formula for "true latency" I've read on different websites is wrong or just misinformation according to the article I mentioned, or whether it is something I did not understand when reading the article. – Maverick Mar 22 '18 at 19:45
  • 1
    There's no general formula. Memory is far more complicated than a simple article can tackle. There's L1, L2, L3 caches, there's [NUMA](https://en.wikipedia.org/wiki/Non-uniform_memory_access) implications on certain kinds of multi-core systems, there's multi-channel memory, etc. In the old days we had a single CPU and it was wired directly to system memory at the core clock-speed, 1:1, but those days are long gone. – tadman Mar 22 '18 at 20:00
  • @tadman Come on man, you're kidding me!! L1, L2, L3 are inside the CPU, and multicore CPUs or multi-channel memory are all off topic. I am not interested in this, nor in NUMA implications for memory management, or parallelism in general. My question is specific and it concerns ONLY dynamic main memory and the time it needs to prepare the data for the CPU. – Maverick Mar 22 '18 at 20:09
  • 1
    I'm not kidding you. There is no simple formula. Single core machines don't really exist any more, and in multi-core machines, especially those with four or more, the memory architecture is surprisingly complicated. It also differs substantially from Intel to AMD to ARM. Latency and clock speed are only loosely related. DDR3 memory that's "slower" than DDR4 memory actually has faster first byte response times, but lower bandwidth. – tadman Mar 22 '18 at 20:50
  • 1
    DDR SDRAM latency is more variable than your formula indicates. You get lower latency for another access within an already-open DRAM page (not the same thing or the same size as a 4k virtual-memory page), so locality of accesses can matter even in the range of 16kiB or so, not just within the same cache line or the same 4k page (TLB entry). (And as I commented on the answer, this is only talking about latency between memory controller and DRAM, ignoring latency between an execution core and memory controller inside the CPU, or especially between sockets. It's non-negligible. – Peter Cordes Mar 23 '18 at 02:32
  • @Peter This is not my formula, as I clearly stated. It's an approximation of the estimated latency times read in some articles. My question was whether the formula is totally wrong because of the DDR clock speed increments, and it finally seems that the formula was not totally wrong at all. Of course it is an approximation and not an accurate calculation of the latency times. – Maverick Mar 23 '18 at 09:08

2 Answers

3

The CAS latency is in memory-bus clock cycles. This is always one half the transfers-per-second number. e.g. DDR3-1600 has a memory clock of 800MHz, doing 1600M transfers per second (during a burst transfer).

DDR2, DDR3, and DDR4 still use a double-pumped 64-bit memory bus (transferring data on the rising and falling edges of the clock signal), not quad-pumped. This is why they're still called Double Data-Rate (DDR) SDRAM.


The FSB speed has nothing to do with it.

On old CPUs without integrated memory controllers, i.e. systems that actually have an FSB, its frequency is often configurable (in the BIOS) separately from the memory speed. See Front Side Bus and RAM speed; on even older systems, the FSB and memory clocks were synchronous.

Normally systems were designed with a fast enough FSB to keep up with the memory controller. Running the FSB at the same clock speed as the memory can reduce latency by avoiding buffering between clock domains.


So yes, the CAS latency in seconds is cycle_count / frequency, or in the style of your formula:

True Latency (ns) = 1000 × CL / (RAMspeed / 2), where RAMspeed is in mega-transfers per second.
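To make that concrete, here is a small Python sketch of the calculation (the module names and CL values below are typical illustrative numbers, not a claim about any specific DIMM):

```python
def cas_latency_ns(transfer_rate_mt, cl):
    """CAS latency in nanoseconds.

    transfer_rate_mt: transfer rate in mega-transfers per second
                      (the number in the module name, e.g. 1600 for DDR3-1600).
    cl: CAS latency in memory-clock cycles.
    """
    # CL cycles at the memory clock, which is transfer_rate/2 MHz for DDR,
    # converted from microseconds to nanoseconds:
    return cl * 2 * 1000 / transfer_rate_mt

# Very different CL numbers, similar absolute latency:
print(cas_latency_ns(1600, 9))   # DDR3-1600 CL9  -> 11.25 ns
print(cas_latency_ns(3200, 16))  # DDR4-3200 CL16 -> 10.0 ns
```

This also illustrates the point below: the raw CL number grows with frequency, but the absolute time stays in the same ballpark.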

Higher CL numbers at a higher memory frequency often work out to a similar absolute latency (in seconds). In other words, modern RAM has higher CAS latency timing numbers because more clock cycles happen in the same amount of time.

Bandwidth has vastly improved, while latency has stayed nearly constant, according to these graphs from Crucial which explain CL vs. frequency.


Of course this is not "the memory latency", or the "true" memory latency.

It's the CAS latency of the DRAM itself, and is the most important factor in latency between the memory controller and the DRAM, but is only a part of the latency between a CPU core and memory. There is non-negligible latency inside the CPU between the core and uncore (L3 and memory controller). Uncore is Intel terminology; IDK what AMD calls the parts of the memory hierarchy in their various microarchitectures.

Especially many-core Xeon CPUs have significant latency to L3 / memory controller, because of the large ring bus(es) connecting all the cores. A many-core Xeon has worse L3 and memory latency than a similar dual or quad-core with the same memory and CPU clock frequencies.

This extra latency actually limits single-thread / single-core bandwidth on a big Xeon to worse than a laptop CPU, because a single core can't keep enough requests in flight to fill the memory pipeline with that much latency. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

Peter Cordes
  • I wanted to refresh my knowledge after buying new RAM and found this answer. It was very helpful. Thank you – Dennis Kassel Jun 08 '21 at 13:08
  • @DennisKassel: See also [What Every Programmer Should Know About Memory?](https://stackoverflow.com/q/8126311) for more info, starting from basics of DRAM and the bus used by (DDR) SDRAM, and then covering details of cache and its effect on performance. – Peter Cordes Jun 08 '21 at 13:10
  • I have already downloaded the document, but the original one is too long for me. But I also found a condensed version of it and will read it in the near future. – Dennis Kassel Jun 11 '21 at 19:48
0

OK, I found the answer.

Every time the manufacturers increased the memory clock speed, they did it at a constant ratio, which was always double (2x) the FSB clock speed, i.e.

MEM CLK      FSB
-------------------
DDR200      100 MHz    
DDR266      133 MHz    
DDR333      166 MHz
DDR400      200 MHz
DDR2-400    200 MHz
DDR2-533    266 MHz
DDR2-667    333 MHz
DDR2-800    400 MHz
DDR2-1066   533 MHz
DDR3-800    400 MHz
DDR3-1066   533 MHz
DDR3-1333   666 MHz
DDR3-1600   800 MHz

So, the memory module always runs at double the speed of the FSB.
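As a quick sanity check of the table above (assuming the number in the module name is the transfer rate in MT/s), the bus clock is just that number halved, because DDR transfers data on both clock edges:

```python
def bus_clock_mhz(transfer_rate_mt):
    # DDR moves data on both the rising and falling clock edge,
    # so the clock frequency is half the transfer rate.
    return transfer_rate_mt / 2

for name, mt in [("DDR400", 400), ("DDR2-800", 800), ("DDR3-1600", 1600)]:
    print(name, "->", bus_clock_mhz(mt), "MHz")  # 200.0, 400.0, 800.0 MHz
```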

Maverick
  • Again, this isn't a programming question, it's off-topic here. – tadman Mar 22 '18 at 20:51
  • CPUs with integrated memory controllers don't *have* a front-side bus (connecting CPU to northbridge). There is no northbridge anymore; it's integrated into the CPU. Anyway, this answer appears to be simply stating that the memory clock speed matches the FSB, and thus the memory transfers per second is twice the frequency, transferring data on both the up and down edges of the clock signal. **This is what [Double Data Rate (DDR)](https://en.wikipedia.org/wiki/DDR_SDRAM) literally means**. – Peter Cordes Mar 23 '18 at 02:24
  • And at best this might tell you something about the latency between the memory controller and the DRAM, but there's still an unknown amount of latency between an execution core and the memory controller, especially in a multi-core CPU where the logic outside each core has to arbitrate access to L3 / memory. For example, many-core Xeon CPUs have worse latency to DRAM *or even to L3* than a quad-core CPU of the same microarchitecture, even when using identical RAM chips. https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory – Peter Cordes Mar 23 '18 at 02:28
  • @Peter I already know that the connections between RAM and CPU (and of course other fast devices) are not simple anymore, and some designs implement the RAM controllers inside the CPU (or the parallel design of the original FSB is becoming serial). But for simplicity, I took an academic example just to check the theory in the original article "What Every Programmer Should Know About Memory" against practice. And it seems, for the sake of simplicity, that the formula above is not wrong at all. – Maverick Mar 23 '18 at 09:14
  • 1
    I reread the question to see what you were *actually* asking, instead of just stopping after seeing nonsense like "true latency". It's not the FSB that matters, it's the memory clock, and yes it's always half the transfer rate, like I said. Nobody had to choose to increase two separate things in step with each other. So this answer doesn't explain why the formula is right, and isn't even correct for systems that support async memory clocks. – Peter Cordes Mar 23 '18 at 09:49