My question is this: when you buy new RAM for your computer, you see something like CL17 in its specifications. I know that CL is the same as CAS latency. I've read in some posts that CAS is the number of RAM clock cycles it takes the RAM to output data requested by the CPU, but I've also read that the RAS-to-CAS delay has to be added to CAS to get the total number of RAM clock cycles it takes the RAM to output the requested data.

So, is it correct to say that, in my example, the CPU will wait 17 RAM clock cycles from when it requests the data until the first data bytes arrive? Or do we have to add the RAS-to-CAS delay? And if we do have to add it, how can I know how many cycles the RAS-to-CAS delay is if the RAM vendor only tells me "CL17"?

Edit: Suppose that when I talk about the 17 cycles I'm referring to "17 RAM cycles between an L3 miss and the reception of the first bytes of the requested data".

isma
  • No, it's not that simple. Also keep in mind that a load has to miss in L3 cache before it even tries accessing memory, and latency within the CPU core + uncore is significant. See [What Every Programmer Should Know About Memory?](https://stackoverflow.com/q/8126311) for some details about DDR SDRAM which AFAIK still apply to modern DDR4. – Peter Cordes Jun 15 '20 at 15:48
  • I will edit my post. I wanted to say CL17 -> 17 RAM cycles between missing L3 and receiving the first bytes of data – isma Jun 15 '20 at 16:02
  • Assuming Sandybridge-family, there's still some uncore latency (over the ring bus) between the L3 slice and one of the memory controllers. Also, the memory controller has a queue. (L3 is distributed with a slice next to each core for each stop on the ring bus. A load request first goes to the slice that contains the set indexed by that address.) You could say "from the memory controller initiating a request for that cache line over the DDR memory bus." – Peter Cordes Jun 15 '20 at 16:06

1 Answer

So, is it correct to say that, in my example, the CPU will wait 17 RAM clock cycles from when it requests the data until the first data bytes arrive? Or do we have to add the RAS-to-CAS delay? And if we do have to add it, how can I know how many cycles the RAS-to-CAS delay is if the RAM vendor only tells me "CL17"?

No. This delay is only a small part of the total delay between a core requesting some memory and the line returning to the core.

In particular, the request must make its way from the core, through checks of the L1, L2 and L3 caches, to the memory controller before the DRAM (and timings like CAS) even becomes involved. After the read occurs, the data has to travel all the way back. This round trip usually accounts for much more of the total latency of a RAM access than the DRAM access itself.

John D McCalpin has an excellent blog post about the memory latency components on an x86 system. On that system the CAS delay of ~11 ns makes up only a bit more than 20% of the total latency of ~50 ns.
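To make that proportion concrete, here is a minimal sketch of the arithmetic. The two figures are the approximate ones quoted above (~11 ns CAS, ~50 ns total); the function and variable names are purely illustrative.

```python
def latency_share(part_ns: float, total_ns: float) -> float:
    """Return the fraction of the total latency taken by one component."""
    if total_ns <= 0:
        raise ValueError("total_ns must be positive")
    return part_ns / total_ns


# Approximate figures quoted from the blog post referenced above.
CAS_NS = 11.0
TOTAL_NS = 50.0

share = latency_share(CAS_NS, TOTAL_NS)
rest_ns = TOTAL_NS - CAS_NS  # time spent outside the CAS delay

print(f"CAS share of total latency: {share:.0%}")
print(f"Everything else: {rest_ns:.0f} ns")
```

The remainder is the part spent in the caches, the uncore, the memory controller and the trip back to the core.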

John also points out in a comment that on some multi-socket systems the memory latency may not even matter, because snooping the other sockets in the system takes longer than the response from memory.

As for RAS-to-CAS vs. CAS alone, it depends on the access pattern. The RAS-to-CAS delay is only incurred if the row wasn't already open: in that case the row must be opened first, and the RAS-to-CAS delay is added to the CAS delay. If the row is already open, only the CAS delay is required. Which case applies depends on your physical address access pattern, the RAM configuration, and how the memory controller maps physical addresses to RAM addresses.
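As a rough illustration of the two cases, the sketch below counts DRAM clock cycles for a read to an already-open row and for a read that first has to open the row, then converts them to nanoseconds. CL17 comes from the question; the RAS-to-CAS value (`TRCD = 17`) and the DDR4-2400 data rate are assumptions chosen only for the example, and the function names are hypothetical. It models only the DRAM-side timing, not the cache, uncore or memory controller latency discussed above.

```python
def dram_read_cycles(cl: int, trcd: int, row_open: bool) -> int:
    """DRAM clock cycles from the read being issued to the first data.

    If the row is already open only the CAS latency applies;
    otherwise the RAS-to-CAS delay is paid first to open the row.
    """
    return cl if row_open else trcd + cl


def cycles_to_ns(cycles: int, data_rate_mts: float) -> float:
    """Convert DRAM clock cycles to nanoseconds.

    DDR memory transfers data twice per clock, so the clock
    frequency in MHz is half the data rate in MT/s.
    """
    clock_mhz = data_rate_mts / 2
    return cycles * 1000.0 / clock_mhz


CL = 17           # from the question ("CL17")
TRCD = 17         # ASSUMPTION: RAS-to-CAS delay, not given in the question
DATA_RATE = 2400  # ASSUMPTION: DDR4-2400, i.e. a 1200 MHz clock

hit_cycles = dram_read_cycles(CL, TRCD, row_open=True)
miss_cycles = dram_read_cycles(CL, TRCD, row_open=False)

print(f"row open:     {hit_cycles} cycles, {cycles_to_ns(hit_cycles, DATA_RATE):.1f} ns")
print(f"row not open: {miss_cycles} cycles, {cycles_to_ns(miss_cycles, DATA_RATE):.1f} ns")
```

Even in the slower case, the DRAM-side figure is still only one component of the end-to-end load latency.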

BeeOnRope
  • In systems with more than one shared last-level cache (e.g., 2-socket systems) and without memory directories, the data can't be used until the snoop response from the other socket(s) is received. In most cases that snoop response comes well after the data arrives from DRAM, and in many cases it comes after the DRAM response even if the DRAM page was not open. In these cases the latency does not appear to be dependent on page hits or misses. By judicious fiddling with frequencies and access patterns, it is often possible to set up cases in which the RAS adder is measurable. – John D McCalpin Jun 18 '20 at 17:09
  • @JohnDMcCalpin - thanks for the comment. I added a paragraph noting that memory latency may not matter on "some" multi-socket systems (readers can see your comment for the detail about memory directories, etc). About your last sentence, is it related to the multi-socket case? I would assume that if you were in the "snoop latency dominates" case, it would be hard to measure anything about memory latency, but perhaps that was an unrelated observation. Do you consider my last paragraph correct? – BeeOnRope Jun 18 '20 at 22:58
  • The usual way to see the extra details of DRAM latency is to reconfigure the system to a single socket. In SKX this is not needed because the memory holds a directory bit that (usually) says that you don't need to wait for the snoop from the other socket. Your last comment is important because consecutive accesses may interleave (or block interleave) around the available channels before coming back to the original rank/bank/row. These mappings are not always well-documented. Frequency changes help tease out how many cycles of the latency are spent in each frequency domain. – John D McCalpin Jun 19 '20 at 22:58