Memory hierarchy latency information

Question

In the "Example" section of this post, the author lists the latencies of all the memory components (register/L1/L2/RAM...). My question is: how do you measure, or find online, what the real latencies are for a given chip? Let's say:

model name  : Intel(R) Core(TM)2 Duo CPU     E4600  @ 2.40GHz
stepping    : 13
cpu MHz     : 1200.000

I've also tried digging the information out of the Intel manuals, but those things are huge and, for the life of me, I wouldn't know where to look.

Thanks.

CPU latency is a little tricky - instructions can be pipelined so it is somewhat difficult to measure - gmplib has a paper that briefly goes over it. http://gmplib.org/~tege/x86-timing.pdf In short, register latency isn't a simple thing - you can 'execute' an instruction, and it may appear to be done, but until you have retrieved the result it might not actually be done executing. A smart compiler tries very hard to order the instructions to take advantage of this. (Register latency is dependent on the current CPU state / previously executed instructions) — rsaxvc, Jan 05 '12 at 01:26

Eldar Abusalimov · Accepted Answer · 2016-01-01T23:15:56.290

A simple Google query ("intel cpu cache latency") turns up an interesting study by Intel: Measuring Cache and Memory Latency and CPU to Memory Bandwidth. In this paper the authors use LMbench to perform the measurements.

How to take Measurements

Use the executable binary file called “lat_mem_rd” found in the “bin” folder of the utility’s directory. Next, use the following command line:
taskset 0x1 ./lat_mem_rd -N [x] -P [y] [depth] [stride]
Where [x] equals the number of times the process is run before reporting latency. Typically setting this to ‘1’ is sufficient for accurate measurements. For the ‘-P’ option, [y] equals the number of processes invoked to run the benchmark. The recommendation for this is always ‘1.’ It is sufficient to measure the access latency with only one processing core or thread. The [depth] specification indicates how far into memory the utility will measure. In order to ensure an accurate measurement, specify an amount that will go far enough beyond the cache so that it does not factor in latency measurements.

Understanding the Results

Since L1 and L2 cache latency ties to the core clock, CPU frequency plays a role in how fast memory accesses happen in real time. This means the number of core clocks stays the same independent of the core frequency. For a comparable result, it is best to convert the latency given by LMBench from nanoseconds into CPU clocks. To do this, multiply the latency by the processor frequency.
Time(seconds) * Frequency(Hz) = Clocks of latency
Therefore, if a 2.4 GHz processor takes 17 ns to access a certain level of cache, this converts to:
17 x 10^-9 seconds * 2400000000 Hz = 17 ns * 2.4 GHz = 40.8 ≈ 41 Clocks
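
As a minimal sketch of the conversion above, the arithmetic could be wrapped as follows. The function names are my own and do not come from the Intel paper or from LMbench.

```c
/* Convert a latency in nanoseconds to core clock cycles.
 * freq_hz must be the core frequency the benchmark actually ran at. */
double ns_to_clocks(double latency_ns, double freq_hz)
{
    return latency_ns * 1e-9 * freq_hz;
}

/* Inverse conversion: core clock cycles back to nanoseconds. */
double clocks_to_ns(double clocks, double freq_hz)
{
    return clocks / freq_hz * 1e9;
}
```

For example, ns_to_clocks(17.0, 2.4e9) reproduces the 40.8 clocks from the quoted calculation. Note that the cpuinfo output in the question reports 1200 MHz on a 2.40 GHz part, so the core may have been clocked down when it was read; the same 17 ns would then be only about 20 clocks.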

score 2 · Answer 2 · answered Mar 09 '12 at 02:15

A quick solution that you can hack to fit your needs: http://code.google.com/p/mem-latency/

It measures latency by loading linked lists of varying sizes.
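
I have not checked how that project is implemented, but the general linked-list technique can be sketched as below. Everything here is an assumption for illustration: the name chase_latency_ns, the 64-byte cache line, and the use of POSIX clock_gettime. Each load depends on the previous one, and the nodes are linked in a random single cycle so that simple prefetchers cannot guess the next address.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define LINE_BYTES 64   /* assumed cache line size, one node per line */

struct node {
    struct node *next;
    char pad[LINE_BYTES - sizeof(struct node *)];
};

static struct node *volatile sink;   /* keeps the chase loop from being optimized away */

/* Average time per dependent load, in nanoseconds, for a list occupying
 * roughly `bytes` of memory. Returns -1.0 on bad input or failed allocation. */
double chase_latency_ns(size_t bytes, size_t loads)
{
    size_t n = bytes / sizeof(struct node);
    if (n < 2)
        n = 2;
    if (loads == 0)
        return -1.0;

    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *perm = malloc(n * sizeof *perm);
    if (nodes == NULL || perm == NULL) {
        free(nodes);
        free(perm);
        return -1.0;
    }

    /* Sattolo's algorithm: yields a permutation that is one cycle of length n.
     * rand() is crude (modulo bias, small RAND_MAX on some platforms) but
     * any j < i still gives a valid single cycle. */
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = perm[i];
        perm[i] = perm[j];
        perm[j] = tmp;
    }
    for (size_t i = 0; i < n; i++)
        nodes[i].next = &nodes[perm[i]];
    free(perm);

    /* Warm-up pass so the list is resident in whatever level it fits in. */
    struct node *p = nodes;
    for (size_t i = 0; i < n; i++)
        p = p->next;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < loads; i++)
        p = p->next;              /* each load waits for the previous one */
    sink = p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(nodes);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)loads;
}
```

Calling it with sizes ranging from a few KB to well beyond the last-level cache should show plateaus roughly matching each level. Run under an OS like this, the numbers also include TLB misses and scheduler noise, so treat them as approximate.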

etep

Ismael Luceno · Answer 3 · 2012-01-05T18:44:24.457

To make the measurements, you need to do it early, on the bare metal, because you don't want any interference (e.g. clock rate changes or bus contention).

You will have to write a little bit of assembly code. On x86 the steps would be:

1. execute a serializing instruction
2. read the time stamp counter
3. execute a serializing instruction
4. do a memory read
5. execute a serializing instruction
6. read the time stamp counter again
7. execute a serializing instruction
8. do the math
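
The steps above can be sketched in C with GCC/Clang inline assembly on x86-64. This is only an illustration under my own assumptions: it uses CPUID as the serializing instruction, the function names are hypothetical, and it runs in user space rather than on the bare metal, so interrupts and frequency changes can still disturb it.

```c
#include <stdint.h>

/* CPUID is a serializing instruction: all earlier instructions complete
 * before it executes, and later ones do not start until it is done. */
void serialize(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ __volatile__("cpuid"
                         : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                         :
                         : "memory");
    (void)ebx;
    (void)edx;
}

/* Read the 64-bit time stamp counter (RDTSC returns it in EDX:EAX). */
uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* TSC ticks spent on one read of *addr, including harness overhead. */
uint64_t timed_read(const volatile uint64_t *addr)
{
    serialize();
    uint64_t start = read_tsc();
    serialize();
    uint64_t value = *addr;       /* the memory read being measured */
    serialize();
    uint64_t end = read_tsc();
    serialize();
    (void)value;
    return end - start;
}

/* Same sequence without the read, to estimate the overhead to subtract. */
uint64_t timed_empty(void)
{
    serialize();
    uint64_t start = read_tsc();
    serialize();
    serialize();
    uint64_t end = read_tsc();
    serialize();
    return end - start;
}
```

"Doing the math" would then mean subtracting timed_empty() from timed_read() and repeating many times, taking the minimum or median. On many CPUs the TSC ticks at a fixed reference rate rather than at the current core clock, so the result may need scaling before it can be read as core cycles.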

Once you have that working, all that is left is to plan the experiments and play with the caches. Keep in mind that cache sizes and architecture play a huge role here, so you'll need to tailor the measurements to the CPU in question. You may also want to play with prefetching to make filling the caches easier.
