A quick Google search for "intel cpu cache latency" turns up a relevant Intel paper: Measuring Cache and Memory Latency and CPU to Memory Bandwidth. In it, the authors use LMbench to perform the measurements.
How to take Measurements
Use the executable binary file called “lat_mem_rd”
found in the “bin” folder of the utility’s directory. Next, use the following
command line:
taskset 0x1 ./lat_mem_rd -N [x] -P [y] [depth] [stride]
Where [x] equals the number of times the process is run before reporting
latency. Typically setting this to ‘1’ is sufficient for accurate measurements.
For the ‘-P’ option, [y] equals the number of processes invoked to run the
benchmark. The recommendation for this is always ‘1.’ It is sufficient to
measure the access latency with only one processing core or thread. The
[depth] specification indicates how far into memory the utility will measure.
In order to ensure an accurate measurement, specify an amount that will go
far enough beyond the cache so that it does not factor in latency
measurements.
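As a rough illustration of the command line quoted above, here is a minimal Python sketch that assembles it with the recommended values of 1 for -N and -P. The helper name build_lat_mem_rd_command is hypothetical, and the depth of 128 and stride of 128 are illustrative values only, not taken from the paper. The sketch assumes, as the LMbench manual describes, that depth is given in megabytes and stride in bytes; pick a depth well beyond your last-level cache size.

```python
import shlex
from typing import List


def build_lat_mem_rd_command(depth_mb: int, stride_bytes: int,
                             repetitions: int = 1, processes: int = 1,
                             cpu_mask: str = "0x1",
                             binary: str = "./lat_mem_rd") -> List[str]:
    """Assemble the argument list for the lat_mem_rd run quoted above.

    depth_mb and stride_bytes are assumed to be megabytes and bytes.
    """
    return [
        "taskset", cpu_mask,        # pin the benchmark to one core
        binary,
        "-N", str(repetitions),     # [x]: runs before reporting latency
        "-P", str(processes),       # [y]: number of benchmark processes
        str(depth_mb),              # [depth]: how far into memory to measure
        str(stride_bytes),          # [stride]
    ]


if __name__ == "__main__":
    # Illustrative values only: 128 MB depth, 128-byte stride.
    argv = build_lat_mem_rd_command(depth_mb=128, stride_bytes=128)
    print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on a machine where taskset and the LMbench binaries are available.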
Understanding the Results
Since L1 and L2 cache latency ties to the core clock, CPU frequency plays a role in how
fast memory accesses happen in real time. This means the number of core
clocks stays the same independent of the core frequency. For a comparable
result, it is best to convert the latency given by LMBench from nanoseconds
into CPU clocks. To do this, multiply the latency by the processor frequency.
Time(seconds) * Frequency(Hz) = Clocks of latency
Therefore, if a 2.4 GHz processor takes 17 ns to access a certain level of
cache, this converts to:
17 x 10^-9 seconds * 2400000000 Hz = 17 ns * 2.4 GHz ≈ 41 Clocks
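The same conversion can be sketched in a few lines of Python. The function name latency_ns_to_clocks is hypothetical; it relies only on the formula quoted above, using the fact that nanoseconds times gigahertz gives clocks directly (10^-9 s * 10^9 Hz = 1).

```python
def latency_ns_to_clocks(latency_ns: float, frequency_ghz: float) -> float:
    """Convert a latency in nanoseconds to core clocks.

    Time(seconds) * Frequency(Hz) = clocks, and the 1e-9 (ns) and
    1e9 (GHz) factors cancel, so ns * GHz is already a clock count.
    """
    return latency_ns * frequency_ghz


if __name__ == "__main__":
    # The example from the text: 17 ns on a 2.4 GHz processor.
    clocks = latency_ns_to_clocks(17.0, 2.4)
    print(f"{clocks:.1f} clocks")  # 40.8 clocks, i.e. roughly 41
```

Use the frequency the core actually ran at during the measurement; if frequency scaling or turbo is active, the nominal frequency may give a misleading clock count.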