
I have checked several sites where manufacturers publish information about L1, L2, L3 and main-memory access times in nanoseconds or cycles: skylake info

  • Is it possible to calculate those values using the results from memtest?
  • If not, how are they measured?

I can run external tools, but they perform their tests using C/assembler code - is that the only way to do it?

Example output from memtest86:

Intel i7 @ 3.6GHz

CLK/TEMP      3645 mhz   44C
L1 Cache:     64K        291.81 GB/s
L2 Cache:     256K       125.52 GB/s
L3 Cache:     12288K     56.56 GB/s
Memory:       31.8 GB    20.84 GB/s

RAM Info: PC4-25600 DDR4 XMP 3200MHz /  16-18-18-38 / G-Skill INtl F4-3200C
  • Running code on your CPU is pretty obviously the only way to actually measure / microbenchmark its performance... And using assembly is obviously the best way to control what actual machine code is in a microbenchmark. – Peter Cordes May 06 '19 at 22:36
  • I understand that. However I was just wondering if there is any way to estimate those values based on computer info like cpu clock rate, memory speed, cache size, architecture etc. and then test against those values using c/assembler code. – KarlR May 06 '19 at 22:55
  • Oh, well Intel publishes (in their optimization manual) some numbers for L1d latency, and maybe L2. L3 latency depends on ring bus geometry / how many hops away, and gets complicated. Other than that, you can look at other people's microbenchmark results to find out what they measured for CPUs you don't personally have access to. (e.g. https://www.7-cpu.com/). See also [Is there a penalty when base+offset is in a different page than the base?](//stackoverflow.com/q/52351397) for where L1d latency gets tricky between the 4c special case vs. the 5c normal case on Intel. – Peter Cordes May 06 '19 at 22:59
  • Cache latency measured in clock cycles is basically a free parameter that CPU designers can relax if needed, or tighten up when possible. Some workloads are sensitive to especially L1d latency, but for outer caches the hit rate from making them bigger is often more valuable than lower latency from keeping them smaller. (And high bandwidth is always nice, but isn't always directly correlated with latency. Fun fact: memory latency is the limiting factor in single-threaded bandwidth on many CPUs: [Why is Skylake so much better than Broadwell-E for ST memory throughput?](//stackoverflow.com/q/39260020)) – Peter Cordes May 06 '19 at 23:02
  • @PeterCordes, thanks for such comprehensive answer :) – KarlR May 07 '19 at 11:04

1 Answer


Is it possible to calculate those values using the results from memtest?

No.

If not, how are they measured?

The source code of the tools used to produce the results shown on https://www.7-cpu.com/ is publicly available at https://www.7-cpu.com/utils.html. In particular, the MemLat tool measures the access latency of each level of the memory hierarchy.

The mainstream method for measuring latency is pointer chasing: a linked list of 64-byte elements is created in which each element points to another randomly chosen element (the randomization defeats the hardware prefetchers). If the total size of the list fits in the L1 cache, then by iterating over it a sufficiently large number of times, the L1 latency can be obtained by dividing the total execution time by the number of elements accessed. The microbenchmark can be simplified by disabling the hardware prefetchers, in which case randomization is unnecessary. It's also recommended to use 1 GB pages (or at least 2 MB pages) instead of 4 KB pages to ensure that the whole list is allocated from a contiguous chunk of physical memory; otherwise, multiple 4 KB pages may map to the same cache sets and cause conflict misses. A minimal sketch is shown below.
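Here is a minimal pointer-chasing sketch (not the MemLat tool itself; the buffer size, iteration count and use of `rand()` are illustrative choices, and it uses ordinary 4 KB pages):

```c
/* Pointer-chasing latency sketch. Compile with: gcc -O2 chase.c
   Assumptions: x86-64, 64-byte cache lines, C11 (aligned_alloc), POSIX clock_gettime. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE      64                  /* assumed cache-line size in bytes        */
#define BUF_BYTES (32 * 1024)         /* 32 KiB: fits a typical L1d cache        */
#define NODES     (BUF_BYTES / LINE)
#define ITERS     200000000L

typedef struct node { struct node *next; char pad[LINE - sizeof(void *)]; } node;

int main(void) {
    node *buf = aligned_alloc(LINE, NODES * sizeof(node));
    if (!buf) return 1;

    /* Build a random single-cycle permutation (Sattolo's algorithm) so the
       hardware prefetchers cannot guess the next address. */
    size_t perm[NODES];
    for (size_t i = 0; i < NODES; i++) perm[i] = i;
    srand(1);
    for (size_t i = NODES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;            /* j < i keeps it one cycle */
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < NODES; i++)
        buf[perm[i]].next = &buf[perm[(i + 1) % NODES]];

    struct timespec t0, t1;
    node *p = &buf[perm[0]];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        p = p->next;                              /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the compiler cannot optimize the chase away. */
    printf("%.2f ns per load (last node %p)\n", ns / ITERS, (void *)p);
    return 0;
}
```

With `BUF_BYTES` at 32 KiB the loop should report roughly the L1d load-to-use latency; growing the buffer past the L1, L2 and L3 sizes shows the latency of each successive level (plus TLB effects, which is where the huge-page recommendation above comes in).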

The reason pointer chasing works is that the address of each load depends on the value returned by the previous load, so the loads are fully serialized, and current Intel and AMD processors don't employ value prediction techniques that could break that dependency.

There is another way to measure latency: wrap RDTSC/RDTSCP around a single memory access instruction, essentially treating a single memory access as a short elapsed-time event. See: Memory latency measurement with time stamp counter.
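A rough sketch of that approach with GCC/Clang intrinsics on x86-64 is below. It only shows the mechanics; in practice you have to measure and subtract the overhead of the timing instructions themselves, deal with their partial serialization, and flush the line first if you want to time a cache miss, as discussed in the linked question:

```c
/* Timing a single load with RDTSCP. Compile with: gcc -O2 rdtscp_load.c
   Result is in TSC ticks, not core clock cycles, and includes timing overhead. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    static volatile uint64_t cell = 42;   /* the memory location being loaded */
    unsigned aux;

    _mm_lfence();                         /* keep earlier work out of the timed region */
    uint64_t start = __rdtscp(&aux);      /* waits for prior instructions to retire */
    uint64_t v = cell;                    /* the access being timed */
    uint64_t end = __rdtscp(&aux);
    _mm_lfence();

    printf("load of %llu took ~%llu TSC ticks (including timing overhead)\n",
           (unsigned long long)v, (unsigned long long)(end - start));
    return 0;
}
```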

Hadi Brais