Accurate memory access time probing with RDTSC and RDTSCP?

Question

I'm trying to make an accurate measurement of memory access to different cache levels, and came up with this code for probing:

    __asm__ __volatile__(
        "xor %%eax, %%eax   \n"
        "xor %%edi, %%edi   \n"
        "xor %%edx, %%edx   \n"
        /* time measurement */
        "lfence              \n"
        "rdtsc              \n"
        "shl $32, %%rdx        \n"
        "or %%rdx, %%rax    \n"
        "movq %%rax, %%rdi  \n"
        /* memory access */
        "movq (%%rsi), %%rbx\n"
        /* time measurement */
        "rdtscp              \n"
        "shl $32, %%rdx     \n"
        "or %%rdx, %%rax    \n"
        "movq %%rax, %%rsi  \n"
        "cpuid              \n"
        : /* output operands */
        "=S"(t2), "=D"(t1)
        : /* input operands */
        "S" (mem)
        : /* clobber description */
        "ebx", "ecx", "edx", "cc", "memory"
    );

However, the L1 and L2 cache accesses differ by only 8 cycles and the results fluctuate too much, so I decided to check how much impact the surrounding code (apart from the actual memory access) has on the timing:

    __asm__ __volatile__(
        "xor %%eax, %%eax   \n"
        "xor %%edi, %%edi   \n"
        "xor %%edx, %%edx   \n"
        /* time measurement */
        "lfence             \n"
        "rdtsc              \n"
        "shl $32, %%rdx        \n"
        "or %%rdx, %%rax    \n"
        "movq %%rax, %%rdi  \n"
        /* memory access */
        //"movq (%%rsi), %%rbx\n"
        /* time measurement */
        "rdtscp              \n"
        "shl $32, %%rdx     \n"
        "or %%rdx, %%rax    \n"
        "movq %%rax, %%rsi  \n"
        "cpuid              \n"
        : /* output operands */
        "=S"(t2), "=D"(t1)
        : /* input operands */
        "S" (mem)
        : /* clobber description */
        "ebx", "ecx", "edx", "cc", "memory"
    );

The results looked like this:

./cache_testing
From Memory: 42
From L3: 46
From L2: 40
From L1: 38

./cache_testing
From Memory: 40
From L3: 38
From L2: 36
From L1: 40

I'm aware that I'm not deliberately hitting the different cache levels at the moment, but I wonder why the timing fluctuates so much when there is no memory access at all. The code runs as SCHED_FIFO with the highest priority, pinned to one CPU, and shouldn't be preempted while running. Can anybody tell me whether I can improve my code, and thereby the results, in any way?

The correct numbers for cache load->use latency on Intel Haswell are 4c for L1, 12c for L2, according to [Agner Fog's microarch pdf](http://agner.org/optimize/). A great way to measure this (especially for L1) is pointer-chasing. For L1, just set a pointer to point to itself, and run `mov (%rax), %rax` in a loop. For L2, you need a big linked list that doesn't fit in L1. — Peter Cordes, Jun 14 '16 at 10:08
Related [clflush to invalidate cache line via C function](https://stackoverflow.com/a/51830976) has an answer with details on lfence+rdtsc+lfence. — Peter Cordes, Aug 18 '18 at 14:25
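The pointer-chasing idea from the first comment can be sketched in C roughly as follows. This is only an illustration under assumptions: the names `chase` and `ticks_per_load` are made up here, it needs GCC or Clang on x86-64, it should be built with optimization so the loop variable stays in a register, and the result is in TSC reference ticks rather than core clock cycles.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Follow the chain `steps` times. Each load address depends on the result
 * of the previous load, so the loads cannot overlap. The volatile access
 * keeps the compiler from folding the loop away. */
static void *chase(void *start, long steps)
{
    void *p = start;
    for (long i = 0; i < steps; i++)
        p = *(void *volatile *)p;
    return p;
}

/* Hypothetical helper: average TSC ticks per dependent load. */
static double ticks_per_load(void *start, long steps)
{
    unsigned aux;
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    void *end = chase(start, steps);
    uint64_t t1 = __rdtscp(&aux);   /* waits for the earlier loads */
    _mm_lfence();
    (void)end;
    return (double)(t1 - t0) / (double)steps;
}
```

For the L1 case, `void *self = &self; ticks_per_load(&self, 1000000);` chases a pointer that points to itself. For L2, `start` would instead point into a linked list too large for L1, as the comment describes.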

score 2 · Answer 1 · answered Jun 14 '16 at 10:32

You're right that, to fix your measurement code, you need to time an empty setup as a baseline and subtract that measurement overhead.
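A minimal sketch of that baseline subtraction, assuming GCC or Clang on x86-64 and using the compiler intrinsics rather than inline asm. The helper names (`empty_ticks`, `load_ticks`, `net_load_ticks`) and the choice of taking the minimum over many runs are illustrative assumptions, not something from the question:

```c
#include <stdint.h>
#include <x86intrin.h>

/* TSC ticks for the timing scaffold alone (no code under test). */
static inline uint64_t empty_ticks(void)
{
    unsigned aux;
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

/* Same scaffold with one load from *p in the timed region. */
static inline uint64_t load_ticks(const volatile uint64_t *p)
{
    unsigned aux;
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    _mm_lfence();
    (void)*p;                       /* volatile read: the load is emitted */
    uint64_t t1 = __rdtscp(&aux);
    _mm_lfence();
    return t1 - t0;
}

/* Hypothetical helper: take the minimum of n runs of each variant,
 * then subtract the baseline from the measurement with the load. */
static uint64_t net_load_ticks(const volatile uint64_t *p, int n)
{
    uint64_t base = UINT64_MAX, with_load = UINT64_MAX;
    for (int i = 0; i < n; i++) {
        uint64_t e = empty_ticks();
        uint64_t l = load_ticks(p);
        if (e < base) base = e;
        if (l < with_load) with_load = l;
    }
    return with_load > base ? with_load - base : 0;
}
```

Taking the minimum rather than the mean is a design choice that discards samples inflated by interrupts or other noise. The result is clamped at zero because the difference can be lost in the noise for a single L1 hit.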

Also keep in mind that the TSC counts reference cycles, not core clock cycles, so for this to work you need to make sure your CPU is always running at the same speed. (e.g. disable turbo and use a warm-up loop to get the CPU up to top speed, then TSC counts should match core cycles if you aren't overclocking.)

That probably explains the fluctuation.
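As a rough sketch of such a warm-up loop (assuming GCC or Clang on x86-64): `warm_up` is a hypothetical helper, and how many ticks are enough depends on the frequency governor and the hardware, so the number passed in is a guess you would have to tune.

```c
#include <stdint.h>
#include <x86intrin.h>

/* Busy-wait until at least `ticks` TSC ticks have passed, giving the
 * CPU time to ramp up its clock before any measurement starts. */
static void warm_up(uint64_t ticks)
{
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < ticks) {
        /* spin: keeps the core busy so it does not drop to a low clock */
    }
}
```

Calling something like `warm_up(1000000000ULL)` once before the timed runs spins for a few hundred milliseconds on a GHz-range TSC.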

I usually measure stuff with perf counters, not RDTSC.

But I think you should be using a serializing instruction (like CPUID) before the first RDTSC. Using a CPUID after the second RDTSC probably isn't useful. RDTSCP for the second measurement is useful, since it means the timestamp comes from after the load has executed. (The manual says "executed"; IDK if that means "retired" or just literally executed by a load port.)

So IIRC, your best bet is:

 # maybe set eax to something before CPUID
 cpuid
 rdtsc
 shl  $32, %%rdx
 lea  (%%rax, %%rdx),  %%rsi

 ... code under test

 # CPUID here, too, if you can only use rdtsc instead of rdtscp
 rdtscp
 shl  $32, %%rdx
 or   %%rdx, %%rax
 sub  %%rsi, %%rax
 # time difference in RAX

If the code under test competes for the same ALU ports as the shift/LEA, you could just mov the low 32 bits of the first RDTSC result to another register instead of dealing with the high 32 bits at all. If you assume that the difference in timestamps is much less than 2^32, you don't need the high 32 bits of either count.
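The low-32-bit variant could look roughly like this with GCC-style inline asm on x86-64. `tsc_low32` and `elapsed32` are hypothetical names, and the sketch relies on the stated assumption that the interval is shorter than 2^32 ticks:

```c
#include <stdint.h>

/* Read only the low 32 bits of the TSC. EDX is clobbered by RDTSC but
 * never shifted or merged, so no extra ALU work is needed. */
static inline uint32_t tsc_low32(void)
{
    uint32_t lo;
    __asm__ __volatile__("lfence\n\t"
                         "rdtsc"
                         : "=a"(lo)
                         :
                         : "edx", "memory");
    return lo;
}

/* Unsigned subtraction wraps modulo 2^32, so the difference is correct
 * even if the low half overflowed between the two reads. */
static inline uint32_t elapsed32(uint32_t start, uint32_t end)
{
    return end - start;
}
```

For example, a start value of `0xFFFFFFF0` and an end value of `0x00000010` give an elapsed count of `0x20`, because the subtraction wraps around.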

I've read that measuring tiny sequences like this on modern CPUs can be done better with performance counters than with the TSC. Agner Fog's test programs include code for using perf counters from inside a program to measure something. This can let you measure core cycles regardless of turbo or non-turbo, because the core clock cycles performance-counter actually counts at one per physical clock cycle.
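As an illustration of counting core clock cycles from inside a program, here is a sketch that uses the Linux `perf_event_open` system call. This is a substitute technique, not how Agner Fog's test programs work. The helper names are hypothetical, the counter may be unavailable (for example in a VM or with a restrictive `perf_event_paranoid` setting), and the enable/disable calls cost far more than a single load, so the region under test should repeat the operation many times.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a user-space-only core-cycle counter for the calling thread.
 * Returns a file descriptor, or -1 if the counter is not available. */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Count core cycles spent in fn(). Returns -1 on any failure. */
static int64_t count_cycles(int fd, void (*fn)(void))
{
    uint64_t cycles = 0;
    if (fd < 0)
        return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &cycles, sizeof cycles) != (ssize_t)sizeof cycles)
        return -1;
    return (int64_t)cycles;
}

/* Example region under test: a short loop the compiler must keep. */
static void example_work(void)
{
    for (volatile int i = 0; i < 100000; i++) {
    }
}
```

Because this counter follows the core clock, the result does not depend on whether turbo is active, unlike a TSC difference.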

Update: `lfence; rdtsc` does serialize it, and is more efficient than CPUID. — Peter Cordes, Jan 29 '18 at 14:26
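A sketch of the LFENCE-based variant as GCC-style inline asm with the outputs and clobbers declared, assuming x86-64. The names `tsc_begin`, `tsc_end`, and `time_load` are hypothetical, and the extra LFENCE after each timestamp follows the lfence+rdtsc+lfence pattern mentioned in the comments under the question:

```c
#include <stdint.h>

/* LFENCE waits for earlier instructions to finish before RDTSC runs;
 * the second LFENCE keeps the code under test from starting early. */
static inline uint64_t tsc_begin(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("lfence\n\t"
                         "rdtsc\n\t"
                         "lfence"
                         : "=a"(lo), "=d"(hi)
                         :
                         : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* RDTSCP waits for earlier instructions (including the load) to execute;
 * it also writes ECX, which is declared as an output and then ignored. */
static inline uint64_t tsc_end(void)
{
    uint32_t lo, hi, aux;
    __asm__ __volatile__("rdtscp\n\t"
                         "lfence"
                         : "=a"(lo), "=d"(hi), "=c"(aux)
                         :
                         : "memory");
    (void)aux;
    return ((uint64_t)hi << 32) | lo;
}

/* TSC ticks for one load from *p, timing overhead included. */
static uint64_t time_load(const volatile uint64_t *p)
{
    uint64_t t0 = tsc_begin();
    (void)*p;
    uint64_t t1 = tsc_end();
    return t1 - t0;
}
```

Letting the compiler pick registers through output constraints avoids hand-listing every register the asm overwrites. The returned value still contains the measurement overhead, so an empty baseline has to be subtracted as described at the top of this answer.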
