
I am using RDTSC and RDTSCP in NASM to measure machine cycles for various assembly language instructions to help in optimization.

I read "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures" by Gabriele Paoloni at Intel (September 2010) and other web resources (most of which were examples in C).

Using the code below (translated from C), I test various instructions, but RDTSCP always returns zero in RDX and 7 in RAX. I first thought 7 was the number of cycles, but obviously not every instruction takes 7 cycles.

rdtsc
cpuid
addsd xmm14,xmm1 ; Instruction to time
rdtscp
cpuid

This returns 7, which is not surprising because on some architectures addsd is 7 cycles with latency included. The first two instructions can (according to some) be reversed, cpuid first then rdtsc, but that makes no difference here.

When I change the instruction to a 2-cycle instruction:

rdtsc
cpuid
add rcx,rdx ; Instruction to time
rdtscp
cpuid

This also returns 7 in rax and zero in rdx.

So my questions are:

  1. How do I access and interpret the values returned in RDX:RAX?

  2. Why does RDX always return zero, and what is it supposed to return?

UPDATE:

If I change the code to this:

cpuid
rdtsc
mov [start_time],rax
addsd xmm14,xmm1 ; INSTRUCTION
rdtscp
mov [end_time],rax
cpuid
mov rax,[end_time]
mov rdx,[start_time]
sub rax,rdx

I get 64 in rax, but that sounds like too many cycles.

  • What's with the `cpuid` instructions? You are aware that they modify registers, right? – cHao Feb 10 '19 at 21:51
  • cpuid is a serializing instruction to wait for all earlier instructions to finish and to prevent later instructions from starting before the TSC is read. – RTC222 Feb 10 '19 at 21:54
  • Pick an instruction that won't clobber your results...? – cHao Feb 10 '19 at 21:55
  • According to all the resources, including Paoloni, CPUID is the one to use. If I access rax before the final call to CPUID, the result is a very large integer. – RTC222 Feb 10 '19 at 21:59
  • However, please see the update to my question above. – RTC222 Feb 10 '19 at 22:10
  • Listen: ***CPUID overwrites eax, ecx, and edx***. So when you execute CPUID, the timing info from RDTSCP is gone and replaced by some numbers that have nothing to do with time, but are telling you about your CPU type instead. – Nate Eldredge Feb 10 '19 at 22:15
  • I reordered my code as shown in the update to my question above (I now take rax before the final call to CPUID, not after) but now I get 64, which is more like it but seems like too many cycles for an ADDDSD. – RTC222 Feb 10 '19 at 22:17
  • Well, it's also counting the time to store the contents of `eax` in memory, which could conceivably be slow. Particularly since `rdtscp` will apparently wait for the store to be globally visible with a memory fence. What if you save that value in some other register instead of in memory? – Nate Eldredge Feb 10 '19 at 22:17
  • I'll calculate that (as best I can) and subtract it from the result. Alternatively, I access rax the same way for every metric I time, so I expect they will all be higher by approx. the same amount, which is OK for my purposes here. For example, add rcx,rdx returns 61, which is (I'm guessing) 3 cycles faster than addsd. – RTC222 Feb 10 '19 at 22:19
  • Using a register instead of memory lowers the number (to 36 and 39) but I don't want to use registers (I'm optimizing) so I can use a memory location with the adjustment I mentioned. However, I'm still puzzled why the numbers are so high. Am I interpreting the values returned in RDX:RAX correctly? – RTC222 Feb 10 '19 at 22:30

1 Answer


Your first code (leading to the title question) is buggy because it overwrites the rdtsc and rdtscp results with the cpuid results in EAX, EBX, ECX, and EDX.

Use lfence instead of cpuid; on Intel since forever and AMD with Spectre mitigation enabled, lfence will serialize the instruction stream and thus do what you want with rdtsc.
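
For reference, a minimal sketch of what the lfence version of that measurement could look like (the use of r8 as scratch is just for illustration). rdtsc and rdtscp return the 64-bit TSC split across EDX (high half) and EAX (low half), with the upper 32 bits of RAX and RDX zeroed, so a shift and an or rebuild the full count:

    lfence                  ; wait for earlier instructions to finish
    rdtsc                   ; EDX:EAX = start TSC
    shl   rdx, 32
    or    rax, rdx
    mov   r8, rax           ; start count, kept out of RAX/RDX

    addsd xmm14, xmm1       ; instruction under test

    lfence                  ; wait for the timed work to finish
    rdtsc                   ; EDX:EAX = end TSC
    shl   rdx, 32
    or    rax, rdx
    sub   rax, r8           ; RAX = elapsed reference cycles, measurement overhead included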


Remember that RDTSC counts reference cycles, not core clock cycles. See Get CPU cycle count? for that and for more about RDTSC.

You don't have cpuid or lfence inside your measurement interval, but you do have rdtscp itself there. Back-to-back rdtscp is not fast; 64 reference cycles sounds totally reasonable if you ran without warming up the CPU. The idle clock speed is usually a lot slower than the reference frequency; 1 reference cycle is equal or close to the "sticker" frequency (i.e. the max non-turbo sustained frequency) on Intel CPUs, e.g. 4008 MHz on a "4GHz" Skylake CPU.


This is not how you time a single instruction

What matters is latency before another instruction can use the result, not latency until it fully retires from the out-of-order back-end. RDTSC can be useful for timing relative variations in how long one load or one store instruction takes, but the overhead means you won't get a good absolute time.

You can try to subtract measurement overhead, though; see e.g. clflush to invalidate cache line via C function, and also the follow-ups Using time stamp counter and clock_gettime for cache miss and Memory latency measurement with time stamp counter.
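
One rough way to estimate that overhead (a sketch of the general idea, not the exact method from those links; the [timing_overhead] location is hypothetical) is to time an empty region with the same fenced sequence, take the minimum over many repetitions, and subtract it from later measurements:

    lfence
    rdtsc
    shl   rdx, 32
    or    rax, rdx
    mov   r8, rax                ; start
    ;; empty measured region
    lfence
    rdtsc
    shl   rdx, 32
    or    rax, rdx
    sub   rax, r8                ; ~fixed measurement overhead in reference cycles
    mov   [timing_overhead], rax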


This is what I usually use to profile the latency or throughput (and fused- and unfused-domain uop counts) of an instruction or short block. Adjust how you use it to bottleneck on latency like here, or not if you just want to test throughput: e.g. use a %rep block with enough different registers to hide latency, or break dependency chains with a pxor xmm3, xmm3 after a short block and let out-of-order exec work its magic (as long as you don't bottleneck on the front-end). A throughput-oriented variant of the loop body is sketched after the example below.

You might want to use NASM's smartalign package, or use YASM, to avoid a wall of single-byte NOP instructions for the ALIGN directive. NASM defaults to really stupid NOPs even in 64-bit mode where long-NOP is always supported.
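
With NASM that can be as simple as enabling the package near the top of the file (the p6 mode and 32-byte threshold here are just one reasonable choice):

%use smartalign              ; enable NASM's smartalign package
alignmode p6, 32             ; long-NOP padding for the ALIGN directive; the threshold is optional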

global _start
_start:
    mov   ecx, 1000000000
; linux static executables start with XMM0..15 already zeroed
align 32                     ; just for good measure to avoid uop-cache effects
.loop:
    ;; LOOP BODY, put whatever you want to time in here
    times 4   addsd  xmm4, xmm3

    dec   ecx
    jnz   .loop

    mov  eax, 231
    xor  edi, edi
    syscall          ; x86-64 Linux sys_exit_group(0)
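
To test throughput instead of latency, one option (a sketch of the multiple-accumulator idea mentioned above) is to give the loop body enough independent dependency chains to hide the latency; with 4-cycle latency and 2-per-clock throughput, 8 accumulators is just enough:

.loop:
    ;; throughput version: 8 independent chains, all reading the zeroed xmm3
    addsd  xmm4,  xmm3
    addsd  xmm5,  xmm3
    addsd  xmm6,  xmm3
    addsd  xmm7,  xmm3
    addsd  xmm8,  xmm3
    addsd  xmm9,  xmm3
    addsd  xmm10, xmm3
    addsd  xmm11, xmm3

    dec   ecx
    jnz   .loop

If nothing else gets in the way, that should run at about 4 cycles per iteration (2 addsd per clock) instead of 16.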

Run this with something like this one-liner that links it into a static executable and profiles it with perf stat, which you can up-arrow and re-run every time you change the source:

(I actually put the nasm+ld + optional disassemble into a shell script called asm-link, to save typing when I'm not profiling. Disassembling makes sure that what's in your loop is what you meant to profile, especially if you have some %if stuff in your code. And also so it's on your terminal right before the profile, if you want to scroll back while testing theories in your head.)

t=testloop; nasm -felf64 -g "$t.asm" && ld "$t.o" -o "$t" &&  objdump -drwC -Mintel "$t" &&
 taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread -r4 ./"$t"

Result from i7-6700k at 3.9GHz (current perf has a unit-scaling display bug for the secondary column. It's fixed upstream but Arch Linux hasn't updated yet.):

 Performance counter stats for './testloop' (4 runs):

          4,106.09 msec task-clock                #    1.000 CPUs utilized            ( +-  0.01% )
                17      context-switches          #    4.080 M/sec                    ( +-  5.65% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.487 M/sec                  
    16,012,778,144      cycles                    # 3900323.504 GHz                   ( +-  0.01% )
     1,001,537,894      branches                  # 243950284.862 M/sec               ( +-  0.00% )
     6,008,071,198      instructions              #    0.38  insn per cycle           ( +-  0.00% )
     5,013,366,769      uops_issued.any           # 1221134275.667 M/sec              ( +-  0.01% )
     5,013,217,655      uops_executed.thread      # 1221097955.182 M/sec              ( +-  0.01% )

          4.106283 +- 0.000536 seconds time elapsed  ( +-  0.01% )

On my i7-6700k (Skylake), addsd has 4 cycle latency, 0.5c throughput. (i.e. 2 per clock, if latency wasn't the bottleneck). See https://agner.org/optimize/, https://uops.info/, and http://instlatx64.atw.hu/.

16 cycles per branch = 16 cycles per chain of 4 addsd = 4 cycle latency for addsd, reproducing Agner Fog's measurement of 4 cycles to better than 1 part in 100, even though this test includes a tiny bit of startup overhead and interrupt overhead.

Take your pick of different counters to record. Adding a :u suffix, like instructions:u, to a perf event will count only user-space instructions, excluding any that ran during interrupt handlers. I usually don't do that, so I can see that overhead as part of the explanation for wall-clock time. But if you do, cycles:u can match very closely with instructions:u.
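
For example, a lightly modified version of the perf invocation above that counts only user-space cycles and instructions:

taskset -c 3 perf stat -etask-clock,context-switches,cycles:u,instructions:u -r4 ./"$t"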

-r4 runs it 4 times and averages, which can be useful to see if there's a lot of run-to-run variation instead of just getting one average from a higher value in ECX.

Adjust your initial ECX value to make the total time about 0.1 to 1 second, that's usually plenty, especially if your CPU ramps up to max turbo very quickly (e.g. Skylake with hardware P-states and a fairly aggressive energy_performance_preference). Or max non-turbo with turbo disabled.

But this counts in core clock cycles, not reference cycles, so it still gives the same result regardless of CPU frequency changes. (+- some noise from stopping the clock during the transition.)

Peter Cordes
  • Thanks for the detailed answer. I'm on Windows so I'm doing some additional research so I can post a follow-up question if needed. – RTC222 Feb 11 '19 at 18:15
  • @RTC222: Then call Windows exit function instead of making a Linux system call. And use your choice of PMU-counter profiler on Windows instead of Linux `perf`. Or just look at wall-clock time for a known CPU frequency, if you can ensure that (e.g. by disabling turbo.) Usually latency is in whole numbers of cycles, so it's very easy. Throughput can be harder if there are weird effects, or front-end complications. – Peter Cordes Feb 12 '19 at 00:30
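
For reference, a minimal sketch of that Windows-side exit (replacing the sys_exit_group syscall at the bottom of the loop file, and assuming the object file is linked against kernel32 so ExitProcess resolves):

extern ExitProcess           ; from kernel32.dll

    sub   rsp, 40            ; shadow space + 16-byte stack alignment required by the Win64 calling convention
    xor   ecx, ecx           ; exit code 0 (first integer argument is passed in RCX)
    call  ExitProcess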