
I'm trying to measure the number of CPU cycles a single assembly instruction takes. To do this, I've written the following function:

measure_register_op:
    # Measure the time required for a movl operation

    # function setup
    pushl %ebp
    movl %esp, %ebp
    pushl %ebx
    pushl %edi

    xor %edi, %edi

    # first time measurement
    xorl %eax, %eax
    cpuid               # sync of threads
    rdtsc               # result in edx:eax

    # we are measuring the instruction below
    movl %eax, %edi

    # second time measurement
    cpuid               # sync of threads
    rdtsc               # result in edx:eax

    # time difference
    sub %eax, %edi

    # move to EAX. Value of EAX is what function returns
    movl %edi, %eax

    # End of function
    popl %edi
    popl %ebx
    mov %ebp, %esp
    popl %ebp

    ret

I call it from a .c file:

#include <stdio.h>

extern unsigned int measure_register_op(void);

int main(void)
{

    for (int a = 0; a < 10; a++)
    {
        printf("Instruction took %u cycles \n", measure_register_op());
    }

    return 0;
}

The problem is that the values I see are far too large. I'm currently getting 3684414156. What could be going wrong here?

EDIT: I changed EBX to EDI, but the result is still similar. It has to be something with rdtsc itself. In the debugger I can see that the second measurement gives 0x7f61e078 and the first gives 0x42999940, which after subtraction still leaves 1019758392.

EDIT: Here is my makefile. Maybe I'm compiling it incorrectly:

compile: measurement.s measurement.c
    gcc -g measurement.s measurement.c -o ./build/measurement -m32

EDIT: Here is the exact output I see:

Instruction took 4294966680 cycles 
Instruction took 4294966696 cycles 
Instruction took 4294966688 cycles 
Instruction took 4294966672 cycles 
Instruction took 4294966680 cycles 
Instruction took 4294966688 cycles 
Instruction took 4294966688 cycles 
Instruction took 4294966696 cycles 
Instruction took 4294966688 cycles 
Instruction took 4294966680 cycles 
Piotrek

2 Answers


cpuid clobbers ebx (along with eax, ecx, and edx). You need to either avoid cpuid here or save the start value somewhere it won't be overwritten.

R.. GitHub STOP HELPING ICE
  • `lfence` is a good alternative to CPUID for use with `rdtsc`; it's guaranteed to serialize *instruction* execution on Intel (without flushing the store buffer like a full serializing instruction, but that doesn't matter), and also [on AMD with Spectre mitigation enabled](https://stackoverflow.com/questions/51844886/is-lfence-serializing-on-amd-processors). See also http://akaros.cs.berkeley.edu/lxr/akaros/kern/arch/x86/rdtsc_test.c – Peter Cordes May 18 '19 at 15:41
  • @R.. although your point was correct, I'm still getting wrong answers. I've changed EBX to EDI since it isn't modified by cpuid – Piotrek May 18 '19 at 15:47

In your updated version, which no longer clobbers the start time (the bug @R.. pointed out):

sub %eax, %edi is calculating start - end. This is a negative number, i.e. a huge unsigned number just below 2^32. If you're going to use %u, get used to interpreting its output back to a bit-pattern when debugging.

You want end - start.

And BTW, use lfence; it's significantly more efficient than cpuid. It's guaranteed to serialize instruction execution on Intel (without flushing the store buffer like a full serializing instruction). It's also safe on AMD CPUs with Spectre mitigation enabled.

See also http://akaros.cs.berkeley.edu/lxr/akaros/kern/arch/x86/rdtsc_test.c for some different ways to serialize RDTSC and/or RDTSCP.


See also "Get CPU cycle count?" for more about RDTSC, especially that it doesn't count core clock cycles, only reference cycles. So idle/turbo will affect your results.

Also, the cost of one instruction isn't one-dimensional. It's not particularly useful to time a single instruction with RDTSC like that. See "RDTSCP in NASM always returns the same value" for more about how to measure throughput/latency/uops for a single instruction.

RDTSC can be useful for timing a whole loop or longer sequence of instructions, larger than the OoO execution window of your CPU.

Peter Cordes