0

I spent many time to measure exact clock cycles of given instructions, a portion of code written in C. However, I never could measure exactly how many cycles will take during the runtime, I used PAPI, "rtdsc" hardware counter, clock(), clock_gettime with "CLOCK_REALTIME". However, for every run, I get different numbers, (I don't want to iterate the program over 100000 times and get the average, is not working in my case! even works, the average still is not the precise way in my opinion),

I'm surprised how there is not a precise measurement tool, since we know exactly how many instructions are there, how is possible that we don't have a proper tool to measure!

I profile with oprofile and vtune(which inside uses pref), I know exactly what is going on but how I can measure exactly it? I'm afraid there is no way!

For the sample code below, a call to the Fibonacci function, no way to get the exact number of cycles!

#include <stdio.h>
#include <time.h>
#include <math.h>
#include <stdint.h>

#ifdef PAPI
#include <papi.h>
#endif

uint64_t fib(int n) { return n < 2 ? (uint64_t)n : fib(n-1) + fib(n-2); }

#define rdtscll(val) {                                           \
    unsigned int __a,__d;                                        \
    asm volatile("rdtsc" : "=a" (__a), "=d" (__d));              \
    (val) = ((unsigned long)__a) | (((unsigned long)__d)<<32);   \
}

int main(int argc, char **argv)
{
    struct timespec tv;
    long long start,stop;
#ifdef PAPI
    //  gcc test.c -I/${PAPI_DIR}/include -L/${PAPI_DIR}/lib -O3 -o test -lpapi
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    start = PAPI_get_real_cyc();
    #ifdef math
    fib(10);
    #endif  
    stop = PAPI_get_real_cyc();
    printf("total cycles : %lld\n",stop-start);
#endif  
#ifdef clk
    start = clock();
    #ifdef math
    fib(10);
    #endif  
    stop = clock();
    printf("total cycles : %lld\n",stop-start);
#endif
#ifdef Wtime
    clock_gettime(CLOCK_REALTIME, &tv);
    start= (tv.tv_sec) * 1000000000 + (tv.tv_nsec);
    #ifdef math
    fib(10);
    #endif  
    clock_gettime(CLOCK_REALTIME, &tv);
    stop= (tv.tv_sec) * 1000000000 + (tv.tv_nsec);
    printf("total time : %lld\n",stop-start);
#endif
#ifdef hardware_counter
    rdtscll(start);
    #ifdef math
    fib(10);
    #endif  
    rdtscll(stop);  
    printf("total cycles : %lld\n",stop-start);
#endif
    return 0;
}

I put here the objdump output for some of them, for instance rtdsc:

0000000000400450 <main>:
  400450:   48 83 ec 08             sub    $0x8,%rsp
  400454:   0f 31                   rdtsc  
  400456:   89 c0                   mov    %eax,%eax
  400458:   48 c1 e2 20             shl    $0x20,%rdx
  40045c:   48 09 c2                or     %rax,%rdx
  40045f:   b8 09 00 00 00          mov    $0x9,%eax
  400464:   49 89 d3                mov    %rdx,%r11
  400467:   83 f8 01                cmp    $0x1,%eax
  40046a:   7e 5c                   jle    4004c8 <main+0x78>
  40046c:   44 8d 48 fe             lea    -0x2(%rax),%r9d
  400470:   44 8d 50 fd             lea    -0x3(%rax),%r10d
  400474:   44 8d 40 ff             lea    -0x1(%rax),%r8d
  400478:   44 89 c8                mov    %r9d,%eax
  40047b:   83 e0 fe                and    $0xfffffffe,%eax
  40047e:   41 29 c2                sub    %eax,%r10d
  400481:   44 89 c7                mov    %r8d,%edi
  400484:   e8 37 01 00 00          callq  4005c0 <xfib>
  400489:   41 83 e8 02             sub    $0x2,%r8d
  40048d:   45 39 d0                cmp    %r10d,%r8d
  400490:   75 ef                   jne    400481 <main+0x31>
  400492:   41 83 f9 ff             cmp    $0xffffffff,%r9d
  400496:   44 89 c8                mov    %r9d,%eax
  400499:   75 cc                   jne    400467 <main+0x17>
  40049b:   0f 31                   rdtsc  
  40049d:   89 c0                   mov    %eax,%eax
  40049f:   48 c1 e2 20             shl    $0x20,%rdx
  4004a3:   be 24 0a 40 00          mov    $0x400a24,%esi
  4004a8:   48 09 c2                or     %rax,%rdx
  4004ab:   bf 01 00 00 00          mov    $0x1,%edi
  4004b0:   31 c0                   xor    %eax,%eax
  4004b2:   4c 29 da                sub    %r11,%rdx
  4004b5:   e8 76 ff ff ff          callq  400430 <__printf_chk@plt>
  4004ba:   31 c0                   xor    %eax,%eax
  4004bc:   48 83 c4 08             add    $0x8,%rsp
  4004c0:   c3                      retq   
  4004c1:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4004c8:   44 8d 48 fe             lea    -0x2(%rax),%r9d
  4004cc:   41 83 f9 ff             cmp    $0xffffffff,%r9d
  4004d0:   44 89 c8                mov    %r9d,%eax
  4004d3:   75 92                   jne    400467 <main+0x17>
  4004d5:   eb c4                   jmp    40049b <main+0x4b>
  4004d7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  4004de:   00 00 

What is the way to measure exactly and consistently how many clock cycles a computer uses between a portion of code with no calls to another library, stdout, etc,?

I appreciate any help,

EDIT: Thanks all for your comments, regarding the framework, I use x86_64 and aarch64 on physical machines running ubuntu 16.04 and ubuntu 18.04. In PAPI case on aarch64 could not passed the validation tests so I skipped PAPI. However, I tried also this topic using "perf" but the same results, perf cycle counter -> https://stackoverflow.com/a/64898073/3101659

I think this is related to the fundamental of computer architecture, in von Neuman machines, using "random memory access", "brach prediction" and other stuff is the reason that we can not predict and achieve the same cycles in our program. If I run this program using HLS on FPGA device I can measure exactly how many cycles uses by the device cycle by cycle, wheras, on conventioal machines like von Neuman based machines, this is not possible.

Elephant88
  • 117
  • 1
  • 12
  • It's important to know what kind of CPU and OS you use. – paladin Apr 17 '21 at 20:52
  • Run a short sequence [usually less than a time slice or clock tick] several times and use the _minimum_ time. This accounts for being time sliced away and/or cold/hot cache effects. Although I've used `rdtsc`, `clock_gettime(CLOCK_MONOTONIC,...)` is often more stable and accounts for the capabilities of the TSC on the CPU (e.g. `constant_tsc`, etc.). You may want to consider `x86` performance H/W such as `PEBS` – Craig Estey Apr 17 '21 at 20:55
  • 2
    Assuming this is running on a normal operating system, hardware interrupts can arrive at any time and throw off your count. You'd need to be running on bare metal with interrupts disabled to avoid that. – Nate Eldredge Apr 17 '21 at 20:57
  • 3
    I think it's worth reevaluating the underlying assumption that a given set of instructions takes a fixed amount of cycles to complete. – Emanuel P Apr 17 '21 at 20:59
  • 1
    I'm assuming `x86`. It can have up to 256 pending instructions in flight. Speculative execution, cache effects, memory fetches and/or execution from other SMT H/W threads can be factors. Thus, it ain't that simple. – Craig Estey Apr 17 '21 at 21:08
  • You should imagine a electrical current flow going through CPU back and forth. The things will get simpler. There are some properties as voltage, amperage. I'm not into English language so I can't explain it well. – user14063792468 Apr 17 '21 at 21:10
  • @CraigEstey Thanks for pointing out this one. I will give it a try. – Elephant88 Apr 18 '21 at 08:57

0 Answers0