69

Is there a simple library to benchmark the time it takes to execute a portion of C code? What I want is something like:

int main(){
    benchmarkBegin(0);
    //Do work
    double elapsedMS = benchmarkEnd(0);

    benchmarkBegin(1);
    //Do some more work
    double elapsedMS2 = benchmarkEnd(1);

    double speedup = benchmarkSpeedup(elapsedMS, elapsedMS2); //Calculates relative speedup
}

It would also be great if the library let you do many runs, averaging them and calculating the variance in timing!

Volker Stolz
Mike
    Great question, this helped me out a lot. – Nick Knowlson Jul 22 '12 at 00:51
    Alternatives to in-program timing: http://stackoverflow.com/questions/7456146/is-there-a-better-way-to-benchmark-a-c-program-than-timing – Ciro Santilli OurBigBook.com Jun 29 '15 at 14:01
    Big closed linux question: http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux – Ciro Santilli OurBigBook.com Jun 29 '15 at 14:03
    Doing similar work twice in the same program might let the compiler optimize between them. Building multiple executables that each microbenchmark a single implementation strategy is safer (but more cumbersome). Having the entire run-time of a program be the benchmark makes it easy to compare perf-counter results from `perf stat`, and means you can use external timing stuff like `time ./a.out` instead of including timing code in your C. That said, timing code in the program lets you avoid timing initialization code. And multiple results from one is simpler. – Peter Cordes Jun 12 '16 at 10:35

5 Answers

78

Use the function `clock()` defined in `time.h`:

double startTime = (double)clock()/CLOCKS_PER_SEC;

/* Do work */

double endTime = (double)clock()/CLOCKS_PER_SEC;

double timeElapsed = endTime - startTime;
Spikatrix
Gaurav
    This should be the accepted answer instead of the Windows specific one! – Simon Dec 20 '17 at 21:58
  • `clock()` returns the CPU time rather than the wall clock time, which may surprise you if you have multiple threads executing code when benchmarking. – neevek Apr 29 '18 at 08:35
  • @neevek But it leads to correct results if you have only one thread, since if you use wall-clock time, your system load will influence the benchmark result. If your system is very busy performing background tasks, you get worse benchmark results with wall-clock time than if it isn't, but you will get the same results using CPU time. – Mecki Sep 01 '20 at 17:16
  • What about the precision ? By executing the benchmarked code N times and dividing the measured time by N, we increase the precision. How do we determine N and the precision ? – chmike Dec 12 '21 at 08:53
51

Basically, all you want is a high-resolution timer. The elapsed time is of course just a difference in times, and the speedup is calculated by dividing the times for each task. I have included the code for a high-resolution timer that should work on at least Windows and Unix.

#ifdef _WIN32

#include <windows.h>
double get_time()
{
    LARGE_INTEGER t, f;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&f);
    return (double)t.QuadPart/(double)f.QuadPart;
}

#else

#include <sys/time.h>

double get_time()
{
    struct timeval t;
    gettimeofday(&t, NULL); /* the timezone argument is obsolete; pass NULL */
    return t.tv_sec + t.tv_usec*1e-6;
}

#endif
sheepez
Joe
    Wallclock time (as returned by `gettimeofday`) may not be that useful - `clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...)` will often be what's wanted there. – caf Feb 28 '10 at 11:13
    @caf: A program that uses very little CPU time but spends a lot of time doing blocking I/O or waiting for asynchronous I/O can still be perceived by users to be slow. Both CPU time and wall clock time are important. – bk1e Mar 01 '10 at 02:49
    Yes, that's why I qualified my comment with the weasel-words "may" and "often" ;) By the way, if wallclock time *is* desired, then `clock_gettime(CLOCK_MONOTONIC, ...)` is a better option, because unlike `gettimeofday` it won't be affected by changes to the system clock during the timing interval. – caf Mar 01 '10 at 02:53
  • In my typical usage, I only care about wall clock time, because I am doing resource intensive things. I'm not sure how clock_gettime works with multi-threading, but that seems like an area where wall clock time is the only accurate measure. – Joe Mar 02 '10 at 03:39
    BTW, QueryPerformanceFrequency should not really be called each time. – Joe Sep 10 '14 at 21:42
  • so if a=get_time() before the process to be timed, and b = get_time() after the process, what would (a - b) mean? Is this calculated in seconds? – galois Feb 21 '15 at 23:20
  • @jaska (b-a) would be the number of seconds elapsed for the process measured in your example. (a-b) would be the negative number of seconds (unlikely to be what you wanted). – Joe Feb 23 '15 at 21:50
  • I came here to find a better function for my microbenchmarking, and found this which is virtually line for line identical to what I am already doing. Not sure if I should be disappointed or happy. – Chris Apr 10 '19 at 01:14
5

Benchmark C code easily

#include <stdio.h>
#include <time.h>

int main(void) {
  clock_t start_time = clock();

  // code or function to benchmark

  double elapsed_time = (double)(clock() - start_time) / CLOCKS_PER_SEC;
  printf("Done in %f seconds\n", elapsed_time);
}

Easy benchmark of multi-threaded C code

If you want to benchmark a multithreaded program, you first need to take a closer look at `clock()`:

Description

The clock() function returns an approximation of processor time used by the program.

Return value

The value returned is the CPU time used so far as a clock_t; to get the number of seconds used, divide by CLOCKS_PER_SEC. If the processor time used is not available or its value cannot be represented, the function returns the value (clock_t)(-1)

Hence, because `clock()` accumulates CPU time across all threads, it is very important to divide your elapsed_time by the number of threads in order to approximate the wall-clock execution time of your function (assuming the threads run fully in parallel):

#include <stdio.h>
#include <time.h>
#include <omp.h>

#define THREADS_NB omp_get_max_threads()

clock_t start_time = clock();

// code or function to benchmark, parallelized with e.g.
// #pragma omp parallel for num_threads(THREADS_NB)

double elapsed_time = (double)(clock() - start_time) / CLOCKS_PER_SEC;
printf("Done in %f seconds\n", elapsed_time / THREADS_NB); // divide by THREADS_NB!

Example

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#include <omp.h>

#define N 20000
#define THREADS_NB omp_get_max_threads()

void init_arrays(double *a, double *b) {
  memset(a, 0, sizeof(*a) * N);
  memset(b, 0, sizeof(*b) * N);
  for (int i = 0; i < N; i++) {
    a[i] += 1.0;
    b[i] += 1.0;
  }
}

double func2(double i, double j) {
  double res = 0.0;

  while (i / j > 0.0) {
    res += i / j;
    i -= 0.1;
    j -= 0.000003;
  }
  return res;
}

double single_thread(double *a, double *b) {
  double res = 0;
  int i, j;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      if (i == j) continue;
      res += func2(a[i], b[j]);
    }
  }
  return res;
}

double multi_threads(double *a, double *b) {
  double res = 0;
  int i, j;
  #pragma omp parallel for private(j) num_threads(THREADS_NB) reduction(+:res)
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      if (i == j) continue;
      res += func2(a[i], b[j]);
    }
  }
  return res;
}

int main(void) {
  double *a, *b;
  a = (double *)calloc(N, sizeof(double));
  b = (double *)calloc(N, sizeof(double));
  init_arrays(a, b);

  clock_t start_time = clock();
  double res = single_thread(a, b);
  double elapsed_time = (double)(clock() - start_time) / CLOCKS_PER_SEC;
  printf("Default:  Done with %f in %f seconds\n", res, elapsed_time);

  start_time = clock();
  res = multi_threads(a, b);
  elapsed_time = (double)(clock() - start_time) / CLOCKS_PER_SEC;
  printf("With OMP: Done with %f in %f seconds\n", res, elapsed_time / THREADS_NB);
}

Compile with:

gcc -O3 multithread_benchmark.c -fopenmp && time ./a.out

Output:

Default:  Done with 2199909813.614555 in 4.909633 seconds
With OMP: Done with 2199909799.377532 in 1.708831 seconds

real    0m6.703s (from time function)
chqrlie
Antonin GAVREL
    Aren't you assuming that all threads can fully utilize all cores all the time? So you're under-estimating the amount of real time if there is any synchronization overhead. If you want real-time, ask for real time (with `clock_gettime`), and test on an idle system. Then you can *compare* against total CPU-seconds of CPU time used during that amount of real time. Or minimize startup overhead and make the benchmark repeat enough to dominate the overall runtime, and `perf stat` your whole program and will do all this for you, including showing `task-clock` and 3.800 CPUs utilized or whatever. – Peter Cordes Mar 25 '21 at 04:11
  • I had to make this assumption because there is no way to know the current number of active threads without slowing down the program ;) It is more to have a fair estimate and it works, it is by no means exact. – Antonin GAVREL Mar 25 '21 at 04:12
    You only "had to" if you insist on extrapolating real-time instead of directly measuring it via some slightly-less-portable high rez time source for wall-time. I wouldn't recommend this; like I said, **use a proper real-time clock instead of extrapolating at all**. You're potentially fooling yourself and hiding any serial or less-parallel phases for problems that don't split perfectly into even-sized chunks with uniform amounts of work. (Different OpenMP scheduling options exist to handle this, like dynamic vs. static.) – Peter Cordes Mar 25 '21 at 04:18
    `memset(a, 0, sizeof(a))` is incorrect. You should write `memset(a, 0, sizeof(*a) * N)` and `N` should be passed as an argument, although it makes it harder for the compiler to parallelize the code if `N` is a variable quantity. – chqrlie Mar 25 '21 at 10:44
    Your approach to OMP timing is questionable: either you are interested in the performance of a single thread of execution and you can get that by not parallelizing the code generation, or you want to assess the efficiency of the OMP code generator and you should report both the wall clock timing **and** the actual number of threads requested and used. Just dividing by that number is removing a crucial piece of information. You should actually multiply the timing by the actual number of threads and compare to the single thread timing to see if OMP is efficient or even useful at all. – chqrlie Mar 25 '21 at 10:54
  • Is there an easy way to get the number of threads requested and used? – Antonin GAVREL Mar 25 '21 at 21:39
1

In POSIX, try `getrusage`. The relevant argument is `RUSAGE_SELF` and the relevant fields are `ru_utime.tv_sec` and `ru_utime.tv_usec`.

lhf
0

There may be existing utilities that help with this, but I suspect most will use some kind of sampling or possibly injection. To get specific sections of code timed, you will probably have to add calls to a timer, as you show in your example. If you are using Windows, the high-performance timer works; I answered a similar question with example code that does this. There are similar methods for Linux.

Mark Wilkins