Benchmarking code - am I doing it right?

Question

I want to benchmark some C/C++ code. I want to measure CPU time, wall time, and cycles/byte. I wrote some measurement functions but have a problem with cycles/byte.

To get CPU time I wrote a function that calls getrusage() with RUSAGE_SELF; for wall time I use clock_gettime with CLOCK_MONOTONIC; to get cycles/byte I use rdtsc.
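For reference, the two time sources named above could be wrapped roughly like this. It is only a sketch, assuming a POSIX system: the helper names wall_seconds and cpu_seconds are made up for illustration and are not the walltime/cputime functions used in the question, whose code is not shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Wall-clock seconds read from a monotonic clock, so the value never
   jumps backwards when the system time is adjusted. */
static double wall_seconds(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* User plus system CPU seconds consumed by this process so far. */
static double cpu_seconds(void)
{
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec * 1e-6
         + (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec * 1e-6;
}
```

An elapsed time is then the difference between two readings taken before and after the code of interest.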

I process an input buffer of some size, for example 1024: char buffer[1024]. This is how I benchmark:

First, a warm-up phase: simply call fun2measure(args) 1000 times:

for(int i=0; i<1000; i++) fun2measure(args);

Then the actual timing run, for wall time:

`unsigned long i;
double timeTaken;
double timeTotal = 3.0; // process 3 seconds

for (timeTaken=(double)0, i=0; timeTaken <= timeTotal; timeTaken = walltime(1), i++) fun2measure(args);`

And for CPU time (almost the same):

for (timeTaken=(double)0, i=0; timeTaken <= timeTotal; timeTaken = cputime(1), i++) fun2measure(args);

But when I want to get a CPU cycle count for the function, I use this piece of code:

`// run bounded by wall time
unsigned long s = cyclecount();
for (timeTaken=(double)0, i=0; timeTaken <= timeTotal; timeTaken = walltime(1), i++)
{
    fun2measure(args);
}
unsigned long e = cyclecount();

// run bounded by CPU time
unsigned long s = cyclecount();
for (timeTaken=(double)0, i=0; timeTaken <= timeTotal; timeTaken = cputime(1), i++)
{
    fun2measure(args);
}
unsigned long e = cyclecount();`

and then compute cycles/byte: (e - s) / (i * inputsSize). Here inputsSize is 1024 because that is the length of the buffer. But when I raise timeTotal to 10s I get strange results:

for 10s:

Did fun2measure 1148531 times in 10.00 seconds for 1024 bytes, 0 cycles/byte [CPU]
Did fun2measure 1000221 times in 10.00 seconds for 1024 bytes, 3.000000 cycles/byte [WALL]

for 5s:

Did fun2measure 578476 times in 5.00 seconds for 1024 bytes, 0 cycles/byte [CPU]
Did fun2measure 499542 times in 5.00 seconds for 1024 bytes, 7.000000 cycles/byte [WALL]

for 4s:

Did fun2measure 456828 times in 4.00 seconds for 1024 bytes, 4 cycles/byte [CPU]
Did fun2measure 396612 times in 4.00 seconds for 1024 bytes, 3.000000 cycles/byte [WALL]

My questions:

Are those results ok?
Why do I always get 0 cycles/byte for CPU when I increase the time?
How can I compute statistics such as the mean and standard deviation for this kind of benchmark?
Is my benchmarking method 100% ok?

CHEERS!

1st EDIT:

After changing i to double:

Did fun2measure 1138164.00 times in 10.00 seconds for 1024 bytes, 0.410739 cycles/byte [CPU]
Did fun2measure 999849.00 times in 10.00 seconds for 1024 bytes, 3.382036 cycles/byte [WALL]

My results seem to be OK, so question #2 isn't a question anymore :)
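The 0 in the earlier output is what integer division produces whenever the true value is below 1. A small sketch of the difference; the function names and the figures used below (512000 cycles, 1000 calls) are made up for illustration and are not taken from the measurements above:

```c
#include <assert.h>
#include <stdint.h>

/* Integer division truncates toward zero, so any result below
   1 cycle/byte is reported as 0. */
static uint64_t cycles_per_byte_truncated(uint64_t cycles,
                                          uint64_t iterations,
                                          uint64_t bytes_per_call)
{
    assert(iterations > 0 && bytes_per_call > 0);
    return cycles / (iterations * bytes_per_call);
}

/* Converting to double before dividing keeps the fractional part. */
static double cycles_per_byte(uint64_t cycles,
                              uint64_t iterations,
                              uint64_t bytes_per_call)
{
    assert(iterations > 0 && bytes_per_call > 0);
    return (double)cycles / ((double)iterations * (double)bytes_per_call);
}
```

With 512000 cycles spread over 1000 calls of 1024 bytes each, the first function yields 0 and the second yields 0.5.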

Be careful to use floating point division when you are calculating cycles/byte — Vaughn Cato, Jul 25 '13 at 14:16
@VaughnCato: why? should I use `i=1` instead? You mean that I probably deal with `zero division error` here? — nullpointer, Jul 25 '13 at 14:17
If you don't use floating point division, then a value less than one will be rounded to zero. — Vaughn Cato, Jul 25 '13 at 14:19
@VaughnCato: issue #2 fixed, many thanks! Could you say something more about the other questions? — nullpointer, Jul 25 '13 at 14:24
Also, be careful of `rdtsc`. There are two major problems with it that I've run into (and maybe more): 1) on many multi-CPU systems, the TSC counters are not kept in sync, so getting migrated to a different CPU between start and end points will give bogus results, and 2) the TSC may reliably (more or less) count cycles, but interrupts, reschedules, etc. mean those cycles may have not all been spent in your code... Still, it can be useful as a ballpark estimate as long as you're aware of the possible issues... — twalberg, Jul 25 '13 at 14:37
@twalberg: so you suggest that I shouldn't use `rdtsc`? Or that I should issue a `cpuid` instruction before it? — nullpointer, Jul 25 '13 at 14:41
@nullpointer I'm not suggesting that you don't use it, just that you make sure you understand its limitations. It's best used for short durations where the chance of being migrated to another CPU or interrupted by something else is minimal, or as just a rough estimate of longer intervals if you have a mostly idle system, and can guarantee either synchronized TSCs or pinning your process to a specific CPU for the duration. — twalberg, Jul 25 '13 at 14:44
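The pinning mentioned in the last comment can be done on Linux with sched_setaffinity. A minimal sketch, assuming Linux with glibc; the function name pin_to_first_allowed_cpu is hypothetical, and picking the first allowed CPU is an arbitrary choice made for the example:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to the first CPU it is currently allowed to
   run on. Returns that CPU's index, or -1 on failure. Linux-specific. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    assert(CPU_COUNT(&allowed) > 0); /* a runnable thread has at least one CPU */

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            /* pid 0 means "the calling thread" */
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

Calling this once before the warm-up phase keeps both rdtsc readings on the same core. It does not stop interrupts or other processes from using that core.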

score 1 · Accepted Answer · answered Aug 08 '13 at 17:55

Your cyclecount benchmark is flawed because it includes the cost of the walltime/cputime function calls. In general, though, I strongly urge you to use a proper profiler instead of trying to reinvent the wheel. Performance counters in particular will give you numbers that you can rely on. Also note that cycle counts are unreliable: the CPU usually does not run at a fixed frequency, and the kernel may switch tasks and suspend your app for some time.
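One way to keep the timer calls out of the measured region is to fix the iteration count up front and read the clock only before and after the loop. A rough sketch, assuming a POSIX system; sum_bytes is a hypothetical stand-in for fun2measure, and the result is nanoseconds per byte rather than cycles per byte:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for fun2measure: sums the bytes of the buffer. */
static unsigned sum_bytes(const unsigned char *buf, size_t len)
{
    unsigned sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Runs the workload `iterations` times and returns elapsed nanoseconds
   per byte. The clock is read exactly twice, both times outside the loop. */
static double ns_per_byte(const unsigned char *buf, size_t len,
                          unsigned long iterations)
{
    /* Storing every result in a volatile stops the compiler from dropping
       the calls entirely; an aggressive optimizer may still hoist work. */
    volatile unsigned sink = 0;
    struct timespec t0, t1;
    assert(len > 0 && iterations > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < iterations; i++)
        sink = sink + sum_bytes(buf, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;

    double elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / ((double)iterations * (double)len);
}
```

The same structure works with a cycle counter in place of clock_gettime; the point is only that nothing but the measured function runs between the two readings.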

I personally write benchmarks so that they run a given function N times, with N large enough to get enough samples. I then apply an external profiler such as Linux perf to get some hard numbers to reason about. By repeating the benchmark a number of times you can calculate stddev/avg values, for example in a script that runs the benchmark a few times and evaluates the profiler's output.

What exactly is your issue? The formulas for these values you can find on Wikipedia. Just run the benchmarks N times and collect all values. Then enter those values in the corresponding formulas... — milianw, Aug 13 '13 at 09:21
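Once the result of each run is collected into an array, these statistics take only a few lines. A sketch using the standard formulas for the mean and the sample variance; the function names are illustrative, and the standard deviation is the square root of the variance (for example via sqrt from math.h):

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
static double sample_mean(const double *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Unbiased sample variance: squared deviations from the mean,
   divided by n - 1. */
static double sample_variance(const double *x, size_t n)
{
    assert(n > 1);
    double m = sample_mean(x, n);
    double ss = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - m;
        ss += d * d;
    }
    return ss / (double)(n - 1);
}
```

For the five samples 1, 2, 3, 4, 5 the mean is 3 and the sample variance is 2.5.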
