
Here's perhaps a more advanced question. Suppose you have two functions that return a value:

int F(int input1, int input2)
{
    int output;
    // <Some algorithm that assigns value to output>
    return output;
}

int D(int input1, int input2)
{
    int output;
    // <Another algorithm that assigns value to output>
    return output;
}

with the condition that F(a,b) == D(a,b), i.e. both return the same value for the same inputs.

If you'd like to benchmark their performance, how would you do it? More precisely, how would you isolate the time it takes to perform F(a,b) or D(a,b) such that it does not reflect the time it takes for the other secondary operations in the benchmark setup?

Adl A
  • I just wrote a minimal benchmark for [another question on SO](http://stackoverflow.com/questions/35717467/rounding-integers-routine/35717890#35717890) this morning. Feel free to use it ;) – YSC Mar 01 '16 at 14:08
  • Thank you @YSC. To tell you the truth, this question is a sort of branch of that question. In the [other question](http://stackoverflow.com/questions/35717467/rounding-integers-routine/35717890#35717890) the benchmark framework seemed to use more computation power than the actual function. That is why in this question I specify the need to `isolate the time it takes to perform F(a,b) or D(a,b) such that it does not reflect the time it takes for the other secondary operations in the benchmark setup`. Cheers! – Adl A Mar 01 '16 at 14:21
  • I'm sorry, I didn't notice you were the same OP :-D – YSC Mar 01 '16 at 14:23
  • Keep in mind that the performance in a real application can give totally different results than a benchmark (i.e. with a tight loop). – Eiko Mar 01 '16 at 14:24
  • @Eiko thank you for the heads-up. I'll keep that in mind. Cheers! – Adl A Mar 01 '16 at 14:26

1 Answer


One of the best available open-source solutions is [Google Benchmark](https://github.com/google/benchmark).

You have to create simple wrappers around the code you want to benchmark and link, either statically or dynamically, against the benchmark library. It is often useful to keep such micro-benchmarks compiled alongside your code. For inspiration, there is an awesome presentation on this technique.

#include <benchmark/benchmark.h>

// F and D are the functions from the question, assumed to be declared above.
static void BM_F(benchmark::State& state) {
  const auto input1 = state.range_x();
  const auto input2 = state.range_y();

  // DoNotOptimize is a compiler barrier; without it the optimizer could
  // discard the call entirely, since its result is otherwise unused.
  while (state.KeepRunning())
    benchmark::DoNotOptimize(F(input1, input2));
}

static void BM_D(benchmark::State& state) {
  const auto input1 = state.range_x();
  const auto input2 = state.range_y();

  while (state.KeepRunning())
    benchmark::DoNotOptimize(D(input1, input2));
}

// Register each benchmark over several input pairs.
BENCHMARK(BM_F)
    ->ArgPair(1, 10)
    ->ArgPair(10, 100)
    ->ArgPair(100, 1000);

BENCHMARK(BM_D)
    ->ArgPair(1, 10)
    ->ArgPair(10, 100)
    ->ArgPair(100, 1000);

BENCHMARK_MAIN();
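Assuming the library is installed, a build as simple as `g++ -O2 bench.cpp -lbenchmark -lpthread -o bench` (file and binary names assumed here) should do; the framework then runs each registered function over many iterations and prints the per-iteration time for every argument pair of `BM_F` and `BM_D` side by side, which makes comparing the two functions straightforward.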

If you want to measure raw CPU cycles, then your only choice is to use direct CPU instructions. On x86 you can use the Time Stamp Counter (read with the `rdtsc` instruction).

But you should be aware that such a measurement will not withstand context switches performed by the OS or the process jumping between CPUs. Your only choice in such situations is to use an algorithm with a single flow of execution: remember the ID of the CPU and the TSC value before entering the test function, check the ID of the CPU after the test function, and then calculate the difference between the TSC values. You may also set up CPU affinity for your process to pin it to a specific CPU.
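As a minimal sketch of that recipe (assuming x86-64 Linux and a GCC/Clang toolchain; `F` is the function under test from the question), the `__rdtscp` intrinsic is convenient because it returns the TSC value and writes the logical CPU ID into its argument in a single instruction:

#include <sched.h>      // sched_setaffinity, CPU_ZERO, CPU_SET (glibc)
#include <x86intrin.h>  // __rdtscp
#include <cstdint>
#include <cstdio>

int F(int input1, int input2);  // the function under test, defined elsewhere

int main() {
  // Pin the process to CPU 0 so both TSC reads come from the same core.
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(0, &set);
  sched_setaffinity(0, sizeof(set), &set);

  unsigned cpu_before, cpu_after;
  const std::uint64_t start = __rdtscp(&cpu_before);
  volatile int result = F(1, 10);  // volatile keeps the call from being elided
  const std::uint64_t stop = __rdtscp(&cpu_after);
  (void)result;

  if (cpu_before != cpu_after)
    std::puts("CPU changed during the measurement; discard this sample");
  else
    std::printf("F(1, 10) took ~%llu TSC ticks\n",
                (unsigned long long)(stop - start));
  return 0;
}

A single call is usually too short to measure this way; in practice you would loop many times and divide. And, as a comment below points out, the TSC counts reference cycles at a fixed frequency, not actual core clock cycles.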

Another possible, Linux-specific way to benchmark functions is to use the `perf` tool; for example, `perf stat ./bench` reports cycles, instructions, branch misses, and other hardware counters for the whole run.

But in any case, any measurement will add some level of error to the result.

  • Thanks for the suggestion. I'll give it a try. Cheers! – Adl A Mar 01 '16 at 14:24
  • After some research I've concluded that it is simply a statistical shortcut. It does not address one of my main interests in the topic, which is the need to `isolate the time it takes to perform F(a,b) or D(a,b) such that it does not reflect the time it takes for the other secondary operations in the benchmark setup`. Many thanks nevertheless! Cheers! – Adl A Mar 01 '16 at 18:50
  • Well, you should describe your intention in detail. There is no way to measure something without at least the error level of your measuring device. What do you want to get? And why doesn't this "statistical shortcut" suit you? –  Mar 01 '16 at 20:45
  • One reason is compiler optimization, which tends to skip functions without side effects, but as soon as you introduce side effects you introduce secondary operations that are taken into account. Another reason is that the chronometric operations (recording start time and end time) are not negligible and in fact seem to take longer than simple arithmetic operations. – Adl A Mar 02 '16 at 08:57
  • If you want to forbid the compiler from throwing away functions without observable side effects, you can use a compiler barrier. BTW, google/benchmark provides one: `benchmark::DoNotOptimize` (in fact, you can write your own in one inline-assembler string). The presentation I mentioned earlier contains a description of this technique. And if classic chrono operations are too costly for you (syscalls/vsyscalls), then definitely your only choice is to use some hardware counters. For a start you can use the previously mentioned TSC. –  Mar 02 '16 at 10:04
  • `rdtsc` doesn't measure core clock cycles. It measures "reference" cycles, and always ticks at the branded CPU frequency, regardless of turbo / powersave. You can use performance counters instead to count core clock cycles (and uops, and cache misses, etc.) See for example my answer on https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all where I micro-benchmarked some loops with `perf`. You can do that with C instead of asm. – Peter Cordes Nov 09 '17 at 15:30