
I am benchmarking some example functions on my processor, with each core running at 2 GHz. Here are the functions being benchmarked (also available on quick-bench):

#include <benchmark/benchmark.h>
#include <stdlib.h>
#include <time.h>
#include <cstdint>
#include <memory>
#include <string>

class Base
{
  public:       
   virtual int addNumVirt( int x ) { return (i + x); }
   int addNum( int x ) { return (x + i); }
   virtual ~Base() {}

  private:
   uint32_t i{10};
};

class Derived : public Base
{
  public:
   // Overrides of virtual functions are implicitly virtual
   int addNumVirt( int x ) override { return (x + i); }
   int addNum( int x ) { return (x + i); }

  private:
   uint32_t i{20};
};

static void BM_nonVirtualFunc(benchmark::State &state)
{
 srand(time(0));
 volatile int x = rand();
 std::unique_ptr<Derived> derived = std::make_unique<Derived>();
 for (auto _ : state)
 {
   auto result = derived->addNum( x );
   benchmark::DoNotOptimize(result);
 }
}
BENCHMARK(BM_nonVirtualFunc);

static void BM_virtualFunc(benchmark::State &state)
{
 srand(time(0));
 volatile int x = rand();
 std::unique_ptr<Base> derived = std::make_unique<Derived>();
 for (auto _ : state)
 {
   auto result = derived->addNumVirt( x );
   benchmark::DoNotOptimize(result);
 }
}
BENCHMARK(BM_virtualFunc);

static void StringCreation(benchmark::State& state) {
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    std::string created_string("hello");
    // Make sure the variable is not optimized away by compiler
    benchmark::DoNotOptimize(created_string);
  }
}
// Register the function as a benchmark
BENCHMARK(StringCreation);

static void StringCopy(benchmark::State& state) {
  // Code before the loop is not measured
  std::string x = "hello";
  for (auto _ : state) {
    std::string copy(x);
  }
}
BENCHMARK(StringCopy);

Below are the Google Benchmark results.

Run on (64 X 2000 MHz CPU s)
CPU Caches:
  L1 Data 32K (x32)
  L1 Instruction 64K (x32)
  L2 Unified 512K (x32)
  L3 Unified 8192K (x8)
Load Average: 0.08, 0.04, 0.00
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_nonVirtualFunc      0.490 ns        0.490 ns   1000000000
BM_virtualFunc         0.858 ns        0.858 ns    825026009
StringCreation          2.74 ns         2.74 ns    253578500
StringCopy              5.24 ns         5.24 ns    132874574

The results show execution times of 0.490 ns and 0.858 ns for the first two functions. What I do not understand is this: if my core runs at 2 GHz, one cycle takes 0.5 ns, which makes the first result seem unreasonable.

I know that the result shown is an average over the number of iterations, and such a low execution time means that most of the samples are below 0.5 ns.

What am I missing?

Edit 1: From the comments, it seems that adding a constant i to x was not a good idea. In fact, I initially called std::cout in the virtual and non-virtual functions. That helped me understand that virtual functions are not inlined and that the call has to be resolved at run-time.

However, having output in the functions being benchmarked does not look nice on the terminal. (Is there a way to share my code from Godbolt?) Can anyone propose an alternative to printing something inside the function? One idea that follows from the comments is sketched below.
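
Following the DoNotOptimize suggestion from the comments, one quieter alternative seems to be laundering the pointer through benchmark::DoNotOptimize before the loop, so the optimizer can no longer prove the dynamic type and has to keep the virtual dispatch. A sketch only; the name BM_virtualFuncOpaque and the constant 42 are mine, and I have not verified this is airtight:

static void BM_virtualFuncOpaque(benchmark::State &state)
{
  std::unique_ptr<Base> base = std::make_unique<Derived>();
  Base *p = base.get();
  // Make the pointer opaque: the compiler must now assume it could point
  // to any Base, so the call below cannot be devirtualized or inlined.
  benchmark::DoNotOptimize(p);
  for (auto _ : state)
  {
    auto result = p->addNumVirt(42);
    benchmark::DoNotOptimize(result);
  }
}
BENCHMARK(BM_virtualFuncOpaque);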

talekeDskobeDa

  • Most likely the optimizer did its job on your function – stark Nov 29 '19 at 12:42
  • Are you aware of [superscalar execution](https://en.wikipedia.org/wiki/Superscalar_processor) support in modern CPUs? Or are you just surprised that the compiler reduced your `nonVirtualFunc` to a single `add` instruction (see the assembly in the quick-bench link)? – Max Langhof Nov 29 '19 at 12:42
  • One cycle doesn't mean it runs one instruction. Your CPU does _a lot_ of things in a single cycle. – tkausl Nov 29 '19 at 12:43
  • @stark I have called the function benchmark::DoNotOptimize(). Do you mean this is not enough for the compiler? – talekeDskobeDa Nov 29 '19 at 12:44
  • Your CPU does not run only one instruction per clock cycle. You can take a look at [this](https://en.wikipedia.org/wiki/Instructions_per_cycle). – Fareanor Nov 29 '19 at 12:45
  • What do you mean "not enough"? It preserved the variable that you told it to. – stark Nov 29 '19 at 12:51
  • @talekeDskobeDa That just means "compiler, please don't reduce my code to a no-op because it doesn't actually do anything", not "compiler, please don't perform any optimizations". – Max Langhof Nov 29 '19 at 12:51
  • To be sure what code you're actually benchmarking, take a look at the generated assembly. – Evg Nov 29 '19 at 12:55
  • Include at least the important / relevant part of your code in the question itself, not just a link to it. SO questions are supposed to be at least mostly self-contained. – Peter Cordes Nov 29 '19 at 14:40

1 Answer


Modern compilers just do magnificent things. Not always the most predictable things, but usually good things. You can see what is happening either by reading the generated assembly, as suggested, or by reducing the optimization level: -O1 makes nonVirtualFunc equivalent to virtualFunc in terms of CPU time, and -O0 raises all your functions to a similar level (Edit: in a bad way, of course; do not do that to draw actual performance conclusions).
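
To make this concrete: with the definition visible and -O3 enabled, the non-virtual call can be inlined and the constant member folded in, so the measured loop body plausibly reduces to something like the sketch below. This is illustrative only, not actual compiler output; the assembly view on quick-bench is the authoritative reference.

// Roughly what BM_nonVirtualFunc's loop becomes after inlining:
// Derived::addNum(x) collapses to x + 20 (the member i{20} is a
// compile-time constant), leaving about one load of the volatile x
// and one add per iteration, which a superscalar core can retire
// in under one 0.5 ns cycle alongside the loop overhead.
for (auto _ : state)
{
  int result = x + 20;
  benchmark::DoNotOptimize(result);
}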

And yeah, when I first used quick-bench I was confused by "DoNotOptimize" as well. They could have called it "UseResult()" to better signal what it is actually meant to do when benchmarking.
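
For the curious: DoNotOptimize is essentially an empty inline-asm statement that claims to read and possibly modify its argument, so the compiler must keep the value alive even though no instruction is emitted. A simplified sketch of the idea, using that suggested name; this is not the library's exact implementation:

template <class T>
inline void UseResult(T &value)
{
  // The asm body is empty, so no machine code is generated, but the
  // "+r,m" constraint tells the compiler that `value` is read and may
  // be modified, so it cannot be optimized away.
  asm volatile("" : "+r,m"(value) : : "memory");
}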

AlexGeorg
  • Disabling optimization entirely for benchmarking is *terrible* advice; that's not a valid suggestion. Your code will have *different* bottlenecks from storing/reloading all variables between C++ statements. (Store-forwarding latency). This creates all kinds of weird effects, like [Adding a redundant assignment speeds up code when compiled without optimization](//stackoverflow.com/q/49189685). See also [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](//stackoverflow.com/q/53366394) and [this](https://stackoverflow.com/a/32001196/224132). – Peter Cordes Nov 29 '19 at 13:42
  • Use at least `-Og` or `-O1`, but better to use at least `-O2`, preferably `-O3`, and careful use of inline asm macros like `DoNotOptimize` to force the compiler to materialize a value in a register, and/or to forget about a value so it has to recompute something based on it. e.g. as part of a repeat loop to make something take a measurable amount of time. Keep in mind whether you're measuring latency (loop carried dependency) or throughput (independent, with instruction-level parallelism (ILP)) because for superscalar out-of-order CPUs single instructions have throughput != latency. – Peter Cordes Nov 29 '19 at 13:47
  • @PeterCordes Ehm, nowhere was I suggesting he should disable optimization and then draw performance conclusions from that! But he can use that toggle to get an understanding of why the code performs surprisingly well with optimizations. In general, however, I'd say it does not make much sense to benchmark small statements like these that take a couple of nanoseconds. You have a much higher chance of actually optimizing code in your project if you benchmark some major computation as a whole. – AlexGeorg Nov 29 '19 at 13:55
  • You mentioned *`optim=0` raises all your function to a similar level.* without adding any red flags like "(Never do this, the results don't tell you anything about performance when compiling normally)". It's not rare to see people get the mistaken idea that disabling optimization so their benchmark doesn't optimize away is better than nothing. It isn't, so combating this misconception is a good idea IMO. – Peter Cordes Nov 29 '19 at 13:59
  • Honestly, as I even wrote "by _reducing_ the optimization level", I have a hard time bending my mind to misunderstand my post, but as you say... I'll briefly edit. – AlexGeorg Nov 29 '19 at 14:01
  • Anyway yes, a single simple C statement like `x = y+z` is far too small to benchmark *in C*. Depending on context it might optimize into other statements, and even in asm one single number isn't sufficient to characterize the performance of an `add` instruction. The question doesn't include the actual code so I didn't see what was being benchmarked. – Peter Cordes Nov 29 '19 at 14:02
  • From the POV of someone who doesn't understand compilers and benchmarking, it's easy to imagine reading your answer and saying "oh, reducing optimization gives meaningful results because optimization was the problem for my benchmark". We do sometimes get SO questions about benchmark results from people who followed this train of thought, like the one I linked in my first comment. And answers to questions that try to show their way is fast by benchmarking with no optimization. This is a real misunderstanding that exists; thanks for editing. – Peter Cordes Nov 29 '19 at 14:06
  • @AlexGeorg If you take a look at the quick-bench link, the optimization flag is `-O3`. So, yeah I am enabling full optimization, which most likely resulted in the `add` instruction for the non-virtual function – talekeDskobeDa Nov 29 '19 at 14:29
  • @PeterCordes The code is available at this [link](http://quick-bench.com/T64f54l0LEFVn7U80A6XHss8k_4) Also, linked in the question – talekeDskobeDa Nov 29 '19 at 14:32