
I used Google Benchmark in the following way:

#include <benchmark/benchmark.h>

struct MyFixture : public benchmark::Fixture {

  void SetUp(const ::benchmark::State& state) override {
      // do setup
  }

  void TearDown(const ::benchmark::State& state) override {
  }
};


BENCHMARK_DEFINE_F(MyFixture, Test1)(benchmark::State& st) {

    for (auto _ : st) {
        // algorithm 1
    }
}
BENCHMARK_REGISTER_F(MyFixture, Test1)->Arg(8);

BENCHMARK_DEFINE_F(MyFixture, Test2)(benchmark::State& st) {

    for (auto _ : st) {
        // algorithm 2
    }
}
BENCHMARK_REGISTER_F(MyFixture, Test2)->Arg(8);

BENCHMARK_MAIN();
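
For reference, since the algorithm bodies above are elided, here is a stripped-down, self-contained sketch in the same shape. The SketchFixture name, its data vector, and the std::accumulate call are placeholders for illustration only, not my real code; the sketch reads the Arg(8) value through state.range(0) and wraps each result in benchmark::DoNotOptimize so the compiler cannot discard the work:

#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

struct SketchFixture : public benchmark::Fixture {
  std::vector<double> data;

  void SetUp(const ::benchmark::State& state) override {
      data.assign(static_cast<size_t>(state.range(0)), 1.0);  // size comes from ->Arg(8)
  }

  void TearDown(const ::benchmark::State& state) override {
      data.clear();
  }
};

BENCHMARK_DEFINE_F(SketchFixture, Placeholder)(benchmark::State& st) {
    for (auto _ : st) {
        double sum = std::accumulate(data.begin(), data.end(), 0.0);  // placeholder work
        benchmark::DoNotOptimize(sum);  // keep the compiler from discarding the result
    }
}
BENCHMARK_REGISTER_F(SketchFixture, Placeholder)->Arg(8);

BENCHMARK_MAIN();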

I then wrote a raw timing loop in the following way:

#include <chrono>
#include <iostream>

struct MyFixture {

  void SetUp(int n = 8) {
      // do setup
  }

  void TearDown() {
  }
};

int main() {
   double totalCount = 0;

   for (int i = 0; i < 1000000; i++) {
       MyFixture f;
       f.SetUp(8);
       
       auto start = std::chrono::high_resolution_clock::now();
       //algorithm 1
       auto end = std::chrono::high_resolution_clock::now();
       totalCount += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
   }

   std::cout << "algorithm 1 total ns: " << totalCount << "\n";

   totalCount = 0;
   for (int i = 0; i < 1000000; i++) {
       MyFixture f;
       f.SetUp(8);
       
       auto start = std::chrono::high_resolution_clock::now();
       //algorithm 2
       auto end = std::chrono::high_resolution_clock::now();
       totalCount += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
   }

   std::cout << "algorithm 2 total ns: " << totalCount << "\n";

   return 0;
}
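
As a sanity check on the raw-loop approach, here is a minimal, self-contained sketch of the same idea with a guard against dead-code elimination: each result is fed into a volatile sink so the compiler cannot drop the timed work, and the per-iteration average is printed. The std::vector setup and the std::accumulate call are placeholders, not my real fixture or algorithm:

#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    constexpr int kIters = 1000000;
    volatile double sink = 0;   // keeps the compiler from removing the timed computation
    double totalNs = 0;

    for (int i = 0; i < kIters; i++) {
        std::vector<double> data(8, 1.0);   // placeholder for MyFixture::SetUp(8)

        auto start = std::chrono::high_resolution_clock::now();
        double result = std::accumulate(data.begin(), data.end(), 0.0);  // placeholder for the algorithm
        auto end = std::chrono::high_resolution_clock::now();

        sink = result;   // observable side effect: the work cannot be optimized away
        totalNs += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    }

    std::cout << "average ns per iteration: " << totalNs / kIters << " (sink=" << sink << ")\n";
    return 0;
}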

The results from Google Benchmark and from the raw loops are completely different.

In Google Benchmark, algorithm 1 is 8 times faster than algorithm 2.

However, in the raw loops, algorithm 1 is 3 times slower than algorithm 2.

What are the possible reasons for this? Which result should I trust?

I should trust the raw-loop version, right? So what could possibly be wrong with Google Benchmark? (Or am I not using Google Benchmark correctly?)

Thanks.

Chou Tan
  • Benchmarking C++ code is a complicated affair. What compiler flags and optimization level are you using? It's possible the compiler is totally optimizing one of the loops away in your raw-loop attempt, or something like that. Google Benchmark takes pains to prevent that sort of thing from happening. – Miles Budnek Oct 08 '22 at 10:04
  • That's not the case here: in Google Benchmark I made sure DoNotOptimize is called, and in my loops I actually printed out the sum of the final results to make sure the algorithms really ran and ran correctly. There are really some mysteries in Google Benchmark, because I do think the result of my raw loop is probably more reasonable, and the result of Google Benchmark is unreasonable most of the time for my test cases. – Chou Tan Oct 08 '22 at 11:01
  • I tried switching the order of the test cases in Google Benchmark and the results were the same, which means the results out of Google Benchmark are very stable; that makes me quite curious why (for example, possible cache misses?). – Chou Tan Oct 08 '22 at 11:03
  • Since we cannot see what algorithm 1 and algorithm 2 actually are, we can only offer wild guesses and fruitless speculation, but a whole lot of those. – n. m. could be an AI Oct 08 '22 at 11:09
  • @n.1.8e9-where's-my-sharem. The actual algorithm is hard to explain. It's something like the adol-c library (automatic differentiation), as in https://github.com/coin-or/ADOL-C/tree/stable/2.0/adolc: one is the original adol-c algorithm, and one is an (attempted) optimization of it. Algo1 and Algo2 both do the same decimal calculations overall, plus memory allocation and copying memory arrays around. It really bothers me because one benchmark says the optimization is better and the other says the original is better... – Chou Tan Oct 08 '22 at 11:52
  • The only way to tell for sure what's going on is to look at the assembly the compiler generated in each case, and to do that we would need to see the entire program as well as to know exactly what compiler and compiler flags you used to compile it. – Miles Budnek Oct 08 '22 at 12:55
  • Microbenchmarking is hard, and we already have a Q&A about generic gotchas ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)). Without a [mcve] including details on compiler version/options, and/or disassembly of the loops you're testing, we can't say anything specific about whether you're measuring anything realistic in any of your versions, so this question is unanswerable beyond those generalities. (That's why I closed it.) – Peter Cordes Oct 08 '22 at 17:40

0 Answers