
I have two implementations of a function and would like to see which is faster. Call them foo1() and foo2(). I have a set of test cases I would like to run against both of them. The test cases are stored in an array, and I don't want to include the time spent accessing the array. Here is my code. It doesn't compile on the line `duration += std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 )`, with the error message:

error: no match for 'operator<<' (operand types are 'std::basic_ostream<char>' and 'std::chrono::microseconds' {aka 'std::chrono::duration<long long int, std::ratio<1, 1000000> >'})
   52 | std::cout << "duration: " << duration << std::endl;

Code:

/*[first argument, second argument, correct result]*/
int inout[5][3] = {
    {10, 4, 5},
    {21, 4, 6},
    {22, 4, 7},
    {50, 5, 19},
    {100, 5, 7},
    // just example, real one is longer
};

std::chrono::microseconds duration = std::chrono::microseconds::zero();
for(auto& i : inout)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    foo1(i[0], i[1]);
    auto t2 = std::chrono::high_resolution_clock::now();
    duration += std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count(); // this won't compile
}
std::cout << "duration of foo1(): " << duration << std::endl;

duration = std::chrono::microseconds::zero();
for(auto& i : inout)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    foo2(i[0], i[1]);
    auto t2 = std::chrono::high_resolution_clock::now();
    duration += std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count(); // this won't compile
}
std::cout << "duration of foo2(): " << duration << std::endl;

Any suggestions on anything here? I was thinking it's more meaningful to test the speed including the effect of cache misses, so should the calls to foo1() and foo2() be interleaved? On the other hand, the goal is to see which is faster, and both would benefit from the cache; however, the bigger one would likely have more cache misses. Also, out of curiosity, what is the type of `auto& i` in the foreach loop?

northerner
  • You should read the error message, which tells you that the compiler doesn't find an overload for `operator<<`. So you have to do `std::cout << duration.count();`. – overseas Apr 13 '20 at 11:04
  • @overseas that produces another error – northerner Apr 13 '20 at 11:11
  • You're right. The problem is that your code did not compile (in the sense that it has syntax errors)! You should really post code that does compile (ideally including a wrapper). – overseas Apr 13 '20 at 11:20
  • Unless `foo1()(i[0], i[1]);` takes several microseconds (tens of thousands of clock cycles), you probably want to put the timing outside the loop to reduce measurement overhead. But you're calling it with two scalar args so that seems unlikely. Also, apparently `foo1()` is a function or macro that returns a function pointer? You have an extra set of empty parens vs. a more normal `foo1(i[0], i[1])`. – Peter Cordes Apr 13 '20 at 13:41
  • @PeterCordes thanks for pointing that out, typo fixed! I'm concerned that if the return value isn't used, the optimizer may do something to mess up the timing like not call the function. Is this a likely possibility? I'm using g++ – northerner Apr 13 '20 at 13:47
  • Microbenchmarking is hard, get used to it :/ "not calling the function" is not a possibility unless you declared it `__attribute__((pure))`; the compiler can't know that it doesn't have visible side effects like calling printf.... Unless the definition is visible, then it already can inline and optimize away the function body. (Or optimize work across iterations, hoisting some setup out of a loop). But if your real use case is iterating this function over an array with the caller handling loading args, benchmark that. – Peter Cordes Apr 13 '20 at 14:09
  • Anyway, you sometimes need inline asm to defeat the optimizer in strategic places, or assign a result to a `volatile int sink` if the cost of an extra store is negligible vs. what the compiler has to do when forced to produce it at all (see the sketch after these comments). [I don't understand the definition of DoNotOptimizeAway](https://stackoverflow.com/q/52203710) (Compiling with optimization disabled is 100% pointless; that introduces different bottlenecks and different code slows down by a different amount.) See also [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) for pitfalls – Peter Cordes Apr 13 '20 at 14:09
  • @PeterCordes what I'm trying to benchmark is [this specialized modulus algorithm](https://en.wikipedia.org/wiki/Lucas%E2%80%93Lehmer_primality_test#Time_complexity). So far it seems to actually run slower than the default one, which is hard to believe. – northerner Apr 13 '20 at 14:13
  • @northerner: for small `int` values like in your example? That doesn't surprise me. An iterative loop that can branch-mispredict seems unlikely to beat a single hardware `div` instruction on a modern CPU like Broadwell or Skylake, or AMD Zen. The hardware dividers in modern CPUs are slower than everything else, but still like 6 cycle throughput, 26 cycle latency on Skylake for example, for `idiv r32` (which is how `int % int` will compile for non-constant divisor). And when timing each one separately, out-of-order exec of the division can also overlap with the timing overhead. – Peter Cordes Apr 13 '20 at 14:26
  • @PeterCordes that's a great point. Sometimes when I time the default `%` operator it clocks in at 0 nanoseconds, followed by 997500 nanoseconds (iterated over 1000 times). I guess because of what you describe it's possible for something to take 0 ns – northerner Apr 13 '20 at 14:47
  • With the same input values repeatedly? That can all CSE down to one. Or if you aren't using the result, then optimize away entirely. Look at the compiler's asm output. Or you said *milli*-seconds? You know that's like 4 million clock cycles on a 4GHz CPU, right? Back to back `div` can run at one per 6 cycles on Skylake, so you should expect ~6000 cycles = 1.5 us (microseconds). See https://agner.org/optimize/ (and https://uops.info/ as a better source than Agner's instruction tables.) – Peter Cordes Apr 13 '20 at 14:53
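To make the advice in these comments concrete, here is a minimal sketch of the `volatile` sink technique, with the timing moved outside the loop as suggested. The body of `foo1()` is a hypothetical stand-in; only the timing scaffold matters:

#include <chrono>
#include <iostream>

// hypothetical stand-in for the function under test
int foo1(int a, int b) { return a % b; }

int inout[5][3] = {
    {10, 4, 5}, {21, 4, 6}, {22, 4, 7}, {50, 5, 19}, {100, 5, 7},
};

volatile int sink; // a store to a volatile forces the compiler to keep each call

int main() {
    auto t1 = std::chrono::high_resolution_clock::now();
    for (auto& i : inout)        // i has type int (&)[3]
        sink = foo1(i[0], i[1]); // result is "used", so the call can't be dropped
    auto t2 = std::chrono::high_resolution_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1);
    std::cout << "duration of foo1(): " << ns.count() << " ns\n";
}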

2 Answers


This is how it could be:

std::chrono::microseconds duration = std::chrono::microseconds::zero();
for (auto &i : inout) {
  auto t1 = std::chrono::high_resolution_clock::now();
  // foo1(i[0], i[1]);
  auto t2 = std::chrono::high_resolution_clock::now();
  duration += std::chrono::duration_cast<std::chrono::microseconds>(
      t2 - t1);
}
std::cout << "duration of foo1(): " << duration.count() << std::endl;

duration = std::chrono::microseconds::zero();
for (auto &i : inout) {
  auto t1 = std::chrono::high_resolution_clock::now();
  // foo2(i[0], i[1]);
  auto t2 = std::chrono::high_resolution_clock::now();
  duration += std::chrono::duration_cast<std::chrono::microseconds>(
      t2 - t1);
}
std::cout << "duration of foo2(): " << duration.count() << std::endl;

Some explanation: If you have an error message like

<source>:23:14: error: no match for 'operator+=' (operand types are 'std::chrono::microseconds' {aka 'std::chrono::duration<long int, std::ratio<1, 1000000> >'} and 'std::chrono::duration<long int, std::ratio<1, 1000000> >::rep' {aka 'long int'})

   23 |     duration += std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();

You can see that there is no `operator+=` taking `microseconds` and `long int`. So you can go to the documentation, look at which arithmetic operators `duration` provides, and you will find https://en.cppreference.com/w/cpp/chrono/duration/operator_arith3.

There you see that you have to pass another `duration`, so you should not call `count()` in the `+=` line; call it only when printing, as in the code above.
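Note also that if you can compile as C++20, `<chrono>` gains an `operator<<` for durations, so streaming the duration directly compiles there too. A minimal sketch, assuming C++20 library support:

#include <chrono>
#include <iostream>

int main() {
    std::chrono::microseconds duration{42};
    std::cout << duration.count() << " us\n"; // portable since C++11: prints "42 us"
    std::cout << duration << '\n';            // C++20 only: prints the count plus a unit suffix, e.g. "42µs"
}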

overseas

If you don't want to use the chrono facilities, you could use time.h from the C library, which can achieve the same thing.

Here is what I would suggest.

#include <time.h>
#include <cstring>
#include <cstdio>
#include <unistd.h>

void foo1() {
    sleep(5); // stand-in for the real work
}

int main() {
    time_t StartTime, EndTime;

    memset(&StartTime, 0x00, sizeof(StartTime));
    memset(&EndTime, 0x00, sizeof(EndTime));

    time(&StartTime);

    foo1();

    time(&EndTime);

    double secs = difftime(EndTime, StartTime);
    if (secs >= 60) {
        double mins = secs / 60;
        printf("Total Time:%-10.5f MINS\n", mins);
    } else {
        printf("Total Time:%-10.5f SECS\n", secs);
    }
    return 0;
}

This is the observed output:

Total Time:5.00000    SECS

As you can see, this gives the time taken by foo1() to complete; you can do the same for foo2() and then compare the two for benchmarking.
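Note, however, that time() has only one-second resolution, so this approach can only separate functions whose runtimes differ by whole seconds. For finer granularity from the same C library, clock() measures processor time in ticks of CLOCKS_PER_SEC. A minimal sketch, again with a hypothetical foo1():

#include <ctime>
#include <cstdio>

// hypothetical stand-in for the function under test
int foo1(int a, int b) { return a % b; }

volatile int sink; // keeps the optimizer from discarding the calls

int main() {
    std::clock_t start = std::clock();
    for (int n = 1; n <= 1000000; ++n)
        sink = foo1(n, 7);
    std::clock_t end = std::clock();
    double secs = double(end - start) / CLOCKS_PER_SEC;
    std::printf("Total Time:%-10.5f SECS\n", secs);
    return 0;
}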

D.Malim