
As an exercise, I am trying to measure the efficiency of two algorithms that should perform the same task, sorting a stack, using only one additional stack as a supporting data structure:

#include <stack>
#include <iostream>
#include <chrono>

std::stack<int> sortStack(std::stack<int>& inS){
  std::stack<int> tmpS;
  int tmpV=0;
  tmpS.push(inS.top());
  inS.pop();

  while(!inS.empty()){
    if(inS.top()>=tmpS.top()){
      tmpS.push(inS.top());
      inS.pop();
    }else{
      tmpV = inS.top();
      inS.pop();
      int count = 0;

      //reverse the stack until we find the item that is smaller
      while(!tmpS.empty()){
        if(tmpS.top()>tmpV){
          inS.push(tmpS.top());
          tmpS.pop();
          count++;
        }else{
          break;
        }
      }
      //tmpS is now empty, or tmpS.top() is smaller than (or equal to) tmpV
      tmpS.push(tmpV);

      //and revert the other stack
      for(int i=0; i< count; i++){
        tmpS.push(inS.top());
        inS.pop();
      }
    }
  }

  return tmpS;
}

std::stack<int> sortStackRevisited(std::stack<int>& inS){
  std::stack<int> tmpS;
  int tmpV=0;

  while(!inS.empty()){
    tmpV = inS.top();
    inS.pop();
    //reverse the stack until we find the item that is smaller
    while(!tmpS.empty() && tmpS.top()>tmpV){
      inS.push(tmpS.top());
      tmpS.pop();
    }
    tmpS.push(tmpV);
  }

  return tmpS;
}
int main(){

  using namespace std::chrono;
  std::stack<int> s1({1,0,123,3,5,89,23,12,1000});

  std::stack<int> s({1,0,123,3,5,89,23,12,1000});

  auto t1 = high_resolution_clock::now() ;

  std::stack<int> ordered = sortStackRevisited(s);

  auto t2 = high_resolution_clock::now() ;

  std::stack<int> ordered2 = sortStack(s1);

  auto t3 = high_resolution_clock::now() ;

  std::cout<<"\n";
  std::cout<<duration_cast<microseconds>(t2-t1).count()<<" "<<duration_cast<microseconds>(t3-t2).count()<<"\n";
}

Running this program, I consistently find that t2 - t1 is bigger than t3 - t2. If I change the order of the function calls, i.e.:

  auto t1 = high_resolution_clock::now() ;

  std::stack<int> ordered = sortStack(s);

  auto t2 = high_resolution_clock::now() ;

  std::stack<int> ordered2 = sortStackRevisited(s1);

  auto t3 = high_resolution_clock::now() ;

I still get that t2 - t1 is bigger than t3 - t2. What is happening? Am I missing something?

To compile I use `g++ --std=c++11 sortStack.cpp`.

BiA
  • Possible duplicate of [Why is it faster to process a sorted array than an unsorted array?](https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array). Edit: Never mind, I thought you were working on the same container twice. – François Andrieux Mar 28 '19 at 15:26
  • What happens with `-O2` or `-O3`? – Quimby Mar 28 '19 at 15:27
  • @FrançoisAndrieux the input stack is the same, but two different objects – BiA Mar 28 '19 at 15:27
  • The code runs in less than a microsecond on my machine, so it's quite hard to see differences in the algorithms. – Sami Kuhmonen Mar 28 '19 at 15:28
  • @Quimby same but faster – BiA Mar 28 '19 at 15:30
  • @SamiKuhmonen you can change microseconds to nanoseconds – BiA Mar 28 '19 at 15:30
  • First, use a much larger data set so you aggregate out OS overhead from your timings. Second, you're measuring different functions. It can be that function 1 is slower than function 2, even though the end result is the same. – NathanOliver Mar 28 '19 at 15:30
  • @FrançoisAndrieux edited – BiA Mar 28 '19 at 15:30
  • @NathanOliver maybe it was not clear from the question, but the behaviour remains if I change the order of execution of the functions – BiA Mar 28 '19 at 15:31
  • @NathanOliver The core of the question is that apparently the result of trying to measure which one is faster depends entirely on which one is called first. – François Andrieux Mar 28 '19 at 15:32
  • In general, when timing stuff like that, you want to run each algorithm many times, measure, e.g., the total time it takes to run the algorithm 1000 times, and then compute the average time for one run from that. Especially with such tiny data sets, each run will be so quick that the result of measuring only a single run will basically just be noise… – Michael Kenzel Mar 28 '19 at 15:32
  • @MichaelKenzel I understand that; the point is that this is not only noise, since it is consistent. I would expect the two results to change every run, whereas the first always takes more time than the second – BiA Mar 28 '19 at 15:35
  • Well, you have to keep in mind that you're (most likely) running on a modern operating system with preemptive multitasking that runs on top of a modern superscalar processor. There's all sorts of effects at play here. The first run will warm up the caches. Since your data and code are tiny, this alone may already cause any subsequent run to be significantly faster (that's why one normally discards the results of the first couple runs to allow for a burn-in period). – Michael Kenzel Mar 28 '19 at 15:49
  • It may also be that startup of the program happens to consistently take up just enough time to allow your thread to get to the start of the first algorithm before it gets suspended. Since the individual runs take almost no time, that could, e.g., cause the first run to be split over two heartbeats while whatever comes second gets to run in a single one. You're measuring the wall time between when execution reaches certain points in your program here (which includes potential time spent running threads from other applications), not the actual number of cycles the CPU spent just in your process… – Michael Kenzel Mar 28 '19 at 15:49
  • @MichaelKenzel thanks, this is going in the right direction. What I find interesting here is that the first function call always takes more time than the second. Running the algorithm several times, as others suggested, doesn't explain the issue; it just covers it. This doesn't seem to be noise, since it happens consistently and doesn't display any random behaviour; it is systematic. It seems as if the first call is warming up something like a cache, which allows the second call to be faster. – BiA Mar 28 '19 at 19:32

1 Answer


Running the code just once is not reliable for benchmarking. Your code (after fixing `main` to return `int`) runs in under a microsecond on my machine (using `-O2`, since it should be optimized).

If I change it to run the code 100,000 times, I get the result that sortStack is faster than sortStackRevisited no matter which order they're run in.

Running sortStack first: 30141 32997

Running sortStackRevisited first: 31244 26642

Even here you can see the variation when running only 100k tests. It's still not very accurate timing.
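
For reference, here is a rough sketch of that kind of repeated-run harness. It assumes the two functions from the question are available (declared below and linked in); the run count, the warm-up pass, and the makeInput helper are my own additions for illustration, not part of the original code.

#include <stack>
#include <chrono>
#include <iostream>

// sortStack and sortStackRevisited as defined in the question.
std::stack<int> sortStack(std::stack<int>& inS);
std::stack<int> sortStackRevisited(std::stack<int>& inS);

// Helper (not in the original code): both functions empty the stack
// they are given, so the input has to be rebuilt for every run.
std::stack<int> makeInput(){
  return std::stack<int>({1,0,123,3,5,89,23,12,1000});
}

int main(){
  using namespace std::chrono;
  const int runs = 100000;

  // Warm-up: discard a few runs so caches and branch predictors are
  // primed before anything is measured.
  for(int i=0; i<100; i++){
    std::stack<int> a = makeInput();
    sortStack(a);
    std::stack<int> b = makeInput();
    sortStackRevisited(b);
  }

  auto t1 = high_resolution_clock::now();
  for(int i=0; i<runs; i++){
    std::stack<int> s = makeInput();
    sortStack(s);
  }
  auto t2 = high_resolution_clock::now();
  for(int i=0; i<runs; i++){
    std::stack<int> s = makeInput();
    sortStackRevisited(s);
  }
  auto t3 = high_resolution_clock::now();

  std::cout << "sortStack:          "
            << duration_cast<microseconds>(t2-t1).count() << " us\n"
            << "sortStackRevisited: "
            << duration_cast<microseconds>(t3-t2).count() << " us\n";
}

Rebuilding the input inside the loop matters because both functions consume their argument; reusing the same stack object would not measure the same work on every iteration.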

So I'm assuming you're running unoptimized code and doing only one run, which can introduce all kinds of artifacts into the results. One of them is that you're measuring real (wall-clock) time, not CPU time, so anything the OS throws at the CPU(s) your code is running on will make it slower in real time even though the CPU time would be the same. It may also cause the CPU cache to behave differently, and so on. There are a lot of possibilities.
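
One way to see how much of that is going on is to read the process CPU time next to the wall clock and compare the two. The following is only an illustrative sketch (the busy loop is a stand-in workload, not the sorting code), using std::clock from <ctime>:

#include <ctime>
#include <chrono>
#include <iostream>

int main(){
  using namespace std::chrono;

  std::clock_t c1 = std::clock();          // CPU time used by this process
  auto w1 = high_resolution_clock::now();  // wall-clock time

  // Stand-in workload; replace with the code being measured.
  volatile long sink = 0;
  for(long i = 0; i < 10000000; ++i) sink = sink + i;

  std::clock_t c2 = std::clock();
  auto w2 = high_resolution_clock::now();

  double cpuMs  = 1000.0 * (c2 - c1) / CLOCKS_PER_SEC;
  double wallMs = duration_cast<duration<double, std::milli>>(w2 - w1).count();

  // If wallMs is noticeably larger than cpuMs, the process spent part of
  // the interval descheduled (other processes, interrupts, etc.).
  std::cout << "CPU: " << cpuMs << " ms, wall: " << wallMs << " ms\n";
}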

Also, unoptimized code on my machine runs 20x slower than optimized code, which shows there is a lot the compiler can do to improve things and affect the end result.

But running both, for example, a million times will give a relatively good result when comparing them to each other. Both totals will still vary, but relative to each other they stay much more stable. For example, on my machine one time divided by the other is 1.139-1.142, so it's quite clear that the slower method is roughly 14% slower. With just 10 runs the result is that one is 200% slower; with 1000 runs it's 12-28% slower. So more tries bring more accuracy, as statistics tells us.

Sami Kuhmonen
  • Hi Sami, thanks for the answer. The point you are making is exactly what I am asking. Why is it not reliable? What are the artefacts that are causing the problems? – BiA Mar 28 '19 at 15:33