2

I am performing some execution time benchmarks for my implementation of quicksort. Out of 100 successive measurements on exactly the same input data, it seems like the first call to quicksort takes roughly 10 times longer than all subsequent calls. Is this a consequence of the operating system getting ready to execute the program, or is there some other explanation? Moreover, is it reasonable to discard the first measurement when computing an average runtime?

The bar chart below shows execution time (milliseconds) versus method call number. Each time the method is called it processes exactly the same data.

[Bar chart: execution time (ms) vs. method call number]

To produce this particular graph, the main method calls `quicksort_timer::time_fpi_quicksort(5, 100)`, whose implementation can be seen below.

static void time_fpi_quicksort(int size, int runs)
{
    std::vector<int> vector(size);
    for (int i = 0; i < runs; i++)
    {
        // Regenerate exactly the same input (constant seed) before timing starts.
        vector = utilities::getRandomIntVectorWithConstantSeed(size);
        Timer timer; // timing starts here; the input generation above is not timed
        quicksort(vector, ver::FixedPivotInsertion);
    }
}
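
`Timer` is a scope-based timer. A minimal sketch of what such a class might look like, assuming it simply reports elapsed milliseconds when it goes out of scope, is shown below; the exact implementation is not essential to the question.

#include <chrono>
#include <iostream>

// Minimal sketch of a scope-based timer: starts on construction and
// prints the elapsed time in milliseconds on destruction.
class Timer
{
public:
    Timer() : start_(std::chrono::steady_clock::now()) {}
    ~Timer()
    {
        auto end = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> elapsed = end - start_;
        std::cout << elapsed.count() << " ms\n";
    }
private:
    std::chrono::steady_clock::time_point start_;
};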

The `getRandomIntVectorWithConstantSeed` function is implemented as follows:

std::vector<int> getRandomIntVectorWithConstantSeed(int size)
{
    std::vector<int> vector(size);
    srand(6475307); // constant seed: every call produces exactly the same array
    for (int i = 0; i < size; i++)
        vector[i] = rand();
    return vector;
}

CPU and Compilation

CPU: 2.7 GHz Intel Core i5-5257U (Broadwell)

Compiler Version: Apple LLVM version 10.0.0 (clang-1000.11.45.5)

Compiler Options: -std=c++17 -O2 -march=native

  • First, did you compile with optimizations turned on? Second, I've not seen a 10x difference, but normally the first run is slower since the code is "cold". – NathanOliver Mar 13 '19 at 20:25
  • Third, a vector of size 5 really doesn't provide a meaningful benchmark. Use at least a few thousand items. Fourth, how is `getRandomIntVectorWithConstantSeed` implemented? – NathanOliver Mar 13 '19 at 20:28
  • @NathanOliver I have edited my post to include the implementation of `getRandomIntVectorWithConstantSeed`. I need to perform benchmarks for small lists because I want to see at which point my quicksort implementation starts outperforming insertion sort. – K. Claesson Mar 13 '19 at 20:43
  • @NathanOliver No, I did not compile with optimizations turned on. Nevertheless, compiling with optimizations on still results in the first method call taking roughly 10 times longer than the others. – K. Claesson Mar 13 '19 at 21:00
  • Benchmarking with optimization disabled is generally useless. That said, there obviously is an effect here, but there are many startup effects in microbenchmarking. – Peter Cordes Mar 13 '19 at 23:01

2 Answers

3

Yes, it could be a page fault on the page holding the code for the sort function (and the timing code itself). The 10x could also include ramp-up to max turbo clock speed.

Caching is not plausible, though: you're writing the (tiny) array outside the timed region, unless the compiler somehow reordered the init with the constructor of your Timer. Memory allocation being much slower the first time would easily explain it: the first call may have to make a system call to get a new page, but later calls to new (to construct the std::vector) just grab already-hot-in-cache memory from the free list.

Training the branch predictors could also be a big factor, but you'd expect it to take more than one run for the TAGE branch predictors in a modern Intel CPU, or the perceptron predictors in a modern AMD, to "learn" the full pattern of all the branching. But maybe they get close after the first run.

Note that you produce the same random array every time, by calling srand() with a constant seed on every call. To test whether branch prediction is the explanation, remove the srand call so you get different arrays every time, and see if the time stays much higher.
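
For illustration only (a sketch, not part of the question's code): a generator without the fixed seed could look like the following, using `<random>` instead of `rand()`. Any variant that returns different data on each call would do.

#include <random>
#include <vector>

// Sketch of a generator with no fixed seed: every call returns a different
// array, so a later run cannot simply reuse branch-predictor state trained
// on an earlier, identical input.
std::vector<int> getRandomIntVector(int size)
{
    static std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<int> vector(size);
    for (int& value : vector)
        value = dist(rng);
    return vector;
}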

What CPU, compiler version / options, etc. are you using?

– Peter Cordes
  • I tried using different random arrays and the average time stays roughly the same (neither obviously higher nor lower). – K. Claesson Mar 14 '19 at 09:39
  • @K.Claesson: ok, then it is branch prediction. Apparently even 1 iteration is enough to make a huge difference. You can see the 2nd iteration in your plot is still slightly slower than the rest, so it's still learning the pattern. And compiling with optimization disabled is enough of a bottleneck to hide a few mispredicts. (BTW, "core i5" tells us that it's an Intel mainstream CPU of Nehalem or newer, so at least 2008 or so. Sandybridge changed a lot of internals, so just saying "i5" doesn't narrow it down much. Not that it matters in this case, TAGE since SnB with various improvements.) – Peter Cordes Mar 14 '19 at 09:43
  • Given that it is branch prediction that causes the significant performance boost after the first run (or first couple of runs), would you discard the first measurement when calculating the average execution time of the algorithm and present the execution time of the first iteration separately, or would you include the first measurement in the average? Which would be the best way to quantify the "real-world" performance of the algorithm? – K. Claesson Mar 14 '19 at 10:05
  • @K.Claesson: Is the real-world use-case sorting the same input repeatedly? Or is it sorting different inputs? For sorting, it's almost always the variable-input case that's interesting, so you should remove the `srand` from your input generator and microbenchmark with varying data. It's also interesting to measure the already-sorted case, and maybe the warmed-up case. (By contrast, binary searching can be different. Sometimes repeated queries for the same key are common, so warming up the branch predictors for a fixed data-set and key is a "fair" microbenchmark). – Peter Cordes Mar 14 '19 at 10:12
  • The real-world use case is sorting different inputs. I have modified my tests so that I sort a set of different vectors (to minimize the "bias" of the vector) and sort each vector multiple times (to minimize "bias" from the particular run). – K. Claesson Mar 14 '19 at 10:18
  • @K.Claesson: Modern Intel CPUs can "learn" surprisingly long branch-prediction patterns. e.g. I was playing around with a bubblesort (optimized for tiny code size), and tried benchmarking it by copying in the same data repeatedly. I got like 0.1% branch mispredicts IIRC, even with a 16-element array or so on Skylake! So be careful just alternating a couple of different vectors. But if you rotate through copying enough different input vectors (or better, copy slices of a big array into the same `std::vector`, to avoid any alloc/dealloc), you should get "realistic" times (see the sketch below). – Peter Cordes Mar 14 '19 at 10:27
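
As a concrete illustration of that last comment, here is a hypothetical sketch (the function name is invented; it reuses `Timer`, `quicksort`, and `getRandomIntVectorWithConstantSeed` from the question) that pre-generates one big array and copies a different slice of it into the same `std::vector` on each run, so neither allocation nor input generation happens inside the timed region:

#include <algorithm> // std::copy
#include <vector>

// Hypothetical variant of the benchmark loop: each run sorts a different
// slice of one large pre-generated array, and the destination vector's
// buffer is reused, so only the sort itself is timed.
static void time_fpi_quicksort_varied(int size, int runs)
{
    std::vector<int> big = utilities::getRandomIntVectorWithConstantSeed(size * runs);
    std::vector<int> input(size);
    for (int i = 0; i < runs; i++)
    {
        // Copy the i-th slice into the already-allocated vector.
        std::copy(big.begin() + i * size, big.begin() + (i + 1) * size, input.begin());
        Timer timer;
        quicksort(input, ver::FixedPivotInsertion);
    }
}

Each run then sees different (but reproducible) data, so the branch predictors cannot simply replay the previous run.
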
0

It is probably because of caching: the memory needs to be fetched from DRAM and allocated in the CPU's data cache the first time, which takes (much) more latency than loads that hit in the CPU's cache.

Then, as your instructions are in the pipeline, they follow the same branches, since they are the same instructions from the same memory source, and nothing needs to be invalidated because it is the same pointer.

It would be interesting if you implemented four methods with more or less the same functionality and then swapped between them to see what happens.

  • The OP is writing the array right before sorting; it will already be hot in L1d cache. (And registers, which you originally mentioned, aren't plausible; they aren't a cache unless the compiler uses them that way, and that's not plausible unless the benchmark mostly optimized away.) – Peter Cordes Mar 13 '19 at 23:00