
We are profiling a complex C++ program that performs multiple iterations of an algorithm. Timing is critical and we want to minimise the execution time of each iteration. The algorithm is such that the execution time should be very similar for each iteration, but we find that the execution time decreases with successive iterations. We suspect that the cache is responsible for this but can't fully explain what we see based on our understanding of caches.

We are running the code on an Intel Xeon processor under CentOS 7.6, compiled with g++ 7.3.1.

We managed to demonstrate the behaviour using the simple program shown below:

#include <vector>
#include <fstream>
#include <chrono>
#include <cstdint>
#include <iostream>

int main()
{
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    const unsigned NUM_BUFFERS = 200;
    const unsigned BUFFER_SIZE_BYTES = 1024 * 1024;
    const unsigned NUM_TRIALS = 50;

    std::vector<uint8_t*> buffers;

    for (unsigned buff=0; buff<NUM_BUFFERS; ++buff)
        buffers.push_back(new uint8_t[BUFFER_SIZE_BYTES]);

    std::vector<double> tAll;   // Records execution time for each buffer write
    tAll.resize(NUM_TRIALS*NUM_BUFFERS);

    unsigned indt = 0;

    // For each trial
    for ( unsigned indTrial=0; indTrial<NUM_TRIALS; indTrial++ )
    {
        // For all buffers
        for ( unsigned indBuffer=0; indBuffer<NUM_BUFFERS; indBuffer++ )
        {
            t1 = std::chrono::high_resolution_clock::now();

            // Increment contents of entire buffer
            uint8_t* p_buff = buffers.at(indBuffer);
            for ( unsigned ind=0; ind<BUFFER_SIZE_BYTES; ind++ )
            {
                p_buff[ind]++;
            }
            t2 = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> duration = t2 - t1;
            tAll.at(indt++) = duration.count();
        }
    }

    // Write execution times to a file
    std::ofstream fp;
    fp.open("TEST_ARTEMIS.TXT");
    for ( unsigned ind=0; ind<tAll.size(); ind++ )
    {
        fp << tAll[ind] << std::endl;
    }
}

This program increments every byte in each of a series of 200 × 1 MB buffers. The process is repeated 50 times.

The time for each complete write of each buffer is written to a file. If we plot those times, and zoom in to the first 250 buffer writes, we see:

[Plot: execution time of each buffer write, zoomed to the first 250 writes]

The first buffer write takes ~10ms, the next few take ~3ms, the next 200 take ~2.5ms, and the time then drops to 2ms.
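
Incidentally, the same trend can be seen without plotting by averaging the recorded times per trial. A sketch of a snippet that could be appended to the end of the program above (not part of the original program; it assumes the tAll layout shown, where entry indTrial*NUM_BUFFERS + indBuffer holds one write time in seconds):

    // Hypothetical addition: print the average buffer-write time per trial.
    for (unsigned indTrial = 0; indTrial < NUM_TRIALS; indTrial++)
    {
        double sum = 0.0;
        for (unsigned indBuffer = 0; indBuffer < NUM_BUFFERS; indBuffer++)
            sum += tAll[indTrial * NUM_BUFFERS + indBuffer];
        std::cout << "Trial " << indTrial << ": "
                  << (sum / NUM_BUFFERS) * 1000.0 << " ms average" << std::endl;
    }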

We don't think this behaviour can be explained by simple cache behaviour alone: the 200 buffers hold 200 MB of data in total, far more than the L2/L3 caches can contain, so cache misses and write-backs to memory should be happening throughout the experiment. It's as though the memory gets 'warmed up' and gets faster with time.
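
To check whether the slowdown follows a buffer's first use rather than its position in the run, one could time repeated passes over a single freshly allocated buffer. A minimal sketch (illustrative only, not part of our original experiment):

    #include <chrono>
    #include <cstdint>
    #include <iostream>

    int main()
    {
        const unsigned BUFFER_SIZE_BYTES = 1024 * 1024;
        uint8_t* p_buff = new uint8_t[BUFFER_SIZE_BYTES];

        // Time several successive passes over the same fresh buffer. If only
        // the first pass is slow, the cost is tied to the first use of the
        // memory rather than to steady-state cache behaviour.
        for (unsigned pass = 0; pass < 3; pass++)
        {
            auto t1 = std::chrono::high_resolution_clock::now();
            for (unsigned ind = 0; ind < BUFFER_SIZE_BYTES; ind++)
                p_buff[ind]++;
            auto t2 = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> duration = t2 - t1;
            std::cout << "Pass " << pass << ": "
                      << duration.count() * 1000.0 << " ms" << std::endl;
        }

        delete[] p_buff;
    }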

Can anyone suggest an explanation for what we are seeing please?

DavidA
  • A complete *guess*: If your access pattern is predictable, it might be the CPU's prefetcher getting wise to what you are doing. – Jesper Juhl Feb 04 '20 at 16:41
  • Per [this question](https://stackoverflow.com/questions/53517653/in-which-condition-dcu-prefetcher-start-prefetching) about the prefetcher, there is some confidence value, so this kind of warming-up characteristic is plausible. The fine details would depend on the exact architecture. – Useless Feb 04 '20 at 16:47
  • It might also be the CPU frequency increasing to handle the demand. – 1201ProgramAlarm Feb 04 '20 at 16:47
  • The branch predictor should be warmed up way earlier in the process, and I can't think of anything else with a similar pattern. And yes, you need to disable frequency scaling entirely if you want your timings to mean something. – Useless Feb 04 '20 at 16:48
  • Thanks very much for your comments. – DavidA Feb 04 '20 at 16:52
  • On Windows with VS2015 I see really predictable behavior. – TonySalimi Feb 04 '20 at 17:01
  • I've experimented with this a bit on Windows with VS2019 and get the same effect. The size of the buffer does not matter, and it is always when it starts the second iteration of the outer loop when you get that last speed drop. The size of the drop is larger for me, though, where the drop in the displayed results changes from 0.0003 to 0.00009 (9e-5). – 1201ProgramAlarm Feb 04 '20 at 17:04
  • If I add a `memset` to zero out the newly allocated memory, that slow initial loop goes away - they're all the same. – 1201ProgramAlarm Feb 04 '20 at 17:19
  • Regarding prefetcher / branch predictor: In this example code the processing in the inner loop is very simple. But in our real program the code in the (equivalent of the) inner loop is much larger, involving multiple stages of different processing. Can the prefetcher or branch predictor still be an explanation in this case? My understanding is that they only work over small sections of code. – DavidA Feb 04 '20 at 21:09
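
For reference, the memset experiment 1201ProgramAlarm describes above would correspond to a change like this in the allocation loop of the test program (a sketch, not verified here; std::memset needs <cstring>):

    // Touch every page of each buffer once, immediately after allocation,
    // so the first timed pass no longer pays the first-use cost.
    for (unsigned buff = 0; buff < NUM_BUFFERS; buff++)
    {
        uint8_t* p_buff = new uint8_t[BUFFER_SIZE_BYTES];
        std::memset(p_buff, 0, BUFFER_SIZE_BYTES);
        buffers.push_back(p_buff);
    }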

0 Answers