We are profiling a complex C++ program that performs multiple iterations of an algorithm. Timing is critical and we want to minimise the execution time of each iteration. The algorithm is such that the execution time should be very similar for each iteration, but we find that the execution time decreases with successive iterations. We suspect that the cache is responsible for this but can't fully explain what we see based on our understanding of caches.
We are running the code on an Intel Xeon processor under CentOS 7.6, compiled with g++ 7.3.1.
We managed to demonstrate the behaviour using the simple program shown below:
#include <vector>
#include <fstream>
#include <array>
#include <chrono>
#include <iostream>
#include <cstdint>

int main()
{
    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    const unsigned NUM_BUFFERS = 200;
    const unsigned BUFFER_SIZE_BYTES = 1024 * 1024;
    const unsigned NUM_TRIALS = 50;

    // Allocate the buffers (deliberately left uninitialised)
    std::vector<uint8_t*> buffers;
    for (unsigned buff = 0; buff < NUM_BUFFERS; ++buff)
        buffers.push_back(new uint8_t[BUFFER_SIZE_BYTES]);

    std::vector<double> tAll; // Records execution time for each buffer write
    tAll.resize(NUM_TRIALS * NUM_BUFFERS);
    unsigned indt = 0;

    // For each trial
    for (unsigned indTrial = 0; indTrial < NUM_TRIALS; indTrial++)
    {
        // For all buffers
        for (unsigned indBuffer = 0; indBuffer < NUM_BUFFERS; indBuffer++)
        {
            t1 = std::chrono::high_resolution_clock::now();

            // Increment contents of entire buffer
            uint8_t* p_buff = buffers.at(indBuffer);
            for (unsigned ind = 0; ind < BUFFER_SIZE_BYTES; ind++)
            {
                p_buff[ind]++;
            }

            t2 = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> duration = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);
            tAll.at(indt++) = duration.count();
        }
    }

    // Write execution times to a file, one per line
    std::ofstream fp;
    fp.open("TEST_ARTEMIS.TXT");
    for (unsigned ind = 0; ind < tAll.size(); ind++)
    {
        fp << tAll[ind] << std::endl;
    }
}
This program increments every byte of a series of 200 buffers of 1 MB each. The process is repeated 50 times.
The time for each complete write of each buffer is written to a file. If we plot those times and zoom in on the first 250 buffer writes, we see the following:
The first buffer write takes ~10 ms, the next few take ~3 ms, the next ~200 take ~2.5 ms, and the time then drops to ~2 ms.
We don't think this behaviour can be explained by simple cache behaviour alone: the 200 buffers total 200 MB, far more than the L2/L3 caches can hold, so cache misses and write-backs to main memory should be happening throughout the experiment. It's as though the memory gets 'warmed up' and becomes faster over time.
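One variation we are considering, to separate any one-off cost of the first write to freshly allocated memory from genuine cache effects, is to touch every buffer once before the timed trials begin. A minimal, untested sketch of that change to the allocation loop is shown below (the use of std::memset is just one way to touch every page; we have not measured this yet):

#include <cstring>   // std::memset

    // Allocate the buffers and touch every byte once, so that any one-off
    // cost of first access (e.g. the OS mapping in physical pages) is paid
    // before the timed region starts.
    std::vector<uint8_t*> buffers;
    for (unsigned buff = 0; buff < NUM_BUFFERS; ++buff)
    {
        uint8_t* p = new uint8_t[BUFFER_SIZE_BYTES];
        std::memset(p, 0, BUFFER_SIZE_BYTES);   // pre-fault / warm the buffer
        buffers.push_back(p);
    }

If the first-iteration slowdown disappears with this warm-up pass, that would suggest the effect is tied to first access of the memory rather than to the caches themselves.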
Can anyone suggest an explanation for what we are seeing please?