When I use memcpy to write to the same buffer several times, I see a significant performance difference: the first write to a given address range takes much longer than the second and later writes. The behavior is 100% reproducible.
What could cause such a large difference?
See the following code example (compiles on Windows):
    #include <windows.h>   // LARGE_INTEGER, QueryPerformanceCounter, QueryPerformanceFrequency
    #include <cstdio>      // printf
    #include <cstring>     // memcpy

    // Read the current value of the high-resolution performance counter.
    LARGE_INTEGER getTimeStamp(void)
    {
        LARGE_INTEGER t;
        QueryPerformanceCounter(&t);
        return t;
    }

    // Microseconds elapsed since 'start', rounded to the nearest integer.
    unsigned int getElapsedMicroseconds(LARGE_INTEGER start)
    {
        LARGE_INTEGER end = getTimeStamp();
        LARGE_INTEGER freq;
        QueryPerformanceFrequency(&freq);
        double t = (double)(end.QuadPart - start.QuadPart) * 1000000.0 / (double)freq.QuadPart;
        return (unsigned int)(t + 0.5);
    }

    int main(int argc, char** argv)
    {
        static const size_t singleBuffSize = 36 * 1024 * 1024;
        static const size_t nrOfBuffers = 6;

        unsigned char* srcBuff = new unsigned char[singleBuffSize];
        unsigned char* dstBuff = new unsigned char[singleBuffSize * nrOfBuffers];

        // Copy into each of the 6 destination slots in turn, 3 rounds in total.
        for (size_t i = 0; i < nrOfBuffers * 3; i++)
        {
            size_t buffNr = i % nrOfBuffers;
            size_t buffIdx = buffNr * singleBuffSize;
            LARGE_INTEGER start = getTimeStamp();
            memcpy(&dstBuff[buffIdx], srcBuff, singleBuffSize);
            unsigned int elapsedMicroseconds = getElapsedMicroseconds(start);
            // %zu matches size_t; the original %lu did not match its argument
            printf("Loop %2zu: buffer nr %2zu, elapsed time = %6u microseconds\n", i + 1, buffNr + 1, elapsedMicroseconds);
        }

        delete[] srcBuff;
        delete[] dstBuff;
        return 0;
    }
Example result:
    Loop  1: buffer nr  1, elapsed time =  76207 microseconds
    Loop  2: buffer nr  2, elapsed time =  25552 microseconds
    Loop  3: buffer nr  3, elapsed time =  24200 microseconds
    Loop  4: buffer nr  4, elapsed time =  24036 microseconds
    Loop  5: buffer nr  5, elapsed time =  28470 microseconds
    Loop  6: buffer nr  6, elapsed time =  58528 microseconds
    Loop  7: buffer nr  1, elapsed time =   6428 microseconds
    Loop  8: buffer nr  2, elapsed time =   9324 microseconds
    Loop  9: buffer nr  3, elapsed time =   9389 microseconds
    Loop 10: buffer nr  4, elapsed time =   9434 microseconds
    Loop 11: buffer nr  5, elapsed time =   9641 microseconds
    Loop 12: buffer nr  6, elapsed time =   9953 microseconds
    Loop 13: buffer nr  1, elapsed time =   9488 microseconds
    Loop 14: buffer nr  2, elapsed time =   9834 microseconds
    Loop 15: buffer nr  3, elapsed time =   6211 microseconds
    Loop 16: buffer nr  4, elapsed time =   6282 microseconds
    Loop 17: buffer nr  5, elapsed time =   5950 microseconds
    Loop 18: buffer nr  6, elapsed time =   9570 microseconds
For example, for buffer nr. 1 the first memcpy call (76207 microseconds) takes much longer than the subsequent calls (6428 and 9488 microseconds).
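One experiment that might help narrow this down is to write to every page of the destination before the timed loop starts and then check whether the first round still stands out. The sketch below is only a guess at a useful diagnostic, not a confirmed explanation: it assumes the extra cost comes from the operating system backing freshly allocated pages on first access, and it assumes a 4096-byte page size. The helper names touchPages and timedCopyMicroseconds are hypothetical and not part of the code above; the timer uses std::chrono so the sketch is not tied to Windows.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper: store one byte in every page of [buf, buf + size) so
// that each page has been accessed once before any timed copy runs.
// Returns the number of pages touched.
// ASSUMPTION: 4096-byte pages; query the real page size on the target system.
std::size_t touchPages(unsigned char* buf, std::size_t size, std::size_t pageSize = 4096)
{
    assert(pageSize > 0);  // a zero stride would never terminate
    std::size_t touched = 0;
    for (std::size_t offset = 0; offset < size; offset += pageSize)
    {
        // The volatile store keeps the compiler from optimizing the write away.
        static_cast<volatile unsigned char*>(buf)[offset] = 0;
        ++touched;
    }
    return touched;
}

// Hypothetical helper: time a single memcpy with a portable monotonic clock
// and return the elapsed time in whole microseconds.
long long timedCopyMicroseconds(unsigned char* dst, const unsigned char* src, std::size_t size)
{
    const auto start = std::chrono::steady_clock::now();
    std::memcpy(dst, src, size);
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

In the program above this would mean calling touchPages(dstBuff, singleBuffSize * nrOfBuffers) once before the loop (and possibly touchPages(srcBuff, singleBuffSize) as well, since the source is also freshly allocated). If loops 1 to 6 then take roughly as long as loops 7 to 18, first-access cost would be a plausible candidate; if they do not, the cause is likely something else.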