I read an article from Igor's blog. The article said:
... today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!
The article also provides c# code to verify above conclusion:
int[] arr = new int[64 * 1024 * 1024];
// Loop 1 (step = 1)
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2 (step = 16)
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
The two for-loops take about the same time: 80 and 78 ms respectively on Igor's machine, so the cache line machanism is verified.
And then I refer the above idea to implement a c++ version to verify the cache line size as the following:
#include "stdafx.h"
#include <iostream>
#include <chrono>
#include <math.h>
using namespace std::chrono;
const int total_buff_count = 16;
const int buff_size = 32 * 1024 * 1024;
int testCacheHit(int * pBuffer, int size, int step)
{
int result = 0;
for (int i = 0; i < size;) {
result += pBuffer[i];
i += step;
}
return result;
}
int main()
{
int * pBuffer = new int[buff_size*total_buff_count];
for (int i = 0; i < total_buff_count; ++i) {
int step = (int)pow(2, i);
auto start = std::chrono::system_clock::now();
volatile int result = testCacheHit(pBuffer + buff_size*i, buff_size, step);
auto end = std::chrono::system_clock::now();
std::chrono::duration<double> elapsed_seconds = end - start;
std::cout << "step: " << step << ", elapsed time: " << elapsed_seconds.count() * 1000 << "ms\n";
}
delete[] pBuffer;
}
But my test result is totally different from the one from Igor's article. If the step is 1 ,then the time cost is about 114ms; If the step is 16, then the time cost is about 78ms. The test application is built with release configuration, there's 32 GB memory on my machine and the CPU is intel Xeon E5 2420 v2 2.2G; the result is the following.
The interesting finding is the time cost decreased significantly when step is 2 and step is 2048. My question is, how to explain the gap when step is 2 and step is 2048 in my test? Why is my result totally different from Igor's result? Thanks.
My own explaination to the first question is, the time cost of the code contains two parts: One is "memory read/write" which contains memory read/write time cost, another is "other cost" which contains for loop and calculation cost. If step is 2, then the "memory read/write" cost almost doesn't change (because of the cache line), but the calculation and loop cost decreased half, so we see an obvious gap. And I guess the cache line on my CPU is 4096 bytes (1024 * 4 bytes) rather than 64 bytes, that's why we got another gap when step is 2048. But it's just my guess. Any help from your guys is appreciated, thanks.