For context, this question relates to the blog post on processor cache effects, specifically Examples 1 and 2.
In the code snippet below, I double the step size on each pass of the outer loop, i.e. the number of operations performed halves each time. Based on the blog post, I expected the average time to complete the loop to stay roughly the same for step sizes 1 through 16. The two main intuitions discussed by the author were: 1) the majority of the time goes to memory access (fetching the value and then multiplying) rather than to the arithmetic itself, and 2) the CPU fetches a whole cache line at a time (64 bytes, i.e. 16 ints).
I've tried to replicate the experiment on my local machine with the following code. Note that I allocate a new int array for every step size so that a run cannot take advantage of data cached by a previous one. For the same reason, I run the inner loop for each step size only once (instead of repeating the experiment multiple times).
#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
    constexpr long long size = 64 * 1024 * 1024; // 64M ints (the array itself is 256 MB)
    for (int step = 1; step <= 1 << 15; step <<= 1) {
        auto* arr = new int[size]; // fresh array for every step size
        auto start = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < size; i += step) {
            arr[i] *= 3;
        }
        auto finish = std::chrono::high_resolution_clock::now();
        auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start);
        std::cout << step << " : " << milliseconds.count() << "ms\n";
        // delete[] arr; (updated - see Paul's comment)
    }
}
The results, however, were very different from what is described in the blog post.
Without optimization: clang++ -g -std=c++2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp
1 : 222ms
2 : 176ms
4 : 152ms
8 : 140ms
16 : 135ms
32 : 128ms
64 : 130ms
128 : 125ms
256 : 123ms
512 : 118ms
1024 : 121ms
2048 : 62ms
4096 : 32ms
8192 : 16ms
16384 : 8ms
32768 : 4ms
With -O3 optimization: clang++ -g -std=c++2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp -O3
1 : 155ms
2 : 145ms
4 : 134ms
8 : 131ms
16 : 125ms
32 : 130ms
64 : 130ms
128 : 121ms
256 : 123ms
512 : 127ms
1024 : 123ms
2048 : 62ms
4096 : 31ms
8192 : 15ms
16384 : 8ms
32768 : 4ms
Note that I'm running this on a 2019 MacBook Pro and my page size is 4096 bytes. From the observations above, the time taken stays roughly flat up to a step size of 1024. Since each int is 4 bytes, a step of 1024 covers exactly one page (1024 * 4 = 4096 bytes), which makes me think this might be some kind of prefetching or page-related effect, even though no compiler optimization was specified.
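If the flat region is dominated by the cost of faulting in the freshly allocated pages (which the OS typically maps in lazily and zero-fills on first touch) rather than by cache misses, then writing to every page before starting the timer should change the numbers noticeably. Below is a rough, untested sketch of the variant I have in mind; the std::fill pre-touch is just one convenient way to fault the pages in, and delete[] is reinstated since the array no longer needs to arrive "cold".

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>

int main() {
    constexpr long long size = 64 * 1024 * 1024; // 64M ints, same as above
    for (int step = 1; step <= 1 << 15; step <<= 1) {
        auto* arr = new int[size];
        // Pre-touch: write every element once so the timed loop no longer
        // pays for demand-zero page faults on first access.
        std::fill(arr, arr + size, 1);
        auto start = std::chrono::high_resolution_clock::now();
        for (std::size_t i = 0; i < size; i += step) {
            arr[i] *= 3;
        }
        auto finish = std::chrono::high_resolution_clock::now();
        auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start);
        std::cout << step << " : " << milliseconds.count() << "ms\n";
        delete[] arr;
    }
}

If the timings change substantially with this pre-touch, that would suggest the measurements above reflect page handling at least as much as cache behaviour.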
Does anyone have an idea or explanation for why these numbers are occurring?