
For context, this question relates to the blog post on Processor Cache Effects, specifically Examples 1 and 2.

In the code snippet below, I double the step size each iteration, i.e. the number of operations I perform halves each time. Based on the blog post, I expected the average time to complete the loop to remain roughly the same for step sizes 1 through 16. The main intuitions discussed by the author were 1) the majority of the time is spent on memory access (i.e. we fetch, then multiply) rather than on arithmetic operations, and 2) the CPU fetches a whole cache line (64 bytes, i.e. 16 ints) at a time, so steps 1 through 16 touch the same number of cache lines.

I've tried to replicate the experiment on my local machine with the following code. Note that I allocate a new int array for every step size so that each run cannot take advantage of data cached by the previous one. For the same reason, I run the inner for loop for each step size only once (instead of repeating the experiment multiple times).

constexpr long long size = 64 * 1024 * 1024; // 64M ints (256 MB)
for (int step = 1; step <= 1 << 15; step <<= 1) {
    auto* arr = new int[size];
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < size; i += step) {
        arr[i] *= 3;
    }
    auto finish = std::chrono::high_resolution_clock::now();
    auto milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start);
    std::cout << step << " : " << milliseconds.count() << "ms\n";
    // delete[] arr; (updated - see Paul's comment)
}

The result, however, was very different from what the blog post describes.

Without optimization: clang++ -g -std=c++2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp

1 : 222ms
2 : 176ms
4 : 152ms
8 : 140ms
16 : 135ms
32 : 128ms
64 : 130ms
128 : 125ms
256 : 123ms
512 : 118ms
1024 : 121ms
2048 : 62ms
4096 : 32ms
8192 : 16ms
16384 : 8ms
32768 : 4ms

With -O3 optimization: clang++ -g -std=c++2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp -O3

1 : 155ms
2 : 145ms
4 : 134ms
8 : 131ms
16 : 125ms
32 : 130ms
64 : 130ms
128 : 121ms
256 : 123ms
512 : 127ms
1024 : 123ms
2048 : 62ms
4096 : 31ms
8192 : 15ms
16384 : 8ms
32768 : 4ms

Note that I'm running on a 2019 MacBook Pro and my page size is 4096 bytes. From the observations above, it seems that up to a step size of 1024 the time taken remains roughly constant and only then starts halving. Since each int is 4 bytes, 1024 ints correspond exactly to one page (1024 * 4 = 4096), which makes me think this might be some kind of prefetching/page-related effect, even though no optimization flag was specified?

Does anyone have an idea or explanation of why these numbers occur?

Happytreat
  • _Note that I've allocated a new int array for every step size so that they do not take advantage of previous cached data._ But are you certain, though, that `arr` isn't being allocated in the same place every time round the loop? Try not deleting it (just let it leak). – Paul Sanders Jan 26 '23 at 07:43
  • Hey, good point. I've updated to remove `delete[] arr`; however, the numbers from the experiment look pretty much the same (same trend) as the ones above. – Happytreat Jan 26 '23 at 07:53
  • Strange blog ... I would just dump it. Since the array arr is uninitialised, applying *= 3 to its elements is undefined behaviour ... unsure what to expect, as all bets are off? OTOH, if you make it new int[size]{}, then a modern compiler may figure out that multiplying 0 by 3 is pointless. – Öö Tiib Jan 26 '23 at 08:32
  • @ÖöTiib The blog uses C#, not C++. – molbdnilo Jan 26 '23 at 08:35
  • @molbdnilo good catch ... I would still avoid porting C# code to C++, as C++ compilers optimise really aggressively, also using UB as an excuse. – Öö Tiib Jan 26 '23 at 08:40
  • Relevant question: https://stackoverflow.com/q/58247635/580083 – Daniel Langr Jan 26 '23 at 08:47
  • @Happytreat I tried with random numbers to distort the predictability of the code but alas... same story! I think one needs to use a profiler to track the memory reads/writes... and cache misses/hits etc. – Andreas Hadjigeorgiou Jan 26 '23 at 10:08
  • Re: profiling, there is [this](https://www.amd.com/en/developer/uprof.html). I've not used it myself, but maybe it can throw some light on what you are observing. – Paul Sanders Jan 26 '23 at 12:25

1 Answer


In your code, you call new int[size], which is essentially a wrapper around malloc. The kernel does not immediately back it with physical pages, due to the OS's optimistic (lazy) memory allocation strategy (see man malloc for the Linux description; macOS behaves similarly).

What happens when you call arr[i] *= 3 is that a page fault will occur if the page is not in the Translation Lookaside Buffer (TLB). The kernel will check that your requested virtual page is valid but that an associated physical page has not been allocated yet, and it will then assign a physical page to that virtual page.
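
As an illustration (my own sketch, not part of the original answer), you can observe the lazy allocation directly by timing the new separately from a first and a second pass that each write one int per page:

#include <chrono>
#include <iostream>

int main() {
    constexpr long long size = 64 * 1024 * 1024;             // same 256 MB array as in the question
    constexpr long long ints_per_page = 4096 / sizeof(int);  // 1024 ints per 4 KiB page

    auto t0 = std::chrono::high_resolution_clock::now();
    auto* arr = new int[size];                               // should return almost immediately: no physical pages yet
    auto t1 = std::chrono::high_resolution_clock::now();

    for (long long i = 0; i < size; i += ints_per_page)      // first touch: one write per page, takes the page faults
        arr[i] = 1;
    auto t2 = std::chrono::high_resolution_clock::now();

    for (long long i = 0; i < size; i += ints_per_page)      // second pass: pages already backed, no faults
        arr[i] = 2;
    auto t3 = std::chrono::high_resolution_clock::now();

    using ms = std::chrono::milliseconds;
    std::cout << "new int[size] : " << std::chrono::duration_cast<ms>(t1 - t0).count() << "ms\n";
    std::cout << "first touch   : " << std::chrono::duration_cast<ms>(t2 - t1).count() << "ms\n";
    std::cout << "second pass   : " << std::chrono::duration_cast<ms>(t3 - t2).count() << "ms\n";

    std::cout << "arr[0] = " << arr[0] << "\n";              // keeps the optimizer from discarding the writes
    delete[] arr;
}

On a typical Linux or macOS system, the new itself should return almost instantly, the first pass should dominate because it takes all the page faults, and the second pass should be much faster.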

For step = 1024, you are touching every page associated with arr (1024 ints * 4 bytes = 4096 bytes, i.e. exactly one page). For step = 2048, you are touching only every other page of arr.

This act of assigning physical pages is your bottleneck (based on your data, it takes ~120 ms to fault in all of the array's pages one by one). When you increase step from 1024 to 2048, the kernel only needs to allocate physical pages for half of the virtual pages of arr, hence the runtime halves.
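
To make the halving concrete, here is a back-of-the-envelope count (my own sketch, assuming the 4 KiB pages and 4-byte ints reported in the question) of how many distinct pages each step size touches:

#include <algorithm>
#include <iostream>

int main() {
    constexpr long long size = 64 * 1024 * 1024;             // array length from the question
    constexpr long long ints_per_page = 4096 / sizeof(int);  // 1024 ints per 4 KiB page

    for (long long step = 1; step <= 1 << 15; step <<= 1) {
        // For step <= 1024 every page is written; beyond that, only every (step/1024)-th page.
        long long pages = size / std::max(step, ints_per_page);
        std::cout << step << " : " << pages << " pages faulted in\n";
    }
}

The page count stays at 65536 up to step = 1024 and then halves with each further doubling, which matches the shape of the timings above.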

As linked by @Daniel Langr, you'll need to "touch" (write to) each element of arr before the timed loop, or zero-initialize arr with new int[size]{}. This forces the kernel to assign a physical page to every virtual page of arr before the measurement starts, so the timed loop measures memory access rather than page faults.
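
A minimal sketch of that fix (my own variation on the code in the question): write every element once before starting the clock, so the timed loop no longer pays for page faults:

#include <chrono>
#include <iostream>

int main() {
    constexpr long long size = 64 * 1024 * 1024;

    for (long long step = 1; step <= 1 << 15; step <<= 1) {
        auto* arr = new int[size];

        // Warm-up: write every element so each page of arr is backed by a
        // physical page (and the ints are initialized) before the clock starts.
        for (long long i = 0; i < size; ++i)
            arr[i] = 1;

        auto start = std::chrono::high_resolution_clock::now();
        for (long long i = 0; i < size; i += step)
            arr[i] *= 3;
        auto finish = std::chrono::high_resolution_clock::now();

        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(finish - start);
        // Printing arr[0] keeps the optimizer from discarding the timed writes.
        std::cout << step << " : " << ms.count() << "ms (arr[0]=" << arr[0] << ")\n";

        delete[] arr;
    }
}

Zero-initializing with new int[size]{} has the same effect, though, as Öö Tiib notes above, an aggressive optimizer may then notice it is only ever multiplying zeros.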

weineng
  • You're conflating TLB miss with page fault. The kernel is "lazy" about wiring up the page tables, pointing them at either a shared physical page of zeros or (on a write) a newly allocated page. So you get a page fault, which is handled by software (the kernel), not just a TLB miss; the TLB miss is what triggers the hardware page walk. But yes, correct overall point. (Except you need to write 1 element per page; just reading would still leave them CoW mapped to a shared page of zeros.) – Peter Cordes Jan 26 '23 at 18:35
  • @PeterCordes oops you are right about the TLB miss. However, just reading 1 element per page without any writing should also trigger the same behavior since the physical page itself wasn't allocated yet. Or am I missing something? – weineng Jan 26 '23 at 18:41
  • Linux, at least, handles read page faults on untouched anonymous pages (like this, or like the BSS) by filling the page-table entry with a read-only mapping to a system-wide shared physical page of all zeros (4k or 2M). The first write will page-fault again, actually allocating a freshly zeroed page to back the virtual page. (This is what I described as copy-on-write mapping to a shared page of zeros.) [Why is iterating though `std::vector` faster than iterating though `std::array`?](https://stackoverflow.com/a/57130924) shows how you can get TLB misses but L1d or L3 hits. – Peter Cordes Jan 27 '23 at 01:43