
I read an article from Igor's blog. The article said:

... today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!

The article also provides C# code to verify the above conclusion:

int[] arr = new int[64 * 1024 * 1024];

// Loop 1 (step = 1)
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2 (step = 16)
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

The two for-loops take about the same time: 80 and 78 ms respectively on Igor's machine, so the cache-line mechanism is verified.

I then followed the same idea and implemented a C++ version to verify the cache line size, as follows:

#include "stdafx.h"
#include <iostream>
#include <chrono>
#include <math.h>
using namespace std::chrono;

const int total_buff_count = 16;
const int buff_size = 32 * 1024 * 1024;

int testCacheHit(int * pBuffer, int size, int step)
{
    int result = 0;
    for (int i = 0; i < size;) {
        result += pBuffer[i];
        i += step;
    }

    return result;
}

int main()
{
    int * pBuffer = new int[buff_size*total_buff_count];

    for (int i = 0; i < total_buff_count; ++i) {
        int step = (int)pow(2, i);

        auto start = std::chrono::system_clock::now();
        volatile int result = testCacheHit(pBuffer + buff_size*i, buff_size, step);
        auto end = std::chrono::system_clock::now();

        std::chrono::duration<double> elapsed_seconds = end - start;
        std::cout << "step: " << step << ", elapsed time: " << elapsed_seconds.count() * 1000 << "ms\n";
    }

    delete[] pBuffer;
}

But my test result is totally different from the one in Igor's article. If the step is 1, the time cost is about 114 ms; if the step is 16, the time cost is about 78 ms. The test application is built with the Release configuration, my machine has 32 GB of memory, and the CPU is an Intel Xeon E5-2420 v2 @ 2.2 GHz. The full results are shown below.

(image: C++ test result)

The interesting finding is that the time cost drops significantly when the step is 2 and again when the step is 2048. My questions are: how can the gaps at step = 2 and step = 2048 in my test be explained, and why is my result so different from Igor's? Thanks.

My own explanation for the first question is that the time cost of the code consists of two parts: one is the "memory read/write" cost, and the other is the "other cost" of the loop itself and the arithmetic. If the step is 2, the "memory read/write" cost barely changes (because of the cache line), but the loop and arithmetic cost is halved, so we see an obvious gap. And I guess the cache line on my CPU is 4096 bytes (1024 * 4 bytes) rather than 64 bytes, which is why we get another gap at step 2048. But that is just my guess. Any help is appreciated, thanks.

Jun Ge

3 Answers


Drop between 1024 and 2048

Note that you are using an uninitialized array. This basically means that

int * pBuffer = new int[buff_size*total_buff_count];

does not cause your program to actually ask for any physical memory. Instead, just some virtual address space is reserved.

Then, when you first touch some array element, a page fault is triggered and the OS maps the page to physical memory. This is a relatively slow operation which may significantly influence your experiment. Since the page size on your system is likely 4 kB, one page can hold 1024 4-byte integers. When you go for a step of 2048, only every second page is actually accessed, and the runtime drops proportionally.

You can avoid the negative effect of this mechanism by "touching" the memory in advance:

int * pBuffer = new int[buff_size*total_buff_count]{};

When I tried that, I got an almost linear decrease in time between step sizes of 64 and 8192.
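
If you prefer not to rely on value-initialization, the same effect can be achieved by touching the memory explicitly before the timed section. A minimal sketch of how that could look (it reuses buff_size and total_buff_count from the question and leaves the timing loop out):

#include <cstring>    // std::memset
#include <cstddef>    // std::size_t

const int total_buff_count = 16;            // same constants as in the question
const int buff_size = 32 * 1024 * 1024;

int main()
{
    const std::size_t total = static_cast<std::size_t>(buff_size) * total_buff_count;
    int * pBuffer = new int[total];

    // Touch every page once before the timed loops, so the page-fault cost
    // does not end up inside the measurement. Writing one byte per 4 kB page
    // would be enough; memset over the whole buffer is just the simplest form.
    std::memset(pBuffer, 0, total * sizeof(int));

    // ... run the timing loop with testCacheHit() from the question here ...

    delete[] pBuffer;
}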

Drop between 1 and 2

The cache line size on your system is definitely not 2048 bytes; it is very likely 64 bytes (in general it may differ between systems, and even between cache levels).

As for the first part: with a step of 1, there are simply many more arithmetic operations involved (additions of array elements and increments of i).
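
One way to separate the two effects, not taken from the original experiment but a plausible variation, is to keep the number of iterations the same for every stride, so the loop and addition cost is identical and only the memory access pattern changes. A sketch (testFixedIterations is a made-up name):

// Executes the same number of iterations for every stride, so the loop and
// addition overhead is identical and only the memory stride differs.
// The caller must ensure that iterations * step stays within the buffer,
// e.g. iterations = buff_size / max_step.
int testFixedIterations(const int * pBuffer, int iterations, int step)
{
    int result = 0;
    for (int i = 0, idx = 0; i < iterations; ++i, idx += step)
        result += pBuffer[idx];
    return result;
}

With such a loop, the time per iteration should grow roughly until the stride reaches the cache line size (16 ints for 64-byte lines) and then level off, ignoring prefetching and TLB effects.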

Difference from Igor's experiment

We can only speculate about why Igor's experiment gave practically the same times in both cases. I would guess that the runtime of the arithmetic is negligible there, since there is only a single loop-counter increment involved, and he writes into the array, which requires an additional transfer of cache lines back to memory. (We can say that the byte/op ratio is much higher than in your experiment.)
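
For comparison, here is a straightforward C++ rendering of Igor's two loops (my own translation, with the timing code omitted). Both loops modify the array, so every touched cache line also has to be written back:

#include <vector>
#include <cstddef>

void igorStyleLoops()
{
    std::vector<int> arr(64 * 1024 * 1024);   // ~256 MB of ints, zero-initialized

    // Loop 1 (step = 1): touches every element of every cache line.
    for (std::size_t i = 0; i < arr.size(); ++i)
        arr[i] *= 3;

    // Loop 2 (step = 16): touches one element per 64-byte cache line.
    for (std::size_t i = 0; i < arr.size(); i += 16)
        arr[i] *= 3;
}

Note that std::vector value-initializes its elements, so this version also avoids the page-fault effect described above.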

Daniel Langr

How to verify CPU cache line size with c++ code?

There is std::hardware_destructive_interference_size in C++17 (declared in the <new> header), which should provide the smallest cache line size. Note that it is a compile-time value: the compiler relies on what machine you tell it to target, so when you target an entire architecture family, the number may be inaccurate for the machine the program actually runs on.
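
A minimal sketch, assuming a compiler and standard library that actually implement this C++17 constant:

#include <iostream>
#include <new>   // std::hardware_destructive_interference_size (C++17)

int main()
{
    // A compile-time constant chosen for the CPU the compiler targets,
    // not probed from the machine the binary later runs on.
    std::cout << "hardware_destructive_interference_size: "
              << std::hardware_destructive_interference_size << " bytes\n";
}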

eerorika

How to verify CPU cache line size with c++ code?

You cannot do that reliably.

And you should write portable C++ code. Read n3337.

Imagine that you did not enable compiler optimizations in your C++ compiler. And imagine that your compiled code runs under some emulator (like these).

On Linux specifically, you could parse the /proc/cpuinfo pseudo file and get the CPU cache line size from it (the relevant fields, clflush size and cache_alignment, appear further down than in the excerpt below).

For example:

% head -20 /proc/cpuinfo
processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 23
model       : 8
model name  : AMD Ryzen Threadripper 2970WX 24-Core Processor
stepping    : 2
microcode   : 0x800820b
cpu MHz     : 1776.031
cache size  : 512 KB
physical id : 0
siblings    : 48
core id     : 0
cpu cores   : 24
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca

BTW, there are many different organizations and levels of caches.
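
If staying Linux-specific is acceptable, a small sketch could also read the per-level values that the kernel exports under sysfs (this assumes the usual sysfs layout; index0 is normally the L1 data cache):

#include <fstream>
#include <iostream>

int main()
{
    // Linux-specific: the kernel exports the line size of each cache level
    // under /sys/devices/system/cpu/cpuN/cache/indexM/coherency_line_size.
    std::ifstream f("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size");
    int line_size = 0;
    if (f >> line_size)
        std::cout << "L1d cache line size: " << line_size << " bytes\n";
    else
        std::cout << "could not read the sysfs entry\n";
}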

You could imagine a C++ application on Linux parsing the output of /proc/cpuinfo and then making HTTP requests (using libcurl) to the Web to find out more about that CPU.

See also this answer.

Basile Starynkevitch
  • @JunGe Also note that cache line sizes might generally be different for different cache levels. – Daniel Langr Oct 05 '19 at 11:03
  • @DanielLangr There's a funny thing about that: AMD defined CPUID leaf 0x80000006 such that it only has one field for the cache line size, whereas the non-AMD cache information takes the cache level as a parameter, so it can give different line sizes for different levels. – harold Oct 05 '19 at 11:27