2

My goal is to measure the effect of (different) cache(s) using a simple code. I'm following this article, specifically page 20 and 21: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

I'm working on a 64-bit linux. L1d cache is 32K, L2 is 256K, and L3 is 25M.

This is my code (I compile this code with g++ with no flags):

#include <iostream>

// ***********************************
// This is for measuring CPU clocks
#if defined(__i386__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}
#elif defined(__x86_64__)
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}
#endif
// ***********************************


static const int ARRAY_SIZE = 100;

struct MyStruct {
    struct MyStruct *n;
};

int main() {
    MyStruct myS[ARRAY_SIZE];
    unsigned long long cpu_checkpoint_start, cpu_checkpoint_finish;

    //  Initializing the array of structs, each element pointing to the next 
    for (int i=0; i < ARRAY_SIZE - 1; i++){
        myS[i].n = &myS[i + 1];
        for (int j = 0; j < NPAD; j++)
            myS[i].pad[j] = (long int) i;
    }
    myS[ARRAY_SIZE - 1].n = NULL;   // the last one
    for (int j = 0; j < NPAD; j++)
        myS[ARRAY_SIZE - 1].pad[j] = (long int) (ARRAY_SIZE - 1);

    // Filling the cache
    MyStruct *current = &myS[0];
    while ((current = current->n) != NULL)
        ;

    // Sequential access
    current = &myS[0];

    // For CPU usage in terms of clocks (ticks)
    cpu_start = rdtsc();

    while ((current = current->n) != NULL)
        ;

    cpu_finish = rdtsc();

    unsigned long long avg_cpu_clocks = (cpu_finish - cpu_start) / ARRAY_SIZE;

    std::cout << "Avg CPU Clocks:   " << avg_cpu_clocks << std::endl;
    return 0;
}

I have two problems:

1- I varied ARRAY_SIZE from 1 to 1,000,000 (so the size of my array ranges between 2B to 2MB), but the average CPU clock is always 10.

According to that PDF (figure 3-10 on page 21), I would have expected to get 3-5 clocks when the array can fit entirely into L1, and get higher numbers (9 cycles) when it exceeds L1's size.

2- If I increase ARRAY_SIZE beyond 1,000,000, I'll get segmentation fault (core dumped), which is due to stack overflow. My question is whether using dynamic allocation (MyStruct *myS = new MyStruct[ARRAY_SIZE]) does not incur any performance penalty.

Hadi Brais
  • 22,259
  • 3
  • 54
  • 95
narengi
  • 1,345
  • 3
  • 17
  • 38
  • 3
    You need to ask your compiler to optimize for benchmarking purposes. So compile with `g++ -O2 -Wall -mtune=native`; don't use `rdtsc` but read [time(7)](http://man7.org/linux/man-pages/man7/time.7.html) – Basile Starynkevitch May 12 '15 at 16:39
  • @BasileStarynkevitch I used these flags, and now the average cpu clocks is 4, still regardless of the length of the array. This is suggesting that either everything could fit in L1d (which is not the case) or prefetching does an excellent job (which I doubt). – narengi May 12 '15 at 17:27
  • @BasileStarynkevitch Also, could you please tell me why rdtsc is not suitable for this purpose? – narengi May 12 '15 at 17:27
  • 2
    @narengi: rdtsc isn't reliable, because it's different for each core. If it checks one core, does math, then checks the other core, it could report that it completed in -100 ticks. – Mooing Duck May 12 '15 at 17:42
  • 2
    As for 4 clocks, your while loop does literally nothing at all, so the compiler removed it for you. Benchmarking is hard. – Mooing Duck May 12 '15 at 17:44
  • @MooingDuck I edited the code to avoid having a dummy while loop. Regarding the `rdtsc`, i used `clock()` instead of it, but got very low numbers (0.002 per each iteration of the loop). I tried to pin the current process to a particular core using `sched_setaffinity()`, but `rdtsc` numbers didn't change. – narengi May 12 '15 at 18:24
  • Most processors (all new ones since about 2010) have "Invariant TSC" anyway, so you can forget about the affinity business these days. You should however serialize it with respect to the code you're measuring (you can use `lfence`, even though that makes no sense), or use `rdtscp` (which does that automatically). – harold May 12 '15 at 19:05

1 Answers1

3

This is my code (I compile this code with g++ with no flags)

If you don't pass -O3, then while ((current = current->n) != NULL) will be compiled in to multiple memory accesses, not a single load instruction. By passing -O3, the loop will be compiled into:

.L3:
mov     rax, QWORD PTR [rax]
test    rax, rax
jne     .L3

This will run at 4 cycles per iteration as you are expecting.

Note that you can use the __rdtsc compiler intrinsic instead of inline assembly. See: Get CPU cycle count?.

Hadi Brais
  • 22,259
  • 3
  • 54
  • 95
  • Thank you for your response. You mentioned that my method of measurement is not reliable. I changed rdtsc to rdtscp in my code, and still get the same results. What is the proper method for benchmarking? I just came across intel's PCM tool, but had some problems getting it to work. I would most certainly be interested in a tool which gives me cache hit/miss cnts, clock per instruction, ... . Do you of any tool/library? – narengi May 12 '15 at 23:34
  • @narengi I've updated the answer in case you are still interested. – Hadi Brais Feb 27 '19 at 00:21