
The following code reads an array so that it loads one element per cache line (assuming a line is 64 bytes), then calls clflush on each line, and reads the array again. Yet the timings of the second read are shorter, and I wonder why. It looks as if clflush does not invalidate the cache lines.

By the way, do cache lines consist of exactly 64 bytes each? I ask because I have tried changing the step from 16 ints to 32 and even 64, and the second read was still faster.

#include <time.h>
#include <cstdlib>
#include <cstdio>

#define ARR1_LEN 16384

#define PRINT_DUR {\
    printf("%ld - %ld = %ld\n%.20Lf\n", t2, t1, t2-t1, ((long double)(t2 - t1))/CLOCKS_PER_SEC);\
}

#define CLEAR_CACHE {\
    asm("movq %1, %%rcx; movq %0, %%rax; label_%=: clflush (%%rax); addq $64, %%rax; loop label_%= ;"::"r"(arr1), "i"((ARR1_LEN>>5) -1):"rcx", "rax");\
}


int main() {
    int *arr1_ = (int*)malloc(sizeof(int) * ARR1_LEN + 64);
    int temp;
    if (!arr1_) {
        fprintf(stderr, "Memory allocation error\n");
        return 0;
    }
    int *arr1 = (int*)((((size_t)arr1_)+63)&0xffffffffffffffc0);

    clock_t t1, t2;
    t1 = clock();

    for (int i = 0; i < (ARR1_LEN>>4); i++) {
        temp = arr1[i<<4];
    }
    t2 = clock();

//  __builtin___clear_cache(arr1, arr1 + ARR1_LEN -1); // It compiles into nothing at all
    CLEAR_CACHE

    PRINT_DUR

    t1 = clock();
    for (int i = 0; i < (ARR1_LEN>>4); i++) {
        temp = arr1[(i<<4) + 32];
    }
    t2 = clock();

    PRINT_DUR

    free(arr1_);
    return 0;
}
  • BTW, `__builtin___clear_cache` is misnamed. The actual meaning is "sync I-cache", making it safe to execute data (you just stored) as code. x86 has coherent I-caches, so the only effect is to tell the optimizer the store-data is used, not dead. [How to get c code to execute hex machine code?](https://stackoverflow.com/a/55893781) links an example where it's necessary. It's irrelevant to what you're doing, though. – Peter Cordes Dec 22 '20 at 09:12
  • The first read costs page faults or at least TLB misses on top of cache misses. My answer on [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) includes links to some details about that. If you touch one line per page first, or allocate with `mmap(MAP_POPULATE)`, it'll make a difference. And yes, lines are 64 bytes on all modern x86 CPUs. – Peter Cordes Dec 22 '20 at 09:14
  • Also, you concluded that clflush doesn't invalidate anything. To conclude that, you'd have to count cache misses with perf counters, and see that you got the same number with/without clflush. Not just that the 2nd time was faster - you haven't ruled out other reasons. Microbenchmarking is hard! – Peter Cordes Dec 22 '20 at 09:19
    Oh BTW, your inline asm statement is very unsafe: you modify RCX without a clobber, and you modify `%0` even though you asked for it as a read-only input. Use `_mm_clflush` from immintrin.h (and check the generated asm if you want to be sure). Also, don't use the `loop` instruction, it's slow on Intel. – Peter Cordes Dec 22 '20 at 09:21

0 Answers