
I wrote a test program for an x86 system. The loop contains four different store statements (commented out below; I uncomment one at a time). With statement 1 uncommented, the result is 3.2 ns per store; the other three statements give 2.2 ns, 3.7 ns, and 2.6 ns respectively. I can't understand these results. I would expect statement 1 to be the fastest, because it stores an immediate value and doesn't need to load a value first like the other statements do.

Why do these four statements run at different speeds? Could anyone explain this? Thanks.

The test program, which I run as `./a.out 0`:

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define BUF_SIZE 8192
#define ROUND 100000000UL
int main(int argc, char **argv)
{
    char *buf, *buf_newaddr, *buf_pageend;
    unsigned long i __attribute__((aligned(64)));
    int buf_realsize;
    unsigned long offset __attribute__((aligned(64)));
    struct timespec start={0,0}, end={0,0};
    double start_ns, end_ns;

    if (argc != 2) {
        printf("missing args\n");
        exit(-1);
    }

    offset = atoi(argv[1]);

again:
    buf = malloc(BUF_SIZE);
    buf_pageend = (char *)((unsigned long)(buf + 4095) & 0xfffffffffffff000UL);
    if (buf_pageend - buf < 1024) { // make sure we have enough space in case the 'offset' is negative
        // don't free, occupy it in order to alloc another different block
        goto again;
    }
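    // Touch the whole buffer up front so the timed loop below takes no page faults.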
    memset(buf, 0, BUF_SIZE);

    printf("&i = %lx, &offset=%lx\n", &i, &offset);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ROUND; i++) {
        //*((unsigned long *)(buf_pageend + offset)) = 0; // statement 1: 3.2ns
        //*((unsigned long *)(buf_pageend + offset)) = (unsigned long)(buf_pageend + offset); // statement 2: 2.2ns
        //*((unsigned long *)(buf_pageend + offset)) = i; // statement 3: 3.7ns
        //*((unsigned long *)(buf_pageend + offset)) = offset; // statement 4: 2.6ns
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    start_ns = start.tv_sec*1000000000 + start.tv_nsec;
    end_ns = end.tv_sec*1000000000 + end.tv_nsec;
    printf("ns: %lf\n", (end_ns - start_ns)/ROUND);
    return 0;
}

EDIT 2022-10-30 17:43, following the discussion in the comments:

The asm generated for the second assignment statement (compiled with -O0) is:

movq    -176(%rbp), %rdx
movq    -64(%rbp), %rax
leaq    (%rdx,%rax), %rcx
movq    -176(%rbp), %rdx // delete this line
movq    -64(%rbp), %rax // delete this line
addq    %rdx, %rax
movq    %rcx, (%rax)
movq    -112(%rbp), %rax
addq    $1, %rax
movq    %rax, -112(%rbp)

If I delete the two lines marked with //, the result changes from 2.2ns to 3.6ns.
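For reference, the fix suggested in the comments below: casting the store target to `volatile` forces the compiler to keep the store even with optimization enabled. A rough sketch of the changed loop, as a drop-in replacement for the timed loop above (only the cast is new; the variables are the ones already declared in the program):

// Store through a volatile-qualified pointer so the compiler must keep
// the store even at -O1/-O2; everything else can still be optimized.
for (i = 0; i < ROUND; i++) {
    *(volatile unsigned long *)(buf_pageend + offset) = i;
}

With this cast, all four assignments measure the same for me (see the comments below).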

  • Did you compile this with optimization disabled? I assume so, because `buf` isn't `volatile char*`. Also, because all of these should run at 1 store per clock cycle (after the initial page fault). `ROUND` is probably high enough that it should pretty much amortize that page-fault cost and give the CPU time to get to max turbo frequency. – Peter Cordes Oct 29 '22 at 18:03
  • Also, what CPU did you test this on? If Intel, and you compiled without optimization, part of the effect might be Sandybridge-family's variable-latency store-forwarding: [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) . If specifically a Skylake-derived CPU, like Coffee Lake, it might also involve code alignment and [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) – Peter Cordes Oct 29 '22 at 20:02
  • BTW, you don't need `goto` for that retry loop; that's what `do{}while()` is for. Oh, and you do memset it before the store loop, so you won't get page faults inside the loop. So it's just a store bandwidth test, or should be. Compilers should auto-vectorize if you enable full optimization, even for the one that stores `i` which changes inside the loop. – Peter Cordes Oct 30 '22 at 03:50
  • @PeterCordes My CPU is an Intel(R) Xeon(R) Platinum 8260. I have read the first link and it's really useful. Store-forwarding is indeed a possible cause; I'm trying to study it further. But gcc optimization will eliminate my assignment in the loop. I'm trying to use the Intel Performance Counter Monitor to find out what is happening in my code. – haolee Oct 30 '22 at 04:28
  • So you're actually measuring a loop with tons of overhead and other bottlenecks. :/ Like I said, `volatile` would force the store to happen (e.g. `*(volatile unsigned long *)(buf_pageend + offset)`), while allowing all other optimization. So would a `Benchmark::DoNotOptimize` on the array, to force the compiler to materialize its contents. Although you're not filling the array, just rewriting the same location repeatedly. – Peter Cordes Oct 30 '22 at 05:28
  • @PeterCordes Oh, I didn't realize the intention of volatile in your first comment. When you said it again in your last comment, I suddenly realized its purpose. With the volatile keyword, these assignments all have the same performance. Thanks for your help!! – haolee Oct 30 '22 at 05:56
  • @PeterCordes Hello, does it make sense for me to keep trying to find out which factor causes the stall? The perf counters show that the difference between the first two assignment statements comes from `uops_executed.stall_cycles` and `cycle_activity.cycles_mem_any` rather than `ld_blocks.store_forward`. I'm curious what the true reason for the stall is. – haolee Oct 30 '22 at 09:12
  • In the asm generated with `-O0`? If you're curious about the CPU-architecture details, sure. The compiler storing and reloading `i` won't be a store-forwarding *stall*, so you won't get counts for `ld_blocks.store_forward`. You'll get successful store-forwarding with the associated 3 to 5 cycle latency. See [What are the costs of failed store-to-load forwarding on x86?](https://stackoverflow.com/a/69631247) for an asm experiment that does cause store-forwarding stalls: a wide load after a narrow store. – Peter Cordes Oct 30 '22 at 09:28
  • @PeterCordes Yes, I compiled it with `-O0`. If the stall is not caused by store_forward, what could be the reason for `cycle_activity.cycles_mem_any`? I've appended more information to the original question. – haolee Oct 30 '22 at 09:45
  • Load latency is stalling execution progress in the core, but store-to-load forwarding is taking the 3 to 5 cycle latency fast path which pipelines, not the 15 cycle or so latency "SF stall" which doesn't pipeline with other SF stalls. `ld_blocks.store_forward` counts loads that can't store-forward efficiently, not cases where execution is blocked by the latency of normal fast-path store-forwarding. (Only a few CPUs have had zero-latency store forwarding, like Zen 2 https://www.agner.org/forum/viewtopic.php?t=41 and I think Ice Lake) – Peter Cordes Oct 30 '22 at 09:57
  • See [Can modern x86 implementations store-forward from more than one prior store?](https://stackoverflow.com/q/46135766) for the kinds of things `ld_blocks.store_forward` counts. – Peter Cordes Oct 30 '22 at 09:58
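As a follow-up experiment for myself: a rough sketch of the pattern Peter describes above, a narrow store followed by a wider overlapping reload. This is my own guess at such a test, based on his description and the linked answer rather than code taken from either, so the details (buffer size, iteration count) are arbitrary:

#include <stdio.h>
#include <stdint.h>

#define ROUND 100000000UL

int main(void)
{
    uint8_t buf[16] __attribute__((aligned(16))) = {0};
    uint64_t sum = 0;
    unsigned long r;

    for (r = 0; r < ROUND; r++) {
        // One-byte store...
        *(volatile uint8_t *)&buf[0] = (uint8_t)r;
        // ...then an eight-byte reload that overlaps it. The load needs data
        // from both the pending store and memory, so it can't take the
        // store-forwarding fast path; it should show up in
        // ld_blocks.store_forward. (The cast relies on GCC's usual handling
        // of such type punning.)
        sum += *(volatile uint64_t *)&buf[0];
    }
    printf("sum = %llu\n", (unsigned long long)sum);
    return 0;
}

Unlike my original loop, running this under perf should produce ld_blocks.store_forward counts.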

0 Answers