3

The line of code

next += val;

makes it about 10x slower. I have checked the ASM code as well, not just the timing result.

Why does this line of code degrade performance by about 10x?

Here is the result:

➜  ~ clang-13 1.c -O3
➜  ~ ./a.out
rand_read_1
sum = 2624b18779c40, time = 0.19s
rand_read_2
sum = 2624b18779c40, time = 1.24s

CPU: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

Here is the code:

#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <assert.h>
#include <stdlib.h>

#define CCR_MULTIPLY_64           6364136223846793005
#define CCR_ADD_64                1
static inline uint64_t my_rand64(uint64_t *r)
{
    *r = *r * CCR_MULTIPLY_64 + CCR_ADD_64;
    return *r;
}

#define NUM 10000000UL

uint64_t rand_read_1(uint64_t *ptr, uint64_t nr_words)
{
    uint64_t i, next, val = 0;
    uint64_t sum;

    next = 0;
    sum = 0;
    for (i = 0; i < NUM; i++) {
        my_rand64(&next);
        next %= nr_words;
        val = ptr[next];
        sum += val ^ next;
        // printf("next1:%ld\n", next);
    }

    return sum;
}

uint64_t rand_read_2(uint64_t *ptr, uint64_t nr_words)
{
    uint64_t i, next, val, next2 = 0;
    uint64_t sum;

    next = 0;
    sum = 0;
    for (i = 0; i < NUM; i++) {
        my_rand64(&next);
        next %= nr_words;
        val = ptr[next];
        sum += val ^ next;
        next += val;
    }

    return sum;
}

#define SIZE    (1024*1024*1024)

static uint64_t get_ns(void)
{
    struct timespec val;
    uint64_t v;
    int ret;

    ret = clock_gettime(CLOCK_REALTIME, &val);
    if (ret != 0) {
        perror("clock_gettime");
        exit(1);
    }
    v  = (uint64_t) val.tv_sec * 1000000000LL;
    v += (uint64_t) val.tv_nsec;
    return v;
}

int main(int argc, char *argv[])
{
    uint64_t *ptr;
    uint64_t sum;
    uint64_t t0, t1, td, t2;

    ptr = (uint64_t *)malloc(SIZE);
    assert(ptr);

    memset(ptr, 0, SIZE);

    t0 = get_ns();
    printf("rand_read_1\n");
    sum = rand_read_1(ptr, SIZE/8);
    t1 = get_ns();
    td = t1 - t0;
    printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
    printf("rand_read_2\n");
    sum = rand_read_2(ptr, SIZE/8);
    t2 = get_ns();
    td = t2 - t1;
    printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
    return 0;
}
zhanglistar
    Start by not timing the printfs – John3136 Jan 12 '23 at 04:51
  • What does your code do? Can you give us some context? – John Kugelman Jan 12 '23 at 04:58
  • The first one is slower since it has to "warm up" the cache used by stdout/printf? What happens if you comment out the `rand_read` calls? – Lundin Jan 12 '23 at 09:15
    `malloc(SIZE);` ... `memset(ptr, 0, SIZE);` Use `calloc`. Or better yet default init the whole chunk to some random goo using `volatile` access, and then zero it out afterwards. So that you guarantee that late heap allocation by the OS in itself does not affect the benchmark. – Lundin Jan 12 '23 at 09:16
  • @Casey When editing code then NEVER change coding style. Your sloppy vandalism edits introduced bugs in the code that weren't there in the first place. – Lundin Jan 12 '23 at 09:24
  • I'll rollback vandalism and then edit it correctly, hang on... – Lundin Jan 12 '23 at 09:24
  • @chqrlie I agree with your points; however, I *removed* the c++ tag and added the c tag in my edit. You may be conflating the rollback edit with my original. – Casey Jan 13 '23 at 20:36
  • @Casey: OK, my bad, I misread the edit history. Calling your edit *vandalism* is quite exaggerated. – chqrlie Jan 14 '23 at 08:20

2 Answers

6

The method of benchmarking is a bit dodgy, but this is a real effect.

next += val; changes something fundamental about the structure of the code: it makes each memory read depend on the result of the previous read. Without that line, the reads are independent (there is a shorter loop-carried dependency chain through my_rand64, which the memory read is not a part of).

Essentially with that line it's a latency benchmark, and without that line, it's a throughput benchmark. Latency and throughput differing by a factor of 10 is reasonable for memory reads.
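
To make the distinction concrete, here is a minimal sketch (not the OP's code, just an illustration of the two access patterns): in the first function every address is known up front, so the CPU can keep many cache misses in flight, while in the second each address is the value that was just loaded, so the loads serialize at memory latency.

#include <stdint.h>
#include <stddef.h>

/* Throughput-bound: idx[] does not depend on the values in arr[],
   so many loads can be outstanding at the same time. */
uint64_t independent_loads(const uint64_t *arr, const uint32_t *idx, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += arr[idx[i]];
    return sum;
}

/* Latency-bound: the next address is the value just loaded (this assumes arr[]
   holds in-range indices), so each load waits for the previous one to finish. */
uint64_t chased_loads(const uint64_t *arr, size_t n)
{
    uint64_t cur = 0, sum = 0;
    for (size_t i = 0; i < n; i++) {
        cur = arr[cur];
        sum += cur;
    }
    return sum;
}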

At the assembly level, without that line the asm looks like this when compiled with Clang

.LBB2_3:                                # =>This Inner Loop Header: Depth=1
    imul    rcx, r15
    add     rcx, 1
    mov     edx, ecx
    and     edx, 134217727
    xor     rdx, qword ptr [r14 + 8*rdx]
    mov     esi, r15d
    imul    esi, ecx
    add     rdx, rbx
    add     esi, 1
    and     esi, 134217727
    mov     rbx, qword ptr [r14 + 8*rsi]
    xor     rbx, rsi
    add     rbx, rdx
    mov     rcx, rsi
    add     rax, -2
    jne     .LBB2_3

uiCA estimates 9.16 cycles per iteration (the loop was unrolled by a factor of 2, so this corresponds to about 4.5 cycles per iteration of the original loop), but it does not take cache misses into account.

With that line, the assembly looks nearly the same, but that doesn't mean it runs in nearly the same way:

.LBB2_6:                                # =>This Inner Loop Header: Depth=1
    imul    ecx, r15d
    add     ecx, 1
    and     ecx, 134217727
    mov     rdx, qword ptr [r14 + 8*rcx]
    mov     rsi, rcx
    xor     rsi, rdx
    add     rsi, rbx
    add     edx, ecx
    imul    edx, r15d
    add     edx, 1
    and     edx, 134217727
    mov     rcx, qword ptr [r14 + 8*rdx]
    mov     rbx, rdx
    xor     rbx, rcx
    add     rbx, rsi
    add     rdx, rcx
    mov     rcx, rdx
    add     rax, -2
    jne     .LBB2_6

Now uiCA estimates 24.11 cycles per iteration (this loop was also unrolled by 2x), again without taking cache misses into account.

harold
  • Thanks, I also doubted that this line of code reduces the throughput of the CPU pipeline. – zhanglistar Jan 12 '23 at 05:58
  • Are there any other tools to analyze the code? – zhanglistar Jan 12 '23 at 06:56
    @zhanglistar Intel VTune for example, or `perf` on Linux, can do analysis of code as it runs, so they can show effects that tools such as LLVM-MCA or uiCA cannot analyze. – harold Jan 12 '23 at 07:29
  • How did you generate this assembly code? I get different assembly code. Besides, from the assembly code, modifying the variable next is a register operation, not a memory operation. I still have some confusion. Thanks. – zhanglistar Jan 13 '23 at 07:08
  • @zhanglistar I got the assembly [like this](https://godbolt.org/z/6rzdrb5be) (scroll down a bit, you need the inlined version where the modulo has been optimized, not the standalone-function-version with `div`). You're right that `next += val` modifies a register, but that is exactly the problem: the register being modified in that way means that `add rdx, rcx` has to wait for the load to complete, then `mov rcx, rdx` has to wait for that `add` (and therefore indirectly also waits for the load) and so on until it ties back into the next load – harold Jan 13 '23 at 07:54
    @zhanglistar you may read [A whirlwind introduction to dataflow graphs](https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/) as a reference for what I'm talking about with those dependencies and instructions waiting for other instructions – harold Jan 13 '23 at 07:57
  • Thanks a lot for your explanation. I will dig into it deeper. – zhanglistar Jan 13 '23 at 10:06
-1

Some notes regarding how to not do benchmarking (benchmarking is very hard):

  • malloc is very expensive and you allocate a lot of memory here. However, the OS may not actually allocate it before it notices that the memory is being used. So unless you force the OS to actually allocate the memory before the benchmarking starts, you'll be benchmarking how slow malloc is and nothing else, which is a gigantic performance killer next to your little algorithm.

    Additionally, you may allocate too much memory for your OS to handle, and you also do not necessarily allocate an aligned multiple of sizeof(uint64_t), which is a bug.

    You could do something like this instead (might have to reduce SIZE first):

    ptr = (uint64_t *)malloc(SIZE * sizeof(uint64_t));
    assert(ptr);
    for(size_t i=0; i<SIZE; i++)
    {
      volatile uint64_t tmp = 123; // some garbage value
      ptr[i] = tmp; // the compiler has to load 123 from stack to heap
    }
    // actual heap allocation should now be done
    
  • The memset to zero probably does not count as "touching" the heap memory, and the compiler might even optimize it out by swapping to calloc. When I check the optimized code (gcc x86_64 Linux), this is indeed what happens. So this does not achieve the "touching the heap" thing described above (see the sketch after this list for an alternative way to force the pages to be touched).

  • printf and the underlying stdout buffering are performance-heavy and should never be placed inside the code being benchmarked. You might just end up benchmarking how slow printf is. For example, changing your code to

    printf("rand_read_1\n");
    t0 = get_ns();
    sum = rand_read_1(ptr, SIZE/8);
    t1 = get_ns();
    

    gave vastly different results.

  • The last td = t2 - t1; is nonsense, since you measure all manner of things unrelated to your algorithm that have happened since the last measurement, including any number of context switches by the OS...
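
As an alternative sketch of the "touch the memory first" step, a plain memset followed by a compiler barrier can also be used; this follows the approach Peter Cordes suggests in the comments below and assumes a GCC/Clang-style asm statement (untested here):

    memset(ptr, 1, SIZE * sizeof(uint64_t)); // force the OS to actually back every page
    asm("" ::: "memory");                    // compiler barrier so the first memset isn't optimized away
    memset(ptr, 0, SIZE * sizeof(uint64_t)); // zero it again before the actual benchmark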

With all these bug fixes applied, main() might look like this instead:

int main(int argc, char *argv[])
{
    uint64_t *ptr;
    uint64_t sum;
    uint64_t t0, t1, td;

    ptr = malloc(SIZE * sizeof(uint64_t));
    assert(ptr);
    for(size_t i=0; i<SIZE; i++)
    {
      volatile uint64_t tmp = 123; // some garbage value
      ptr[i] = tmp; // the compiler has to load 123 from stack to heap
    }

    printf("rand_read_1\n");
    t0 = get_ns();
    sum = rand_read_1(ptr, SIZE/8);
    t1 = get_ns();
    td = t1 - t0;
    printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
    printf("rand_read_2\n");
    
    t0 = get_ns();
    sum = rand_read_2(ptr, SIZE/8);
    t1 = get_ns();
    td = t1 - t0;
    printf("sum = %lx, time = %.2fs\n", sum, td/1E9);
    return 0;
}

In addition, I would perhaps also advise executing method 1, then method 2, then method 1, then method 2, to ensure that the benchmarking isn't biased by the first algorithm loading values into cache and the second reusing them, or by the eventual context switch that will happen some time after launching your program. A rough sketch of such an interleaved harness is shown below.
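
This sketch reuses ptr, get_ns and the two read functions from the code above; the round count of 4 is an arbitrary choice:

    for (int round = 0; round < 4; round++) {
        t0 = get_ns();
        sum = rand_read_1(ptr, SIZE/8);
        t1 = get_ns();
        printf("round %d rand_read_1: sum = %lx, time = %.2fs\n", round, sum, (t1 - t0)/1E9);

        t0 = get_ns();
        sum = rand_read_2(ptr, SIZE/8);
        t1 = get_ns();
        printf("round %d rand_read_2: sum = %lx, time = %.2fs\n", round, sum, (t1 - t0)/1E9);
    }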

From there on you can start to measure and consider the performance of your actual algorithm, which is not what you are currently doing. I would suggest that you post a separate question about that.

Lundin
  • The code assumes `sizeof(uint64_t) = 8`, which is true on the system they're benchmarking on. It's not as portable, but it doesn't actually read outside the allocated size: it calls `rand_read_1(ptr, SIZE/8);` and that division is truncating, rounding down in case SIZE allocated a partial uint64_t. Also, if sizeof is smaller, it'll use less of the allocated size, and I think it's not possible for `sizeof(uint64_t)` to be greater than 8; `CHAR_BIT` is guaranteed to be at least 8, and `uint64_t` is guaranteed to be 64 value bits with no padding. – Peter Cordes Jan 12 '23 at 10:10
  • @PeterCordes I only know that when I ran it with `#define SIZE (1024*1024*1024)` it crashed for running out of memory and when I therefore changed it to `#define SIZE (1024*1024)` it segfaulted instead. `sizeof(uint64_t) * SIZE` solved the seg fault. There might be some other bug causing that, I suppose. I didn't even look at the actual algorithm since the benchmarking itself was fubar. – Lundin Jan 12 '23 at 10:13
  • You're right that GCC will optimize `malloc+memset(0)` into `calloc`, which doesn't actually touch the memory first, so the first access to each group of pages will pay for a page fault. This does 10M accesses over 1GiB (256K 4k pages or 512x 2M largepages), so on average (if the LCG PRNG touches every page) each page fault is amortized over at worst 38.1 accesses (4k pages without fault-around), or 19.5k accesses per fault on 2M hugepages. – Peter Cordes Jan 12 '23 at 10:15
  • IDK, maybe I misread the code; I didn't try it. I just did try it with `clang -O3` on my x86-64 Linux system, and it runs without crashing. (0.02 and 0.14 sec on i7-6700k Skylake. IDK how the OP got such hugely higher numbers, although memory latency is known to be a lot higher on Skylake-server. My stdout was a KDE `konsole` terminal window; printf inside the timed region is bad, but still trivial vs. 0.03 seconds.) – Peter Cordes Jan 12 '23 at 10:16
  • @PeterCordes https://godbolt.org/z/vcfKsxahx Is what I tried with. `1024*1024*1024` followed by `malloc(SIZE)` gives exit code 139 (out of memory). `1024*1024*1024 * sizeof(uint64_t)` triggers the assert. `1024*1024*sizeof(uint64_t)` works. – Lundin Jan 12 '23 at 10:22
  • And there's still some hiccup in there because which one of method 1 and 2 that gets executed first has heavy impact on the result. – Lundin Jan 12 '23 at 10:24
  • Anyway this is just further evidence of the pile that approximately 100% of all performance benchmarking questions on SO could have their results explained by incorrect benchmarking, not by the actual algorithm... – Lundin Jan 12 '23 at 10:26
  • Your change makes it allocate 8GiB instead of the OP's 1GiB. Both run fine on my machine (with 32GiB), but of course Godbolt won't let it allocate that much. Yes, of course it's terrible benchmark methodology, with only one of them paying for the page fault costs. But it's the faster one first, and it's still faster despite paying for those page faults. – Peter Cordes Jan 12 '23 at 10:26
  • Harold's answer is correct, there is a real effect here. Tying the load result back into the address for the next load slows it down by about a factor of `max_memory_parallelism` of whatever CPU. On Skylake, that's 12 LFBs; ability to track cache misses to 12 incoming lines from L1d, so a factor of 10 is just about right. This isn't just a duplicate of [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Jan 12 '23 at 10:28
  • Oh, but the original benchmark with `calloc` allocation was potentially getting L2 cache hits (and maybe some dTLB misses), with all the pages CoW mapped to the same 4k or 2M page of zeros. So maybe more like Harold's uiCA results but with load latency being L2 hit latency instead of the L1d hit latency it's using? – Peter Cordes Jan 12 '23 at 10:33
  • Thanks, but the result is the same. – zhanglistar Jan 12 '23 at 12:00
  • @PeterCordes So maybe the test should simply be `#ifdef TEST1 run_test1(); #else run_test2(); #endif` to ensure that they aren't comparing apples and oranges. – Lundin Jan 12 '23 at 12:00
  • @zhanglistar No it isn't. I get completely different results after correcting your benchmarking bugs. One algorithm is likely faster than the other, but if you can't even measure it correctly, then how can you tell? – Lundin Jan 12 '23 at 12:01
  • Your non-zero init loop already warms up the memory so one after the other is ok. Your corrected benchmark loop correctly measures the same thing the original code was already sloppily measuring, just with an 8x larger array. Like @zhanglistar, the speed ratio is about the same on my x86-64 Arch Linux system. What "completely different" result did you get, on what system? (You don't need to *read* `123` from a `volatile tmp` every loop iteration, though; if you want to stop dead stores from optimizing away, the assignments to the array could be through `volatile uint64_t *pv` or something.) – Peter Cordes Jan 12 '23 at 12:20
  • In general, testing them in the other order is a good way to make sure the result is real, especially when the first one is slower. Testing only one is also useful in a case where you're worried the compiler might optimize across benchmarks, and/or the first one might be too good a warm-up for the 2nd. Or if you want to profile it with `perf`, then you want minimal startup overhead and as much of the run time as possible spent in the actual benchmark. – Peter Cordes Jan 12 '23 at 12:23
  • @PeterCordes Quite recently I've learned that many compilers including gcc have shaky support for volatile access vs access of volatile object. Due to some language lawyer detail of [DR476](https://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_476). This DR is implemented in C23 but not yet live in gcc. – Lundin Jan 12 '23 at 12:24
  • One easy and efficient fix would be to `memset(..., 1, size)` then `asm("" ::: "memory")` to force that to be in sync with the abstract machine, then `memset(..., 0, size)` to zero it before use, if you don't trust `volatile uint64_t*`. It's not like having `volatile` on `uint64_t temp` is helping anything; all the values read from it are part of dead stores. Deref of a `volatile*` works well enough in most cases, though. – Peter Cordes Jan 12 '23 at 12:28
  • BTW, on my system where the OP's original code gives similar ratios of performance, `/sys/kernel/mm/transparent_hugepage/enabled` is set to `always`, and `/sys/kernel/mm/transparent_hugepage/defrag` is set to `defer+madvise`. And I probably have enough free memory that the kernel was probably able to find a fair amount of contiguous 2M hugepages, so even paying for the page-faults in the first timed region didn't slow it tremendously. I could imagine that might be relevant for different results on different systems. – Peter Cordes Jan 12 '23 at 12:32
  • I didn't downvote. There is definitely a real effect here, though, and your answer seems to be saying that the *only* effect is bad benchmark methodology so I didn't upvote. A better-designed benchmark like yours would have made the original question much better, but it's a big enough effect that there's still a real question; like I said, I can reproduce it on my machine with similar results from the original bad benchmark as from your fixed version. – Peter Cordes Jan 12 '23 at 12:33