4

I have written two functions that get the sum of an array. The first is written in C++ and the other with inline assembly (x86-64), and I compared the performance of the two functions on my machine.

  • If the -O flag is not enabled during compilation, the function with inline assembly is almost 4-5x faster than the C++ version.

    cpp time : 543070068 nanoseconds
    cpp time : 547990578 nanoseconds
    
    asm time : 185495494 nanoseconds
    asm time : 188597476 nanoseconds
    
  • If the -O flag is set to -O1, they produce the same performance.

    cpp time : 177510914 nanoseconds
    cpp time : 178084988 nanoseconds
    
    asm time : 179036546 nanoseconds
    asm time : 181641378 nanoseconds
    
  • But if I set the -O flag to -O2 or -O3, I get an unusual 2-3 digit nanosecond reading for the function written with inline assembly, which seems suspiciously fast (at least to me; please bear with me, since I have no solid experience with assembly programming, so I don't know how fast or slow it can be compared to a program written in C++).

    cpp time : 177522894 nanoseconds
    cpp time : 183816275 nanoseconds
    
    asm time : 125 nanoseconds
    asm time : 75 nanoseconds
    

My Questions

  • Why is this array sum function written with inline assembly so fast after enabling -O2 or -O3?

  • Is this a normal reading, or is there something wrong with the timing/measurement of the performance?

  • Or maybe there is something wrong with my inline assembly function?

  • And if the inline assembly function for the array sum is correct and the performance reading is correct, why did the C++ compiler fail to optimize a simple array sum for the C++ version and make it as fast as the inline assembly version?

I have also speculated that maybe memory alignment and cache misses are improved during compilation to increase the performance, but my knowledge of this is still very limited.

Apart from answering my questions, if you have anything to add, please feel free to do so. I hope somebody can explain, thanks!


[EDIT]

So I removed the use of the macro, isolated the runs of the two versions, and also tried adding the volatile keyword, a "memory" clobber, and the "+&r" constraint for the output, and the performance is now the same as cpp_sum.

Though if I remove the volatile keyword and the "memory" clobber again, I'm still getting those 2-3 digit nanosecond readings.

code:

#include <iostream>
#include <random>
#include <chrono>
#include <cstdint>   // uint64_t
#include <cstddef>   // size_t

uint64_t sum_cpp(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    for(size_t i=0; i<length; ++i) {
        sum += numbers[i];
    }
    return sum;
}

uint64_t sum_asm(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    asm volatile(
        "xorq %%rax, %%rax\n\t"
        "%=:\n\t"
        "addq (%[numbers], %%rax, 8), %[sum]\n\t"
        "incq %%rax\n\t"
        "cmpq %%rax, %[length]\n\t"
        "jne %=b"
        : [sum]"+&r"(sum)
        : [numbers]"r"(numbers), [length]"r"(length)
        : "%rax", "memory", "cc"
    );
    return sum;
}

int main() {
    std::mt19937_64 rand_engine(1);
    std::uniform_int_distribution<uint64_t> random_number(0,5000);

    size_t length = 99999999;
    uint64_t *arr = new uint64_t[length];
    for(size_t i=0; i<length; ++i) arr[i] = random_number(rand_engine); // i=0: don't leave arr[0] uninitialized

    uint64_t cpp_total = 0, asm_total = 0;

    for(size_t i=0; i<5; ++i) {
        auto start = std::chrono::high_resolution_clock::now();
#ifndef _INLINE_ASM
        cpp_total += sum_cpp(arr, length);
#else
        asm_total += sum_asm(arr,length);
#endif
        auto end = std::chrono::high_resolution_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start);
        std::cout << "time : " << dur.count() << " nanoseconds\n";
    }

#ifndef _INLINE_ASM
    std::cout << "cpp sum = " << cpp_total << "\n";
#else
    std::cout << "asm sum = " << asm_total << "\n";
#endif

    delete [] arr;
    return 0;
}
Peter Cordes
0xdeadbeef
    The first thing that strikes me is that you measure both versions on the same run one after the other. This alone invalidates your results because of caching. The second big problem I see is the macro. I don't know if that influences in a bad way the results, but for the love of god, just use regular functions. – bolov May 18 '22 at 01:10
  • Clearly something's optimizing away. `asm` without `volatile` tells the compiler it's a pure function of the input operands you tell the compiler about. And you're missing a `"memory"` clobber, so the compiler doesn't even have to have pointed-to memory in sync. But those probably don't explain optimizing away the loop. It's also missing an `&` early clobber in `"+&r"`, which would in theory let it pick the same register for `sum` as for one of the inputs, but it could only do that if `numbers` or `length` were also `0`. Definitely check the compiler-generated asm: https://godbolt.org/z/KeEMfoMvo – Peter Cordes May 18 '22 at 01:12
  • 1
    Note also that `-O2` and higher enables `-finline-functions`, while `-O1` only enables `-finline-functions-called-once` (and this isn't `static` or `inline`), so the version inlined into main is the key. 75ns is plausibly accounted for by timing overhead of `std::chrono` functions around a nearly-empty timed region. – Peter Cordes May 18 '22 at 01:16
  • 1
    @bolov: Timing page faults in the first pass over fresh memory is [a common mistake](https://stackoverflow.com/questions/60291987/idiomatic-way-of-performance-evaluation) but that's not happening here. There's already a slow RNG init loop that writes every array element to get that out of the way (and warm up the CPU frequency to max turbo) before even the first `cpp` timed region. Also, the output shows two equal-time CPP passes, not a first slow one. – Peter Cordes May 18 '22 at 01:22
  • 1
    A couple other notes: It's (a tiny bit) more efficient to zero RAX by using `xor %%eax, %%eax`. Because of the way modifying 32bit register works, this does indeed clear the upper bits, but is a shorter command. Also, since all you're doing is summing, it doesn't matter whether you count up or down, you'll get the same result. But counting down allows you to omit the CMPQ, and just use the flags set by DEC. And using a size_t for the number of elements seems excessive. You'd have to allocate 34 gig of memory to hold enough elements to exceed a uint32. Doesn't explain your numbers, but still. – David Wohlferd May 18 '22 at 01:28
  • BTW, apart from [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259) (or `asm volatile` plus a `memory` clobber), this is not bad asm. I'd normally use a pointer increment to avoid [unlamination of an indexed addressing](//stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes) mode on old Sandy / Ivy Bridge CPUs (`do{}while(p != arr + length);`), but this isn't even using SSE2, which is baseline for x86-64. Could go at least twice as fast if data fits in L2 or L1d cache so mem bandwidth isn't the limit. – Peter Cordes May 18 '22 at 01:29
  • I would suggest using a micro benchmarking library; I believe there's a few out there. I've used [nonius](https://github.com/rmartinho/nonius) in the past (which I think was integrated into catch2), and I believe there's also Google Benchmark. – Borgleader May 18 '22 at 01:32
  • @DavidWohlferd: Some hardware prefetchers work better in the forward direction, so it's normally best to traverse in that direction for large arrays. If you want to avoid `cmp`, get a pointer to the end of the array and index from `-length` up towards zero. Like `(arr+len)[idx]` with `for(idx=-len ; idx != 0 ; idx++)`. But usually even better to unroll some if loop overhead is a problem. And to allow more instruction-level parallelism: `add` (and branching) has 1 cycle latency, so you're missing out on doing 2 loads per clock cycle (Intel since SnB, AMD since K8). Use two sum registers. – Peter Cordes May 18 '22 at 01:32
  • 1
    Oh right, the other thing that happens without a `"memory"` clobber is that the asm statement isn't ordered wrt. function calls, so it actually *can* hoist the asm statement out of the loop. Almost a duplicate of [How does Google's \`DoNotOptimize()\` function enforce statement ordering](https://stackoverflow.com/q/69287053) which points out that effect of the `"memory"` clobber. – Peter Cordes May 18 '22 at 01:37
  • BTW, updated my answer with some new advice, since your question talking about benchmarking without even `-O1` seems to indicate you're pretty new to this in general, not just to *inline* asm. – Peter Cordes May 18 '22 at 04:29
  • The thing you should be benchmarking is `-Os` and `-O2`. Maybe also throw in a test with `-fvectorize` and/or `alignas(16)` (or 32 or 64) for the array to see if it will do SIMD. Next look at the asm generated when you compile. Is your asm function even in there? Is it in the loop? After that you can worry about the timing. – Goswin von Brederlow May 18 '22 at 08:46

1 Answer

8

The compiler is hoisting the inline asm out of your repeat loop, and thus out of your timed region.

If your goal is performance, https://gcc.gnu.org/wiki/DontUseInlineAsm. The useful thing to spend your time learning first is SIMD intrinsics (and how they compile to asm) like _mm256_add_epi64 to add 4x uint64_t with a single AVX2 instruction. See https://stackoverflow.com/tags/sse/info (Compilers can auto-vectorize decently for a simple sum like this, which you could see the benefit from if you used a smaller array and put a repeat loop inside the timed region to get some cache hits.)
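A minimal sketch of what such intrinsics look like (my addition, not from the answer; it uses baseline SSE2 so it compiles without extra flags, where the AVX2 version would use __m256i and _mm256_add_epi64 on 4 elements at a time):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, baseline for x86-64
#include <cstdint>
#include <cstddef>

// Manually vectorized sum: 2x uint64_t per 128-bit add.
uint64_t sum_sse2(const uint64_t *numbers, size_t length) {
    __m128i acc = _mm_setzero_si128();
    size_t i = 0;
    for (; i + 2 <= length; i += 2)   // unaligned load (movdqu) is fine here
        acc = _mm_add_epi64(acc,
                _mm_loadu_si128(reinterpret_cast<const __m128i*>(numbers + i)));
    uint64_t lanes[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    uint64_t sum = lanes[0] + lanes[1];  // horizontal sum of the two lanes
    for (; i < length; ++i)              // scalar cleanup for odd length
        sum += numbers[i];
    return sum;
}
```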

If you want to play around with asm to test what's actually fast on various CPUs, you can do that in a stand-alone static executable, or a function you call from C++. https://stackoverflow.com/tags/x86/info has some good performance links.

Re: benchmarking at -O0: yes, the compiler makes slow asm at the default -O0, optimizing for consistent debugging and not trying at all to optimize. It's not much of a challenge to beat it when it has its hands tied behind its back.


Why your asm can get hoisted out of the timed regions

Without being asm volatile, your asm statement is a pure function of the inputs you've told the compiler about, which are a pointer, a length, and the initial value of sum=0. It does not include the pointed-to memory because you didn't use a dummy "m" input for that. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

Without a "memory" clobber, your asm statement isn't ordered wrt. function calls, so GCC is hoisting the asm statement out of the loop. See How does Google's `DoNotOptimize()` function enforce statement ordering for more details about that effect of the "memory" clobber.

Have a look at the compiler output on https://godbolt.org/z/KeEMfoMvo and see how it inlined into main. -O2 and higher enables -finline-functions, while -O1 only enables -finline-functions-called-once and this isn't static or inline so it has to emit a stand-alone definition in case of calls from other compilation units.

75ns is just the timing overhead of std::chrono functions around a nearly-empty timed region. It is actually running, just not inside the timed regions. You can see this if you single-step the asm of your whole program, or for example set a breakpoint on the asm statement. When doing asm-level debugging of the executable, you could help yourself find it by putting a funky instruction like mov $0xdeadbeef, %eax before xor %eax,%eax, something you can search for in the debugger's disassembly output (like GDB's layout asm or layout reg; see asm debugging tips at the bottom of https://stackoverflow.com/tags/x86/info). And yes, you do often want to look at what the compiler did when debugging inline asm, how it filled in your constraints, because stepping on its toes is a very real possibility.

Note that a "memory" clobber without asm volatile would still let GCC do Common Subexpression Elimination (CSE) between two invocations of the asm statement, if there was no function call in between. Like if you put a repeat loop inside a timed region to test performance on an array small enough to fit in some level of cache.

Sanity-checking your benchmark

Is this a normal reading

It's wild that you even have to ask that. 99999999 8-byte integers in 75ns would be a memory bandwidth of 99999999 * 8 B / 75 ns = 10666666 GB/s, while fast dual-channel DDR4 might hit 32 GB/s. (Or cache bandwidth if your array were small enough to fit in cache, but it's not, so your code bottlenecks on memory.)

Or a 4GHz CPU would have had to run at 99999999 / (75*4) = 333333.33 add instructions per clock cycle, but the pipeline is only 4 to 6 uops wide on modern CPUs, with taken-branch throughputs of at best 1 for a loop branch. (https://uops.info/ and https://agner.org/optimize/)

Even with AVX-512, that's at best 2 per clock of 8x uint64_t additions per core, and compilers don't rewrite your inline asm; that would defeat its purpose compared to using plain C++ or intrinsics.

This is pretty obviously just std::chrono timing overhead from a near-empty timed region.
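Those two impossibility checks are easy to reproduce numerically (helper names are mine, purely illustrative):

```cpp
// Implied GB/s if `n` 8-byte elements were all read in `ns` nanoseconds.
double implied_gbps(double n, double ns) {
    return n * 8.0 / (ns * 1e-9) / 1e9;
}

// Implied add instructions retired per clock cycle at `ghz` GHz.
double implied_adds_per_clock(double n, double ns, double ghz) {
    return n / (ns * ghz);
}
```

Plugging in the numbers from the question gives roughly 1.07e7 GB/s and 333333 adds/clock, both absurd next to ~32 GB/s DRAM and a 4-6 uop wide pipeline.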


Asm code-review: correctness

As mentioned above, How can I indicate that the memory *pointed* to by an inline ASM argument may be used?

You were also missing the & early-clobber declaration in "+&r"(sum); without it, the compiler could in theory pick the same register for sum as for one of the inputs. But since sum is also an input, it could only do that if numbers or length were also 0.

It's kind of a toss-up whether it's better to xor-zero inside the asm for an "=&r" output, or better to use "+&r" and leave that zeroing to the compiler. For your loop counter, it makes sense because the compiler doesn't need to know about that at all. But by manually picking RAX for it (with a clobber), you're preventing the compiler from choosing to have your code produce sum in RAX, like it would want for a non-inline function. A dummy [idx] "=&r" (dummy) output operand will get the compiler to pick a register for you, of the appropriate width, e.g. intptr_t.
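Putting those correctness fixes together, a corrected sum_asm might look like this (a sketch of mine, not code from the question: the dummy "m" input covers the pointed-to array so no volatile or "memory" clobber is needed, and a dummy output operand replaces the hard-coded RAX counter; like the original, it assumes length is non-zero):

```cpp
#include <cstdint>
#include <cstddef>

// Dummy "m" input: tells the compiler the whole array is read.
// Dummy "=&r" output: compiler-chosen counter register instead of RAX.
// Assumes length != 0 (the loop body runs at least once), like the original.
uint64_t sum_asm_fixed(const uint64_t *numbers, size_t length) {
    uint64_t sum = 0;
    uint64_t idx;                       // compiler picks this register
    asm("xorl %k[idx], %k[idx]\n\t"     // 32-bit xor zeros the full register
        "%=:\n\t"
        "addq (%[numbers], %[idx], 8), %[sum]\n\t"
        "incq %[idx]\n\t"
        "cmpq %[idx], %[length]\n\t"
        "jne %=b"
        : [sum] "+&r"(sum), [idx] "=&r"(idx)
        : [numbers] "r"(numbers), [length] "r"(length),
          "m"(*(const uint64_t (*)[]) numbers)  // dummy input: array contents
        : "cc");
    return sum;
}
```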


Asm code review: performance

As David Wohlferd said: xor %eax, %eax to zero RAX. Implicit zero-extension saves a REX prefix. (1 byte of code-size in the machine code. Smaller machine-code is generally better.)

It doesn't seem worth hand-writing asm if you're not going to do anything smarter than what GCC would on its own without -ftree-vectorize or with -mgeneral-regs-only or -mno-sse2 (even though it's baseline for x86-64, kernel code generally needs to avoid SIMD registers). But I guess it works as a learning exercise in how inline asm constraints work, and a starting point for measuring. And to get a benchmark working so you can then test better loops.

Typical x86-64 CPUs can do 2 loads per clock cycle (Intel since Sandy Bridge, AMD since K8), or 3/clock on Alder Lake. On modern CPUs with AVX/AVX2, each load can be 32 bytes wide (or 64 bytes with AVX-512), best case on L1d hits. Or more like 1/clock with only L2 hits on recent Intel, which is a reasonable cache-blocking target.

But your loop can at best run 1x 8-byte load per clock cycle, because loop branches can run 1/clock, and add mem, %[sum] has a 1 cycle loop-carried dependency through sum.

That might max out DRAM bandwidth (with the help of HW prefetchers), e.g. 8 B / cycle * 4GHz = 32GB/s, which modern desktop/laptop Intel CPUs can manage for a single core (but not big Xeons). With fast enough DRAM and/or a slower CPU relative to it, even DRAM can avoid being a bottleneck. But aiming for DRAM bandwidth is quite a low bar compared to L3 or L2 cache bandwidth.

So even if you want to keep using scalar code without movdqu / paddq (or better get to an alignment boundary for memory-source paddq, if you want to spend some code-size to optimize this loop), you could still unroll with two register accumulators for sum which you add at the end. This exposes some instruction-level parallelism, allowing two memory-source loads per clock cycle.
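The two-accumulator idea, sketched in plain C++ (function name is mine):

```cpp
#include <cstdint>
#include <cstddef>

// Two independent accumulators break the single loop-carried dependency
// chain through `sum`, exposing instruction-level parallelism so the CPU
// can sustain two loads+adds per clock cycle.
uint64_t sum_unrolled2(const uint64_t *numbers, size_t length) {
    uint64_t sum0 = 0, sum1 = 0;
    size_t i = 0;
    for (; i + 2 <= length; i += 2) {  // the two adds don't depend on each other
        sum0 += numbers[i];
        sum1 += numbers[i + 1];
    }
    if (i < length)                    // handle an odd final element
        sum0 += numbers[i];
    return sum0 + sum1;                // combine the chains once, at the end
}
```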


You can also avoid the cmp, which can reduce loop overhead. Fewer uops lets out-of-order exec see farther.

Get a pointer to the end of the array and index from -length up towards zero. Like (arr+len)[idx] with for(idx=-len ; idx != 0 ; idx++). Looping backwards through the array is on some CPUs a little worse for some of the HW prefetchers, so generally not recommended for loops that are often memory bound.
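That indexing trick looks like this in C++ (a sketch; the compiler can turn end[idx] into a single indexed addressing mode, and the idx != 0 condition means the increment's flag result can feed the branch directly, no separate cmp):

```cpp
#include <cstdint>
#include <cstddef>

// Index from -len up toward zero: still a forward traversal of the
// array (prefetcher-friendly), but the loop-exit test is just idx != 0.
uint64_t sum_negidx(const uint64_t *arr, size_t len) {
    const uint64_t *end = arr + len;              // one-past-the-end
    uint64_t sum = 0;
    for (ptrdiff_t idx = -(ptrdiff_t)len; idx != 0; ++idx)
        sum += end[idx];                          // end[-len] == arr[0]
    return sum;
}
```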

See also Micro fusion and addressing modes - an indexed addressing mode can only stay micro-fused in the back-end on Intel Haswell and later, and only for instructions like add that RMW their destination register.

So your best bet would be a loop with one pointer increment and 2 to 4 add instructions using it, and a cmp/jne at the bottom.
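A sketch of that loop shape as inline asm (my code, not from the answer; it assumes a non-zero length that's a multiple of 2, and reuses the dummy "m" input described earlier to cover the pointed-to memory):

```cpp
#include <cstdint>
#include <cstddef>

// Pointer-increment loop, two accumulators, cmp/jne at the bottom.
// Assumes length is a non-zero multiple of 2.
uint64_t sum_asm_unrolled(const uint64_t *numbers, size_t length) {
    uint64_t sum0 = 0, sum1 = 0;
    const uint64_t *end = numbers + length;
    asm("%=:\n\t"
        "addq  (%[p]), %[s0]\n\t"   // two independent adds per iteration
        "addq 8(%[p]), %[s1]\n\t"
        "addq $16, %[p]\n\t"        // one pointer increment, no index register
        "cmpq %[end], %[p]\n\t"
        "jne %=b"
        : [s0] "+&r"(sum0), [s1] "+&r"(sum1), [p] "+&r"(numbers)
        : [end] "r"(end),
          "m"(*(const uint64_t (*)[]) numbers)  // dummy input: array contents
        : "cc");
    return sum0 + sum1;
}
```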

Peter Cordes