Will there be a performance difference between the following two loops?


.LOOP1:
        mov     edi, dword ptr [rbx]
        call    f(int)@PLT
        add     rbx, 4
        cmp     rbx, r14
        jne     .LOOP1

.LOOP2:
        mov     edi, dword ptr [r14]
        call    f(int)@PLT
        add     r14, 4
        cmp     r14, qword ptr [rbx + 8]
        jne     .LOOP2

The only significant difference is the cmp instruction. Assuming the cmp in LOOP1 takes just one cycle, will the cmp in LOOP2 take more than one cycle because it first has to load from [rbx + 8]?

EDIT: The code comes from the two different ways we access the elements of a vector. IMO the range-based for loop is slightly better, if we consider one cmp to perform differently than the other.

https://godbolt.org/z/z9d94WGYh

#include <vector>

using namespace std;
int f(int);

void range(const vector<int> &input) {
    for (int i : input) {
        f(i);
    }
}

void iter(const vector<int> &input) {
    for (auto i = input.begin(); i != input.end(); ++i) {
        f(*i);
    }
}

$ gcc -O3

range(std::vector<int, std::allocator<int> > const&):
        push    rbp
        push    rbx
        sub     rsp, 8
        mov     rbx, QWORD PTR [rdi]
        mov     rbp, QWORD PTR [rdi+8]
        cmp     rbp, rbx
        je      .L1
.L3:
        mov     edi, DWORD PTR [rbx]
        add     rbx, 4
        call    f(int)
        cmp     rbp, rbx
        jne     .L3
.L1:
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
iter(std::vector<int, std::allocator<int> > const&):
        push    rbp
        push    rbx
        sub     rsp, 8
        mov     rbx, QWORD PTR [rdi]
        cmp     rbx, QWORD PTR [rdi+8]
        je      .L7
        mov     rbp, rdi
.L9:
        mov     edi, DWORD PTR [rbx]
        add     rbx, 4
        call    f(int)
        cmp     QWORD PTR [rbp+8], rbx
        jne     .L9
.L7:
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
A. K.
  • The loops could potentially run at the same speed; Intel CPUs can still micro-fuse and macro-fuse `cmp reg, mem`/`jcc` into a single load+compare+branch uop in the front-end with that simple addressing mode, and probably AMD as well. It will take an extra load execution unit in the back-end. *Assuming the cmp in LOOP1 will just take one cycle* - that's not how performance works on modern x86. Skylake, for example, can issue/rename 4 uops per clock cycle. [predicting latency for operations on modern superscalar processors and how can I calculate?](https://stackoverflow.com/q/51607391) – Peter Cordes Aug 15 '23 at 17:32
  • You could profile them and see for yourself. And how big a difference would you need to see for it to be worth worrying about? – Scott Hunter Aug 15 '23 at 17:33
  • Re: macro + micro fusion: see [What is instruction fusion in contemporary x86 processors?](https://stackoverflow.com/q/56413517). (And of course https://agner.org/optimize/ ; it sounds like you need to read up on some basics of thinking about the performance of modern CPUs. See also [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899), which corrects the misconception that each instruction separately costs a certain number of cycles and that total time can be calculated by just adding up cycle counts across instructions.) – Peter Cordes Aug 15 '23 at 17:35
  • One expects the memory reference to take a while; the pipeline may hide that, or may not, and how well it can hide it depends in part on what else is going on along the memory path. – old_timer Aug 15 '23 at 20:26
  • A register access is expected to take one clock cycle, but a memory reference can take anywhere from a few to hundreds of clock cycles. – old_timer Aug 15 '23 at 20:27
  • The rbx+8 math itself is essentially free. – old_timer Aug 15 '23 at 20:28
  • Even cached, the memory reference is not assumed to take one clock; it depends on the pipeline, and this is an x86, with massive overhead (to make the pipe smoother) relative to something high-performance like an ARM. – old_timer Aug 15 '23 at 20:29
  • Re "will be a performance difference in the following loops?": unlikely. The expensive part of the code in the loop is the `call f(int)@PLT`. But by all means, time or profile both variants to find out (a rough harness is sketched after these comments). Any performance difference of < 2% should be regarded as noise. – njuffa Aug 16 '23 at 00:50
  • @old_timer The latency of a load like that, if it hits L1, is 4 (Zen, E-cores) or 5 cycles these days. Sometimes that's bad, but it's not very important in this case, since it only holds up the p6 µop of the fused cmp/branch pair, not the rest of the loop. – harold Aug 16 '23 at 08:19
  • The range-for is obviously better. It *might* not run any faster, but it won't be slower (unless some secondary effect, like code alignment, just randomly happens to be worse instead of better with the different code). Unless it's necessary to re-evaluate `input.end()` inside the loop (e.g. if `f()` can change the vector size via some other reference to the same vector), don't do it; see the hoisting sketch after these comments. – Peter Cordes Aug 17 '23 at 06:04
  • See also [Why do compilers miss vectorization here?](https://stackoverflow.com/q/76920847): when a compiler thinks the loop bound might not be loop-invariant, it defeats some loop optimizations like unrolling and vectorization. (Those would only be possible for your code after inlining `f()`, in which case it likely could prove that `f` wouldn't be changing the vector size.) – Peter Cordes Aug 17 '23 at 21:15
  • I've been analyzing this vector access pattern for a while: https://github.com/llvm/llvm-project/issues/63328. It is really frustrating to see this suboptimal behavior because of alias analysis; I'm hoping the compiler improves the codegen. – A. K. Aug 18 '23 at 18:00
  • I'd call that **[escape analysis](https://en.wikipedia.org/wiki/Escape_analysis)**, since you're calling a non-inline function that could modify *anything* that could potentially be pointed-to by a global variable. Alias analysis would be if the compiler was worried about two pointers that it can see maybe both pointing to the same thing. (Not a problem for Clang or GCC in this case or in the Q&A I linked in my last comment, since they do type-based alias analysis, `-fstrict-aliasing`, so they know that a `size_t` can't overlap with a `size_t*`, but could with another `size_t`.) – Peter Cordes Aug 19 '23 at 04:30
  • The biggest factor in improving code-gen is making sure the compiler can see the definition of `f()`, e.g. with `-flto` if it's not in a separately-compiled library. Even if it doesn't inline, inter-procedural analysis may notice that it can't change `.end` in the vector. And if it does inline `f`, so much the better. If `f` isn't tiny, the extra overhead of the extra load will be negligible. If `f` *is* tiny, it should inline. – Peter Cordes Aug 19 '23 at 04:35
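
Peter Cordes's point about not re-evaluating `input.end()` inside the loop can be made concrete. Here is a minimal sketch (not part of the original question; the name iter_hoisted is invented for illustration) of the iterator variant with the bound hoisted into a local. A local iterator cannot be modified by the call to f(), so the bound is loop-invariant by construction and the compiler can keep it in a callee-saved register, producing the same cmp reg, reg loop as range():

#include <vector>

int f(int);  // opaque: defined in another translation unit

// Hypothetical variant: read .end() once, before the loop.
void iter_hoisted(const std::vector<int> &input) {
    const auto last = input.end();  // local copy; f() cannot change it
    for (auto it = input.begin(); it != last; ++it) {
        f(*it);
    }
}

This is what the range-based for loop does implicitly: it evaluates begin() and end() exactly once before iterating.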

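As Scott Hunter and njuffa suggest, the honest way to settle the question is to measure. Below is a rough timing harness, a sketch only: the vector size and repetition count are invented, it assumes range() and iter() are linked in from the question's translation unit, and it assumes f() has a side effect the optimizer cannot remove.

#include <chrono>
#include <cstdio>
#include <vector>

void range(const std::vector<int> &input);  // from the question's code
void iter(const std::vector<int> &input);

int main() {
    std::vector<int> data(1 << 20, 1);  // hypothetical size: 1 Mi ints

    void (*variants[])(const std::vector<int> &) = {range, iter};
    const char *names[] = {"range", "iter"};

    for (int v = 0; v < 2; ++v) {
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 100; ++rep)  // repeat to rise above timer noise
            variants[v](data);
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        std::printf("%s: %lld us\n", names[v], us);
    }
}

Per njuffa's rule of thumb, treat any difference under about 2% as noise; averaging several runs on a quiescent machine helps.
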
0 Answers