why is std::equal much slower than a hand rolled loop for two small std::array?

Question

I was profiling a small piece of code that is part of a larger simulation, and to my surprise, the STL function equal (std::equal) is much slower than a simple for-loop, comparing the two arrays element by element. I wrote a small test case, which I believe to be a fair comparison between the two, and the difference, using g++ 6.1.1 from the Debian archives is not insignificant. I am comparing two, four-element arrays of signed integers. I tested std::equal, operator==, and a small for loop. I didn't use std::chrono for an exact timing, but the difference can be seen explicitly with time ./a.out.

My question is, given the sample code below, why does operator== and the overloaded function std::equal (which calls operator== I believe) take approx 40s to complete, and the hand written loop take only 8s? I'm using a very recent intel based laptop. The for-loop is faster on all optimizations levels, -O1, -O2, -O3, and -Ofast. I compiled the code with g++ -std=c++14 -Ofast -march=native -mtune=native

Run the code

The loop runs a huge number of times, just to make the difference clear to the naked eye. The modulo operators represent a cheap operation on one of the array elements, and serve to keep the compiler from optimizing out of the loop.

#include<iostream>
#include<algorithm>
#include<array>

using namespace std;
using T = array<int32_t, 4>;

bool 
are_equal_manual(const T& L, const T& R)
noexcept {
    bool test{ true };
    for(uint32_t i{0}; i < 4; ++i) { test = test && (L[i] == R[i]); }
    return test;
}

bool
are_equal_alg(const T& L, const T& R)
noexcept {
    bool test{ equal(cbegin(L),cend(L),cbegin(R)) };
    return test;
}

int main(int argc, char** argv) {

    T left{ {0,1,2,3} };
    T right{ {0,1,2,3} };

    cout << boolalpha << are_equal_manual(left,right) << endl;
    cout << boolalpha << are_equal_alg(left,right) << endl;
    cout << boolalpha << (left == right) << endl;

    bool t{};
    const size_t N{ 5000000000 };
    for(size_t i{}; i < N; ++i) {
      //t = left == right; // SLOW
      //t = are_equal_manual(left,right); // FAST
        t = are_equal_alg(left,right);  // SLOW
      left[0] = i % 10;
      right[2] = i % 8;
    }

    cout<< boolalpha << t << endl;

    return(EXIT_SUCCESS);
}

Could not reproduce your result. All three versions have the same performance, although I would expect yours to be slower since it always runs the loop to the end (whereas it can stop as soon as it hits a pair of elements that do not compare equal). — Leon, Sep 01 '16 at 04:18
[Assembly output](https://godbolt.org/g/L5ioi4), manual compare is unrolled cmpl, whereas `equal` and `==` is using `memcmp` — Bryan Chen, Sep 01 '16 at 04:23
Leon - I was able to reproduce it on coliru as well. Have you tried running the code with the link I provided? Which version of GCC are you using, and on which platform? — KBentley57, Sep 01 '16 at 04:38
If you pass your own lambda `bool test{ equal(cbegin(L),cend(L),cbegin(R), [](int a, int b) { return a == b; })};` the compiler can optimize it better it seems. — Jesse Good, Sep 01 '16 at 04:43
Jesse - having to pass a lambda for this case seems a little redundant to me; isn't that what operator== should be doing as it is, since they are fundamental types? — KBentley57, Sep 01 '16 at 04:49
libstdc++'s `equal` [dispatches to `memcmp` when it's safe](https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/bits/stl_algobase.h#L807). That's intended as an optimization, but perhaps that's not the right thing to do in this instance. — T.C., Sep 01 '16 at 04:58
@KBentley57: Yes, you are correct. This is an implementation issue in libstdc++'s library as pointed out by T.C. — Jesse Good, Sep 01 '16 at 05:04
I'm surprised `memcmp` is a function call rather than an intrinsic... — ildjarn, Sep 01 '16 at 21:55

score 1 · Answer 1 · answered Sep 01 '16 at 04:42

Here's the generated assembly of the for loop in main() when the are_equal_manual(left,right) function is used:

.L21:
        xor     esi, esi
        test    eax, eax
        jne     .L20
        cmp     edx, 2
        sete    sil
.L20:
        mov     rax, rcx
        movzx   esi, sil
        mul     r8
        shr     rdx, 3
        lea     rax, [rdx+rdx*4]
        mov     edx, ecx
        add     rax, rax
        sub     edx, eax
        mov     eax, edx
        mov     edx, ecx
        add     rcx, 1
        and     edx, 7
        cmp     rcx, rdi

And here's what's generated when the are_equal_alg(left,right) function is used:

.L20:
        lea     rsi, [rsp+16]
        mov     edx, 16
        mov     rdi, rsp
        call    memcmp
        mov     ecx, eax
        mov     rax, rbx
        mov     rdi, rbx
        mul     r12
        shr     rdx, 3
        lea     rax, [rdx+rdx*4]
        add     rax, rax
        sub     rdi, rax
        mov     eax, ebx
        add     rbx, 1
        and     eax, 7
        cmp     rbx, rbp
        mov     DWORD PTR [rsp], edi
        mov     DWORD PTR [rsp+24], eax
        jne     .L20

I'm not exactly sure what's happening in the generated code for first case, but it's clearly not calling memcmp(). It doesn't appear to be comparing the contents of the arrays at all. While the loop is still being iterated 5000000000 times, it's optimized to doing nothing much. However, the loop that uses are_equal_alg(left,right) is still performing the comparison. Basically, the compiler is still able to optimize the manual comparison much better than the std::equal template.

Michael - I used the modulus operator to emulate a call to what is essentially rand() on one of the elements for each loop iteration. If having the modulus isn't enough to keep the compiler from optimizing something out, I can try to edit and post a more realistic version. — KBentley57, Sep 01 '16 at 04:52
@KBentley57: the loop is still iterating - I think the problem is what Jesse Good and you noticed: that `std::equal<>()` doesn't appear to be directly comparing fundamental types when it probably should. I haven't done the work to track down the implementation of gcc's `std::equal`, but it seems to be relying on `memcmp()` in this case when there's probably a way it could avoid that for arrays (or vectors/other containers?) of fundamental types as you noted in your comment above. — Michael Burr, Sep 01 '16 at 04:58
I believe T.C. posted the implementation, and it does appear to be default to memcmp. Do you think this is worthy off bug report, or is it just the way it is for fundamental types? — KBentley57, Sep 01 '16 at 05:09

why is std::equal much slower than a hand rolled loop for two small std::array?

1 Answers1