How do I get clang/gcc to vectorize looped array comparisons?

Question

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    for(int ii = 0; ii < 64; ++ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

Looking at the assembly, clang and gcc don't seem to have any optimizations to add(with flags -O3 -mavx512f -msse4.2) apart from loop unrolling. I would think its pretty easy to just put both memory regions in 512 bit registers and compare them. Even more surprisingly both compilers also fail to optimize this(ideally only a single 64 bit compare required and no special large registers required):

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,8);
    b2=(uint8_t*)__builtin_assume_aligned(b2,8);
    for(int ii = 0; ii < 8; ++ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

So are both compilers just dumb or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?

Have you tried out the performance of [`std::equal`](https://en.cppreference.com/w/cpp/algorithm/equal)? — NathanOliver, Aug 17 '22 at 21:09
Near duplicate of [Auto vectorization with Rust](https://stackoverflow.com/a/73119184) - GCC/LLVM are incapable of auto-vectorizing loops that have an early-exit condition. — Peter Cordes, Aug 17 '22 at 21:37

score 3 · Accepted Answer · answered Aug 17 '22 at 21:14

3

"I assume" the following is most efficient

memcmp(b1, b2, any_size_you_need);

especially for huge arrays!

(For small arrays, there is not a lot to gain anyway!)

Otherwise, you would need to vectorize manually using Intel Intrinsics. (Also mentioned by chtz.) I started to look at that until i thought about memcmp.

answered Aug 17 '22 at 21:14

sidcoder

460
2
6

1

Yes, this is definitely a pragmatic solution (assuming OP always needs bitwise equality). Note that most compilers are able to inline/unroll `memcmp` for small sizes known at compile time, e.g., clang gives almost the same in these three cases: https://godbolt.org/z/YPGG7Edx7 (`assume_aligned` is not considered -- and gcc is still sub-optimal, but better than the manual loop). – chtz Aug 17 '22 at 21:27
For manual vectorization, I propose to look at "Intel Intrinsics Guide". Basic idea is to "load" the 8-bit values into a special larger register (128, 256, 512 - whatever is available) for parts of b1 and b2. And then compare the two special registers in one step. – sidcoder Aug 17 '22 at 21:52
Based on my previous comment, it might be a similar idea to cast b1 and b2 to e.g. 64-bit integer and compare those then in bigger steps (anyway care to be taken at the end of the array). – sidcoder Aug 17 '22 at 21:57

score 0 · Answer 2 · answered Aug 17 '22 at 21:07

0

The compiler must assume that once the function returns (or it is exiting the loop), it can't read any bytes behind the current index -- for example if one of the pointers happens to point to somewhere near the boundary of invalid memory. You can give the compiler a chance to optimize this by using (non-lazy) bitwise &/| operators to combine the results of the individual comparisons:

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    bool ret = true;
    for(int ii = 0; ii < 64; ++ii){
        ret &= (b1[ii]==b2[ii]);
    }
    return ret;
}

Godbolt demo: https://godbolt.org/z/3ePh7q5rM

gcc still fails to vectorize this, though. So you may need to write manual vectorized versions of this, if this function is performance critical.

answered Aug 17 '22 at 21:07

chtz

17,329
4
26
56

1

If it aligned the pointer, it would be safe, just like how `memcmp` or `strcmp` are hand-vectorized with inline asm. You might be right that GCC avoids reading memory the C abstract machine wouldn't have, even if it's safe on the target (because memory protection has page granularity: [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739)) Perhaps because of tools like valgrind? Anyway, the showstopper for GCC/clang is that the trip-count can't be known before the first iteration, so removing the early-out makes it *possible* – Peter Cordes Aug 17 '22 at 21:35
If I compare `struct Blob { uint8_t data[64]; bool operator==(const Blob&) const = default; }` it still does it byte by byte. I don't get why the default `==` doesn't generate the same code as `memcmp()` (which doesn't use SSE but compares 2x 8 byte at at time). – Goswin von Brederlow Aug 19 '22 at 06:54

How do I get clang/gcc to vectorize looped array comparisons?

2 Answers2