1
bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    for(int ii = 0; ii < 64; ++ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

Looking at the assembly, clang and gcc don't seem to have any optimizations to add(with flags -O3 -mavx512f -msse4.2) apart from loop unrolling. I would think its pretty easy to just put both memory regions in 512 bit registers and compare them. Even more surprisingly both compilers also fail to optimize this(ideally only a single 64 bit compare required and no special large registers required):

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,8);
    b2=(uint8_t*)__builtin_assume_aligned(b2,8);
    for(int ii = 0; ii < 8; ++ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

So are both compilers just dumb or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?

arc31
  • 127
  • 1
  • 5
  • 1
    Have you tried out the performance of [`std::equal`](https://en.cppreference.com/w/cpp/algorithm/equal)? – NathanOliver Aug 17 '22 at 21:09
  • 1
    Near duplicate of [Auto vectorization with Rust](https://stackoverflow.com/a/73119184) - GCC/LLVM are incapable of auto-vectorizing loops that have an early-exit condition. – Peter Cordes Aug 17 '22 at 21:37

2 Answers2

3

"I assume" the following is most efficient

memcmp(b1, b2, any_size_you_need);

especially for huge arrays!

(For small arrays, there is not a lot to gain anyway!)

Otherwise, you would need to vectorize manually using Intel Intrinsics. (Also mentioned by chtz.) I started to look at that until i thought about memcmp.

sidcoder
  • 460
  • 2
  • 6
  • 1
    Yes, this is definitely a pragmatic solution (assuming OP always needs bitwise equality). Note that most compilers are able to inline/unroll `memcmp` for small sizes known at compile time, e.g., clang gives almost the same in these three cases: https://godbolt.org/z/YPGG7Edx7 (`assume_aligned` is not considered -- and gcc is still sub-optimal, but better than the manual loop). – chtz Aug 17 '22 at 21:27
  • For manual vectorization, I propose to look at "Intel Intrinsics Guide". Basic idea is to "load" the 8-bit values into a special larger register (128, 256, 512 - whatever is available) for parts of b1 and b2. And then compare the two special registers in one step. – sidcoder Aug 17 '22 at 21:52
  • Based on my previous comment, it might be a similar idea to cast b1 and b2 to e.g. 64-bit integer and compare those then in bigger steps (anyway care to be taken at the end of the array). – sidcoder Aug 17 '22 at 21:57
0

The compiler must assume that once the function returns (or it is exiting the loop), it can't read any bytes behind the current index -- for example if one of the pointers happens to point to somewhere near the boundary of invalid memory. You can give the compiler a chance to optimize this by using (non-lazy) bitwise &/| operators to combine the results of the individual comparisons:

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    bool ret = true;
    for(int ii = 0; ii < 64; ++ii){
        ret &= (b1[ii]==b2[ii]);
    }
    return ret;
}

Godbolt demo: https://godbolt.org/z/3ePh7q5rM

gcc still fails to vectorize this, though. So you may need to write manual vectorized versions of this, if this function is performance critical.

chtz
  • 17,329
  • 4
  • 26
  • 56
  • 1
    If it aligned the pointer, it would be safe, just like how `memcmp` or `strcmp` are hand-vectorized with inline asm. You might be right that GCC avoids reading memory the C abstract machine wouldn't have, even if it's safe on the target (because memory protection has page granularity: [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739)) Perhaps because of tools like valgrind? Anyway, the showstopper for GCC/clang is that the trip-count can't be known before the first iteration, so removing the early-out makes it *possible* – Peter Cordes Aug 17 '22 at 21:35
  • If I compare `struct Blob { uint8_t data[64]; bool operator==(const Blob&) const = default; }` it still does it byte by byte. I don't get why the default `==` doesn't generate the same code as `memcmp()` (which doesn't use SSE but compares 2x 8 byte at at time). – Goswin von Brederlow Aug 19 '22 at 06:54