bool equal(uint8_t* b1, uint8_t* b2) {
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    for (int ii = 0; ii < 64; ++ii) {
        if (b1[ii] != b2[ii]) {
            return false;
        }
    }
    return true;
}
Looking at the assembly, neither clang nor gcc seems to apply any optimization (with flags -O3 -mavx512f -msse4.2) apart from loop unrolling. I would think it's pretty easy to just load both memory regions into 512-bit registers and compare them. Even more surprisingly, both compilers also fail to optimize this (ideally only a single 64-bit compare is required, with no special large registers):
bool equal(uint8_t* b1, uint8_t* b2) {
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 8);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 8);
    for (int ii = 0; ii < 8; ++ii) {
        if (b1[ii] != b2[ii]) {
            return false;
        }
    }
    return true;
}
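To make the "single 64-bit compare" concrete, this is roughly the code I would expect the 8-byte loop to collapse into. It is only a sketch: the name equal8_single_compare is made up for illustration, and it assumes both pointers refer to at least 8 readable bytes.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: compares exactly 8 bytes using one 64-bit comparison.
bool equal8_single_compare(const uint8_t* b1, const uint8_t* b2) {
    uint64_t w1, w2;
    // memcpy is the aliasing-safe way to reinterpret the bytes; with a
    // constant size of 8, compilers typically lower it to a plain 64-bit load.
    std::memcpy(&w1, b1, sizeof w1);
    std::memcpy(&w2, b2, sizeof w2);
    return w1 == w2;
}
```

It returns the same result as the byte loop for any two 8-byte buffers, since two 64-bit words are equal exactly when all of their bytes are equal.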
So are both compilers just dumb, or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?
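One variant that might be worth comparing, short of inline assembly, is a version without the early return: it ORs together the XOR of every byte pair and tests the result once at the end. This is a sketch, not a confirmed fix. The name equal64_no_early_exit is made up, and whether a given compiler version actually vectorizes it is an assumption that would have to be checked in the generated assembly.

```cpp
#include <cstdint>

// Hypothetical variant of equal() for 64-byte buffers with no early exit.
// Every byte is always read, so the loop is a plain OR-reduction.
bool equal64_no_early_exit(const uint8_t* b1, const uint8_t* b2) {
    b1 = static_cast<const uint8_t*>(__builtin_assume_aligned(b1, 64));
    b2 = static_cast<const uint8_t*>(__builtin_assume_aligned(b2, 64));
    uint8_t diff = 0;
    for (int ii = 0; ii < 64; ++ii) {
        // XOR is nonzero exactly when the two bytes differ.
        diff |= static_cast<uint8_t>(b1[ii] ^ b2[ii]);
    }
    return diff == 0;
}
```

The result matches the original function, but the trade-off is that it always touches all 64 bytes of both buffers instead of stopping at the first mismatch.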