Great question Mark!
No. It isn't faster.
At least not for the benchmarking I have done with clang on an Intel processor and gcc on an ARM64 processor.
Indeed the ?
results in either conditional branches or conditional instructions. Despite that, the compiled code for the ?
approach ends up a third faster than the code using -
and &
, which avoids conditional execution.
See the code below used for testing. Also included are the compiled instructions for that innermost statement. I also tested another alternative than the ones you listed Mark:
crc = (crc >> 1) ^ (crc & 1 ? 0xa001 : 0);
That one still uses a conditional, but isolates it to what is exclusive-or'ed to the shifted CRC. Interestingly, this compiles on Intel to the exact same code as the minus-and approach, with no conditional execution. On ARM it compiles to different, third set of instructions. For both Intel and ARM, it is still slower than the first approach.
Bottom line, the common conditional execution approach is 30% to 35% faster than attempts at avoiding the conditional execution.
Why, you may ask? As best as I can tell for ARM, it is that you execute fewer instructions on average using conditionals. The conditional approach can take advantage of the fact that one instruction will only be executed half the time, on average. That apparently outweighs the cost of unpredicted branches. Also modern processors seem to be doing a very good job handling unpredictable branches. For Intel, it is six instructions either way, and where it does not seem there is any benefit half the time.
This is only for two architectures, technologies, and associated compilers. Your mileage may vary.
// Test the speed of alternative bit-wise CRC implementations. To do this, we
// simply propagate a CRC assuming zeros for the data with an initial value of
// all ones. A down-shifting (reflected) 16-bit CRC is used. All times are for
// one billion iterations, with each iteration cycling eight zero bits.
#include <stddef.h>
// Implementation using a conditional expression. This compiles to use a
// conditional move instruction on Intel or a conditional branch on ARM, and an
// unrolled inner loop on both. There are six instructions per iteration on
// Intel, 2.5 on average on ARM. This compiles to the fastest code on both
// architectures.
//
// Intel i7 (clang 11.0.3 -O3): 6.6 seconds
// movl %eax, %ecx
// shrl %ecx
// movl %ecx, %edx
// xorl $40961, %edx ## imm = 0xA001
// testb $1, %al
// cmovel %ecx, %edx
//
// ARM64 (gcc 6.3.0 -O3): 12.0 seconds
// tbnz x0, 0, .L4
// lsr w0, w0, 1
// .L5:
//---------
// .L4:
// eor w0, w2, w0, lsr 1 ## w2 = 0xa001
// b .L5
unsigned crc16_cond(unsigned crc, size_t len) {
while (len--)
for (unsigned k = 0; k < 8; k++)
crc = crc & 1 ? (crc >> 1) ^ 0xa001 : crc >> 1;
return crc;
}
// Different implementation using a conditional expression as before, but here
// just to select zero or the polynomial. This compiles to use no conditionals
// and an unrolled inner loop. There are six instructions per iteration on
// Intel, three on ARM. This ends up compiling to the exact same instructions
// as the minus implementation on Intel.
//
// Intel i7 (clang 11.0.3 -O3): 8.9 seconds
// movl %eax, %ecx
// shrl %ecx
// andl $1, %eax
// negl %eax
// andl $40961, %eax ## imm = 0xA001
// xorl %ecx, %eax
//
// ARM64 (gcc 6.3.0 -O3): 15.5 seconds
// tst x0, 1
// csel w3, w1, wzr, ne ## w1 = 0xa001
// eor w0, w3, w0, lsr 1
unsigned crc16_cond2(unsigned crc, size_t len) {
while (len--)
for (unsigned k = 0; k < 8; k++)
crc = (crc >> 1) ^ (crc & 1 ? 0xa001 : 0);
return crc;
}
// Implementation using a minus and an and to select zero or the polynomial.
// This compiles as written in an unrolled inner loop. There are six
// instructions per iteration on Intel and three on ARM.
//
// Intel i7 (clang 11.0.3 -O3): 8.9 seconds
// movl %eax, %ecx
// shrl %ecx
// andl $1, %eax
// negl %eax
// andl $40961, %eax ## imm = 0xA001
// xorl %ecx, %eax
//
// ARM64 (gcc 6.3.0 -O3): 16.0 seconds
// sbfx x2, x0, 0, 1
// and w2, w2, w1 ## w1 = 0xa001
// eor w0, w2, w0, lsr 1
unsigned crc16_minus(unsigned crc, size_t len) {
while (len--)
for (unsigned k = 0; k < 8; k++)
crc = (crc >> 1) ^ ((0 - (crc & 1)) & 0xa001);
return crc;
}
#include <stdio.h>
int main(void) {
unsigned crc = crc16_cond2(0xffff, 1000000000);
printf("%04x\n", crc);
return 0;
}