Does the CRC calculation trick using subtraction to avoid conditional execution make it faster?

Question

A typical innermost statement in a C or C++ implementation of a bit-wise CRC calculation looks like this:

crc = crc & 1 ? (crc >> 1) ^ 0xa001 : crc >> 1;

The ? suggests that the implementation will execute instructions conditionally. The condition will be true about half the time and false about half the time, so it cannot be predicted. The prevailing wisdom is that unpredictable conditional branches should be avoided in order to optimize code.

The same calculation can be done with no conditionals thusly:

crc = (crc >> 1) ^ ((0 - (crc & 1)) & 0xa001);

In this case, the intermediate calculations can all be done predictably in the arithmetic-logic unit of the CPU. So in theory this should be faster.

Is it?

At least in my case, the subtraction "trick" is a legacy thing (older compiler, older processors, 486, 68000, ... ). The is another variation used for left shifting CRC, using 32 bit CRC as an example, instead of (0 - (CRC>>31)), (((int)CRC)>>31) (signed right shift to generate all 0 or all 1 bits. Modern compilers will do "tricks" like this to avoid conditionals. — rcgldr, Aug 01 '20 at 22:35
Just a side question, are you extremely short on ROM or something? Why not use the LUT approach, which is faster? — Boris Lipschitz, Aug 01 '20 at 23:36
@BorisLipschitz - Although the question is an example of CRC, I think the more general case would be something like `condition ? value : 0;` . If speed is the goal, and if the buffer is large enough (256 bytes or so), for X86, using pclmulqdq and xmm registers is 7+ times as fast as shown in [crc32fa.asm](https://github.com/jeffareid/crc/blob/master/crc32f/crc32fa.asm) . (My system doesn't have AVX512, zmm registers should be faster). — rcgldr, Aug 02 '20 at 00:14
@BorisLipschitz Indeed there are cases where memory for code is very limited, e.g. in embedded processors. The point here however is not the application, but the interesting result in optimization. I am well aware of byte-wise and word-wise table implementations, as generated by my [crcany](https://github.com/madler/crcany) code. — Mark Adler, Aug 02 '20 at 01:45

Mark Adler · Answer 1 · 2020-08-02T14:45:27.777

Great question Mark!

No. It isn't faster.

At least not for the benchmarking I have done with clang on an Intel processor and gcc on an ARM64 processor.

Indeed the ? results in either conditional branches or conditional instructions. Despite that, the compiled code for the ? approach ends up a third faster than the code using - and &, which avoids conditional execution.

See the code below used for testing. Also included are the compiled instructions for that innermost statement. I also tested another alternative than the ones you listed Mark:

crc = (crc >> 1) ^ (crc & 1 ? 0xa001 : 0);

That one still uses a conditional, but isolates it to what is exclusive-or'ed to the shifted CRC. Interestingly, this compiles on Intel to the exact same code as the minus-and approach, with no conditional execution. On ARM it compiles to different, third set of instructions. For both Intel and ARM, it is still slower than the first approach.

Bottom line, the common conditional execution approach is 30% to 35% faster than attempts at avoiding the conditional execution.

Why, you may ask? As best as I can tell for ARM, it is that you execute fewer instructions on average using conditionals. The conditional approach can take advantage of the fact that one instruction will only be executed half the time, on average. That apparently outweighs the cost of unpredicted branches. Also modern processors seem to be doing a very good job handling unpredictable branches. For Intel, it is six instructions either way, and where it does not seem there is any benefit half the time.

This is only for two architectures, technologies, and associated compilers. Your mileage may vary.

// Test the speed of alternative bit-wise CRC implementations. To do this, we
// simply propagate a CRC assuming zeros for the data with an initial value of
// all ones. A down-shifting (reflected) 16-bit CRC is used. All times are for
// one billion iterations, with each iteration cycling eight zero bits.

#include <stddef.h>

// Implementation using a conditional expression. This compiles to use a
// conditional move instruction on Intel or a conditional branch on ARM, and an
// unrolled inner loop on both. There are six instructions per iteration on
// Intel, 2.5 on average on ARM. This compiles to the fastest code on both
// architectures.
//
// Intel i7 (clang 11.0.3 -O3): 6.6 seconds
//      movl    %eax, %ecx
//      shrl    %ecx
//      movl    %ecx, %edx
//      xorl    $40961, %edx            ## imm = 0xA001
//      testb   $1, %al
//      cmovel  %ecx, %edx
//
// ARM64 (gcc 6.3.0 -O3): 12.0 seconds
//      tbnz    x0, 0, .L4
//      lsr     w0, w0, 1
//  .L5:
//---------
//  .L4:
//      eor    w0, w2, w0, lsr 1        ## w2 = 0xa001
//      b     .L5
unsigned crc16_cond(unsigned crc, size_t len) {
    while (len--)
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ 0xa001 : crc >> 1;
    return crc;
}

// Different implementation using a conditional expression as before, but here
// just to select zero or the polynomial. This compiles to use no conditionals
// and an unrolled inner loop. There are six instructions per iteration on
// Intel, three on ARM. This ends up compiling to the exact same instructions
// as the minus implementation on Intel.
//
// Intel i7 (clang 11.0.3 -O3): 8.9 seconds
//      movl    %eax, %ecx
//      shrl    %ecx
//      andl    $1, %eax
//      negl    %eax
//      andl    $40961, %eax            ## imm = 0xA001
//      xorl    %ecx, %eax
//
// ARM64 (gcc 6.3.0 -O3): 15.5 seconds
//      tst     x0, 1
//      csel    w3, w1, wzr, ne         ## w1 = 0xa001
//      eor     w0, w3, w0, lsr 1
unsigned crc16_cond2(unsigned crc, size_t len) {
    while (len--)
        for (unsigned k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (crc & 1 ? 0xa001 : 0);
    return crc;
}

// Implementation using a minus and an and to select zero or the polynomial.
// This compiles as written in an unrolled inner loop. There are six
// instructions per iteration on Intel and three on ARM.
//
// Intel i7 (clang 11.0.3 -O3): 8.9 seconds
//      movl    %eax, %ecx
//      shrl    %ecx
//      andl    $1, %eax
//      negl    %eax
//      andl    $40961, %eax            ## imm = 0xA001
//      xorl    %ecx, %eax
//
// ARM64 (gcc 6.3.0 -O3): 16.0 seconds
//      sbfx    x2, x0, 0, 1
//      and     w2, w2, w1              ## w1 = 0xa001
//      eor     w0, w2, w0, lsr 1
unsigned crc16_minus(unsigned crc, size_t len) {
    while (len--)
        for (unsigned k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((0 - (crc & 1)) & 0xa001);
    return crc;
}

#include <stdio.h>

int main(void) {
    unsigned crc = crc16_cond2(0xffff, 1000000000);
    printf("%04x\n", crc);
    return 0;
}

`Also modern processors seem to be doing a very good job handling unpredictable branches.` No, compilers do great job not using branch instructions. — 0___________, Aug 01 '20 at 23:18
@P__J__ I guess you're missing the point that the fastest one is where the compiler _did_ use branch instructions (for the ARM anyway -- it used a conditional move for Intel, for which the processor still needs to track two paths). — Mark Adler, Aug 02 '20 at 01:41
If you have the most efficient asm, why don't you use it? You can't guarantee that the compiler will continue to generate said instructions, can you? — xaxxon, Aug 02 '20 at 04:11
> *for which the processor still needs to track two paths* No there are no two paths. See the answer by Peter Cordes: [Is CMOVcc considered a branching instruction?](https://stackoverflow.com/a/57524462/2279422) — Waqar, Aug 02 '20 at 06:34
Slower on my system (Visual Studio / Intel 3770K (3rd gen i7)). I'll switch to ? : the next time I copy / paste old code (some of it going back to 68000 / Atari ST - 1980's). — rcgldr, Aug 05 '20 at 14:12

Waqar · Answer 2 · 2020-08-02T06:41:20.843

For me on my personal machine (Intel Core i5 4300U), I got the following results with Google benchmark:

-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
NoBranch   11333534841 ns   11320263595 ns            1
Branch     11195210177 ns   11182300909 ns            1

Furthermore, this can be verified on Quickbench with Clang (10.0).

For GCC (10.1), the performance is same. Quickbench

The fastest case(with Clang) seems to be:

    crc = crc & 1 ? (crc >> 1) ^ 0xa001 : crc >> 1;

Clang 10.0 Assembly:

        mov     ecx, edi
        shr     ecx
        mov     eax, ecx
        xor     eax, 40961
        test    dil, 1
        cmove   eax, ecx

Assembly with GCC 10.1 x86-64:

        mov     eax, edi
        shr     eax
        mov     edx, eax
        xor     edx, 40961
        and     edi, 1
        cmovne  eax, edx

# Total uops = 7

The following reasoning in your answer is incorrect:

The conditional approach can take advantage of the fact that one or two instructions will only be executed half the time, on average. That apparently outweighs the cost of unpredicted branches. Also modern processors seem to be doing a very good job handling unpredictable branches.

There's NO branch prediction going on here. There's however 'predication' or conditional register to register mov's which may very well be the reason for this speed up.

Update

I removed the inner while loop for the benchmark so that we only measure what we want and leave out the looping to google benchmark. The functions now look like:

unsigned crc16_br(unsigned crc) {
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ 0xa001 : crc >> 1;
    return crc;
}

The performance (with GCC and Clang (x86-64)) is same for all three variations with a very minor difference. Link to benchmark. Godbolt asm output for both GCC and Clang.

Unpredicted branches and no branch prediction sounds like the same thing to me. — Mark Adler, Aug 02 '20 at 01:42
I guess I may have misunderstood you in that case. However, there are *no* branches here. That is the whole point of using conditional instructions. And `cmov` is never free. — Waqar, Aug 02 '20 at 06:27
Also, ARM processors have short pipelines, which lead to a relatively smaller mispredict penalty. — Waqar, Aug 02 '20 at 15:10

Does the CRC calculation trick using subtraction to avoid conditional execution make it faster?

2 Answers2

Update