
I have the following code (the manual version is from Mark Adler's answer):

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <nmmintrin.h>

    // CRC-32C (Castagnoli) polynomial, bit-reflected
    #define POLY2 0x82f63b78
    uint32_t crc32c2(uint32_t crc, const unsigned char *buf, size_t len)
    {
        int k;

        crc = ~crc;
        while (len--) {
            crc ^= *buf++;
            for (k = 0; k < 8; k++)
                crc = crc & 1 ? (crc >> 1) ^ POLY2 : crc >> 1;
        }
        return ~crc;
    }

    int main(int argc, char **argv)
    {
        const unsigned int val = 5;
        std::cout << std::hex << crc32c2(0,(const unsigned char*)&val,4) << std::endl;
        std::cout << _mm_crc32_u32(0, 5) << std::endl;
    }

Output is:

ee00d08c

a6679b4b

My question is: why does the manual version not give the same answer as the intrinsic?

NoSenseEtAl

1 Answer


Mark Adler's answer on "Implementing SSE 4.2's CRC32C in software" shows that you need to pre-process and post-process the CRC value: start with 0 ^ 0xffffffff, and end with crc0 ^ 0xffffffff. (Or use the ~ operator, like you're already doing in the software version.)
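To make the relationship concrete, here is a minimal software sketch, assuming the instruction behaves like the bitwise loop from the question with the two inversions removed. The names `crc32c_raw_u8`, `crc32c_raw` and `crc32c_wrapped` are made up for illustration and are not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: software model of one byte-sized crc32 step.
// Note there is no inversion of the input or the output here.
uint32_t crc32c_raw_u8(uint32_t crc, unsigned char byte)
{
    crc ^= byte;
    for (int k = 0; k < 8; k++)
        crc = (crc & 1) ? (crc >> 1) ^ 0x82f63b78u : crc >> 1;
    return crc;
}

// Hypothetical helper: feed a whole buffer through the raw step,
// chaining each output into the next input.
uint32_t crc32c_raw(uint32_t crc, const unsigned char *buf, size_t len)
{
    while (len--)
        crc = crc32c_raw_u8(crc, *buf++);
    return crc;
}

// Conventional CRC-32C: invert once before and once after the raw update.
uint32_t crc32c_wrapped(uint32_t crc, const unsigned char *buf, size_t len)
{
    return ~crc32c_raw(~crc, buf, len);
}
```

For the little-endian bytes of the value 5 (`05 00 00 00`), `crc32c_raw(0, ...)` should reproduce the intrinsic's `a6679b4b` from the question, and `crc32c_wrapped(0, ...)` the manual version's `ee00d08c`.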

Mark's answer uses GNU C inline asm, but porting it to intrinsics would be simple. (It unrolls with multiple accumulators to hide the latency of crc32_u64 over a big buffer.)

This version works on my system.

    int main(int argc, char **argv)
    {
        const unsigned int val = 5;
        std::cout << std::hex << crc32c2(0,(const unsigned char*)&val,4) << '\n';
        std::cout << (_mm_crc32_u32(0^0xffffffff, 5) ^ 0xffffffffU) << '\n';
    }

(Note that std::endl is pointlessly slower than a newline, unless you actually need to force a flush in case the stream was full-buffered instead of line buffered.)
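As a rough sketch of what a simple intrinsics port over a buffer could look like (a single accumulator, not Mark's unrolled multi-accumulator version), something like the following might work. `crc32c_buf` and `HAVE_HW_CRC32C` are names invented here. The sketch assumes a GCC/Clang-style `__SSE4_2__` macro (e.g. from `-msse4.2`) on x86-64, and falls back to the bitwise loop otherwise.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumption: GCC/Clang define __SSE4_2__ when SSE4.2 code generation is enabled.
#if defined(__SSE4_2__) && defined(__x86_64__)
#include <nmmintrin.h>
#define HAVE_HW_CRC32C 1
#else
#define HAVE_HW_CRC32C 0
#endif

// Hypothetical function: CRC-32C of a buffer, same interface as crc32c2.
uint32_t crc32c_buf(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;  // pre-process once, outside the loop
#if HAVE_HW_CRC32C
    uint64_t acc = crc;
    while (len >= 8) {
        uint64_t chunk;
        std::memcpy(&chunk, buf, 8);      // safe unaligned load
        acc = _mm_crc32_u64(acc, chunk);  // output feeds the next step
        buf += 8;
        len -= 8;
    }
    crc = static_cast<uint32_t>(acc);
    while (len--)                         // leftover tail bytes
        crc = _mm_crc32_u8(crc, *buf++);
#else
    // Portable fallback: the bitwise loop from the question.
    while (len--) {
        crc ^= *buf++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82f63b78u : crc >> 1;
    }
#endif
    return ~crc;  // post-process once at the end
}
```

Because the inversions happen only once per call, the inner loop is just a chain of crc32 steps. A call such as `crc32c_buf(0, (const unsigned char*)&val, 4)` is meant to match `crc32c2` from the question.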

Mark Adler
Peter Cordes
  • Ah, so the intrinsic is meant to be used, for example, in the inner loop of a function, so it cannot do the pre- and post-processing... makes sense – NoSenseEtAl Jun 18 '18 at 00:14
  • @NoSenseEtAl: I'm not sure why the asm instruction doesn't invert, actually. Inverting the input/output values would cancel out when used in a loop, unless I'm missing something. Maybe there's a use-case for the non-inverting version, otherwise IDK why Intel would leave more work for software to do. (Inverting is basically free in hardware: I think a logic gate that logically inverts one of its inputs, or its output, doesn't take extra transistors vs. a normal gate.) The intrinsic is of course just a wrapper for the asm instruction. – Peter Cordes Jun 18 '18 at 00:45
  • I am not sure it would cancel out, since in the loop CRC operates on bits and I am not sure it is symmetric in the sense that ~input would just give ~result. But I rarely use bitwise operations, so you are probably right :) – NoSenseEtAl Jun 18 '18 at 00:51
  • @NoSenseEtAl: The output of one `_mm_crc32_u64` / u32 / u16 / u8 is the `crc` input to the next one (see Mark Adler's loop). Two bitwise inversions are the same as zero. If you're using multiple accumulators, you need to combine them, and that might use a byte LUT indexed by bytes of the CRC accumulator. (See `crc32c_shift` and the call-site). – Peter Cordes Jun 18 '18 at 01:02
  • A table with bit-inverted indexing wouldn't cost any extra, but wouldn't be a drop-in replacement for software CRC routines if they did that, too. So maybe Intel left out the pre/post merely to simplify software development? That isn't normally worth it, but it's only a trivial amount of overhead for medium to large buffers. (2 uops, 2c latency, probably 4 bytes of code.) – Peter Cordes Jun 18 '18 at 01:03