
I've been trying to implement the algorithm for CRC32 calculation as described here: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf; and I'm confused about Step 3, the reduction from 128 bits to 64 bits. Hopefully someone can clarify the steps for me:

  1. Multiply the upper 64 bits of the remaining 128 bits with the constant K5, result is 96 bits
  2. Multiply the upper 64 bits of the 96 bits with the constant K6, result is 64 bits

Do these results need to be XORed with the lower 64 bits of the starting 128 bits, following the pattern of the previous folds? Figure 8 in the paper doesn't specify, and I am confused by the alignment of the data in the figure.

M. Anne
  • Intel has already published a tuned implementation based on that whitepaper, so you might just want to use that if the licensing is compatible with your needs. [It's on github](https://github.com/01org/isa-l/blob/master/crc/crc16_t10dif_01.asm). See also this question about its [mixing of XORPS and PXOR](http://stackoverflow.com/questions/39811577/does-using-mix-of-pxor-and-xorps-affect-performance) – Peter Cordes Oct 04 '16 at 19:32
  • @PeterCordes - I converted the code from github to work with Visual Studio 2015. The comments for rk3 and rk4 are wrong, probably because the code was enhanced to fold 128 bytes at a time instead of 64 bytes at a time. rk3 is (2^(32*31) mod Q) << 32 and rk4 is (2^(32*33) mod Q) << 32. rk9 through rk20 are not commented: using x in (2^(32*x) mod Q) << 32, x for rk9 through rk20 in order is {27, 29, 23, 25, 19, 21, 15, 17, 11, 13, 7, 9} (each pair decreases by 4 starting from the values for rk3, rk4). – rcgldr Oct 14 '16 at 00:43
  • @PeterCordes - I couldn't find a tuned implementation for a bit-reflected CRC, so I modified the tuned implementation, but it isn't as straightforward as the Intel white paper implies. All the pshufb's used to convert from little to big endian are deleted, since the entire xmm registers need to be bit reversed. The left / right justified constants affect where product results end up and the associated shifting (or not shifting) needed. I have it working, except I need to handle the initial CRC value passed as a parameter. – rcgldr Jan 23 '17 at 08:51
  • @rcgldr: I haven't ever really looked much at CRC implementations, tuned or otherwise. I know enough to mostly understand your comment, but I don't have anything to add :/. If you get something working well, you should probably edit your answer to include a link. (e.g. fork it on github). – Peter Cordes Jan 23 '17 at 10:51
  • @PeterCordes - It seems to be working. I'll clean it up and do more testing. I may also have a minor clean up for the less than 16 byte handling for the original github version, using variable shifts with just one check for < 4 handling (one fixed shift, one variable shift), and both the versions I have are for Visual Studio. – rcgldr Jan 23 '17 at 12:21
  • @rcgldr: Efficiently handling inputs of less than a full vector width is often annoying :/ That seems to be a nearly universal problem when vectorizing. – Peter Cordes Jan 23 '17 at 15:01
  • @PeterCordes - The boundaries are 128 bytes, 32 bytes, 16 bytes, and 4 bytes. The code from Intel has a series of conditionals for <16, <8, <4, <2, <1, ... , with hard coded data stores into a stack buffer, a load and shift as needed. I combined the 4 to 15 remaining to just copy data into stack buffer, load and do a variable shift, then jump to the 16 byte fold code. For 1 to 3 bytes, the variable shift puts the data in the middle of a register and jumps to 8 byte (barrett) code. The 1 to 3 byte case is different since it has to deal with the 4 byte input crc. – rcgldr Jan 25 '17 at 05:00
  • @PeterCordes and others that may read this - late follow up: there's a difference between the github code and the Intel doc. The github code shifts constants like rk3 left 32 bits, folding into the upper 96 of 128 bits, while the Intel doc doesn't shift constants, folding into the lower 96 of 128 bits (e.g. Intel rk3 = 2^(32*32) mod Q, github rk3 = (2^(32*31) mod Q) << 32). – rcgldr Dec 19 '18 at 16:14

1 Answer


It appears that Figure 8 shows the final 128 bits (the working remainder XORed with the last 128 bits of buffer data) followed by 32 bits of appended zeros, since crc32 = (msg(x) • x^32) % p(x). So you see a total of 160 bits, split as 64|32|32|32.
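
As a sanity check on that identity, here is a minimal sketch in Python (my own illustration, assuming the non-reflected CRC-32 variant with initial value 0 and no final XOR, which is what the polynomial identity describes):

```python
def gf2_mod(a, p):
    """Polynomial remainder in GF(2)[x], polynomials encoded as integers."""
    dp = p.bit_length() - 1
    while a.bit_length() > dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a

POLY = 0x104C11DB7            # CRC-32 polynomial, x^32 term included

def crc32_bitwise(data):
    """MSB-first bit-at-a-time CRC-32: init 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            crc = ((crc << 1) ^ 0x04C11DB7 if crc & 0x80000000
                   else crc << 1) & 0xFFFFFFFF
    return crc

msg = b"123456789"
m = int.from_bytes(msg, "big")                 # msg(x) as a polynomial
# Appending 32 zero bits is multiplying by x^32; the CRC is the remainder.
assert crc32_bitwise(msg) == gf2_mod(m << 32, POLY)
```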

My assumption is that the upper 64 bits are multiplied by K5, producing a 96-bit product. That product is then XORed into the lower 96 bits of the 160-bit entity (remember the lower 32 bits start off as 32 bits of appended zeros).

Then the upper 32 bits (not 64) of the lower 96 bits are multiplied by K6, producing a 64-bit product, which is XORed into the lower 64 bits of the 160-bit entity.

Then the Barrett algorithm is used to produce a 32-bit CRC from the lower 64 bits of the 160-bit entity (where the lower 32 bits were originally appended zeros).
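
The two folds just described can be sketched with plain integers standing in for xmm registers. This is a hedged sketch: K5 = x^96 mod P and K6 = x^64 mod P is my reading of what the paper's constants encode, and clmul plays the role of pclmulqdq:

```python
def clmul(a, b):
    """Carry-less multiply of two integers (product in GF(2)[x])."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf2_mod(a, p):
    """Polynomial remainder in GF(2)[x]."""
    dp = p.bit_length() - 1
    while a.bit_length() > dp:
        a ^= p << (a.bit_length() - 1 - dp)
    return a

P  = 0x104C11DB7            # CRC-32 polynomial with the x^32 term
K5 = gf2_mod(1 << 96, P)    # x^96 mod P -- folds the upper 64 bits
K6 = gf2_mod(1 << 64, P)    # x^64 mod P -- folds the next 32 bits

def reduce_160_to_64(v):
    """Reduce a 160-bit value (128 data bits + 32 appended zeros) to
    64 bits while preserving its residue mod P."""
    upper64, low96 = v >> 96, v & ((1 << 96) - 1)
    low96 ^= clmul(upper64, K5)         # <=95-bit product XORed into lower 96
    upper32, low64 = low96 >> 64, low96 & ((1 << 64) - 1)
    return low64 ^ clmul(upper32, K6)   # <=63-bit product XORed into lower 64

v = 0x0123456789ABCDEF0F1E2D3C4B5A6978 << 32    # sample 160-bit entity
assert gf2_mod(reduce_160_to_64(v), P) == gf2_mod(v, P)
```

So yes, each multiplication result gets XORed with the lower bits, exactly as in the earlier folds; only the widths shrink.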


To explain the Barrett algorithm, consider the 64 bits as a dividend and the CRC polynomial as a divisor; all of the arithmetic is carry-less, so subtraction is XOR. Then remainder = dividend - (⌊ dividend / divisor ⌋ · divisor). Rather than actually dividing, pclmulqdq is used, and ⌊ dividend / divisor ⌋ = (dividend · ⌊ 2^64 / divisor ⌋) >> 64.
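
That formula can be sketched in integer arithmetic (clmul standing in for pclmulqdq; barrett is a hypothetical helper name for illustration):

```python
def clmul(a, b):
    """Carry-less multiply (GF(2) polynomial product)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def gf2_divmod(a, p):
    """Floor division and remainder in GF(2)[x]."""
    dp = p.bit_length() - 1
    q = 0
    while a.bit_length() > dp:
        shift = a.bit_length() - 1 - dp
        q |= 1 << shift
        a ^= p << shift
    return q, a

P = 0x104C11DB7                  # CRC-32 polynomial including x^32
MU, _ = gf2_divmod(1 << 64, P)   # carry-less floor(2^64 / P), 33 bits

def barrett(r):
    """Reduce a 64-bit value to its 32-bit remainder mod P."""
    q = clmul(r, MU) >> 64       # floor(dividend / divisor)
    return r ^ clmul(q, P)       # dividend - quotient*divisor (XOR in GF(2))

assert barrett(0x0123456789ABCDEF) == gf2_divmod(0x0123456789ABCDEF, P)[1]
```

In the actual SSE code this single wide multiply is split across two pclmulqdq instructions (each multiplies 64-bit operands), but the math is the same.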

rcgldr
  • Thank you! I did get it to work correctly by appending the 32 bits of zeros and xor-ing the multiplication results with the lower bits. – M. Anne Oct 03 '16 at 21:50