I've been trying to implement the CRC32 algorithm described in this Intel white paper: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf. I'm confused about Step 3, the reduction from 128 bits to 64 bits. Hopefully someone can clarify the steps for me:
- Multiply the upper 64 bits of the remaining 128 bits by the constant K5; the result is 96 bits.
- Multiply the upper 64 bits of those 96 bits by the constant K6; the result is 64 bits.
Do these results need to be XORed with the lower 64 bits of the starting 128 bits, following the pattern of the previous folds? Figure 8 in the paper doesn't specify, and I am confused by the alignment of the data in the figure.
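
To make the question concrete, here is a minimal Python sketch of the arithmetic as I currently read it, with the XORs included. It rests on several assumptions that are mine and not confirmed by the paper: K5 = x^96 mod P and K6 = x^64 mod P, the non-reflected CRC-32 polynomial 0x104C11DB7, a second multiply that takes only the bits above the low 64 of the intermediate (its upper 32 bits), and a result that should stay congruent to the 128-bit value times x^32 mod P. The helper names (`clmul`, `polymod`, `reduce_128_to_64`) are made up for illustration, and `clmul` is a plain software stand-in for PCLMULQDQ.

```python
P = 0x104C11DB7  # assumed: non-reflected CRC-32 polynomial, degree 32
MASK64 = (1 << 64) - 1

def clmul(a, b):
    """Carry-less (GF(2)) multiply, a software stand-in for PCLMULQDQ."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def polymod(a, p=P):
    """Remainder of polynomial a divided by p over GF(2)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a

K5 = polymod(1 << 96)  # assumed definition: x^96 mod P
K6 = polymod(1 << 64)  # assumed definition: x^64 mod P

def reduce_128_to_64(d):
    """Fold a 128-bit value to at most 64 bits, congruent to d * x^32 mod P."""
    hi, lo = d >> 64, d & MASK64
    # hi * x^96 is congruent to hi * K5; lo * x^32 is kept by XORing it in.
    t = clmul(hi, K5) ^ (lo << 32)      # at most 96 bits
    hi32, lo64 = t >> 64, t & MASK64
    # hi32 * x^64 is congruent to hi32 * K6; the low 64 bits are XORed back in.
    return clmul(hi32, K6) ^ lo64       # at most 64 bits

d = 0x0123456789ABCDEFFEDCBA9876543210
assert polymod(reduce_128_to_64(d)) == polymod(d << 32)
```

Under these assumptions the XORs look necessary: without them the low 64 bits never contribute to the result, so an input whose upper half is zero would always reduce to zero. What I still can't tell is whether this alignment matches what Figure 8 intends, or how it changes for the bit-reflected variant.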