Here is an article that describes how to calculate the CRC32 of at most 1024 bytes using the built-in CRC32 instruction found in modern x86-64 processors. However, I need to calculate the CRC32 of more than 1024 bytes. Would it be correct to calculate the CRC32 of each 1024-byte block and sum the results at the end? If not, what is the correct way to do it?
2 Answers
Quoting from the Intel white paper that your article mentions:
Instead of computing CRC of the entire message with a traditional linear method, we use a faster method to split an arbitrary length buffer to a number of smaller fixed size segments, compute the CRC on these segments in parallel followed by a recombination step of computing the effective CRC using the partial CRCs of the segments.
Also,
The final recombination of CRCs adds an overhead and can be implemented with lookup tables on the Nehalem microarchitecture – we show how to do this with as few tables as possible while giving excellent overall performance on the range of sizes. The PCLMULQDQ instruction in the Westmere microarchitecture allows efficient recombination of CRCs without lookup tables. The various methods are thoroughly explained in this paper with real code examples.
So you need to study this paper in detail: Fast CRC Computation for iSCSI Polynomial Using CRC32 Instruction
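To make the recombination step concrete, here is a minimal Python sketch of the underlying identity, not the paper's optimized implementation. It assumes the CRC-32C (iSCSI) polynomial that the x86 CRC32 instruction uses; the names `raw_update`, `crc32c`, and `combine` are hypothetical, and the shift over zero bytes is done with a naive loop where the paper uses lookup tables or PCLMULQDQ.

```python
POLY = 0x82F63B78  # CRC-32C (Castagnoli/iSCSI) polynomial, bit-reflected


def raw_update(state, data):
    """Advance a raw CRC-32C register over data (no initial or final inversion)."""
    for byte in data:
        state ^= byte
        for _ in range(8):
            state = (state >> 1) ^ (POLY if state & 1 else 0)
    return state


def crc32c(data):
    """Standard CRC-32C: start from all ones, invert at the end."""
    return raw_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def combine(crc_a, crc_b, len_b):
    """Return crc32c(A + B) given crc32c(A), crc32c(B) and len(B).

    The CRC of A is shifted past len_b zero bytes and XORed with the CRC of B.
    The shift is done naively here, in O(len_b) steps (illustration only).
    """
    return raw_update(crc_a, bytes(len_b)) ^ crc_b


# Two segments whose CRCs could have been computed independently.
seg_a = bytes(range(256)) * 4
seg_b = b"123456789"
whole = combine(crc32c(seg_a), crc32c(seg_b), len(seg_b))
```

Because `combine` only needs the two partial CRCs and the length of the second segment, the segments themselves can be processed independently; a real implementation would replace the zero-byte loop with a precomputed shift.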

No, just adding won't do the job.
The article you linked tells us how to do it:
The CRC output of one calculation is used as the initial CRC for the next calculation [...]
To cover the case of the final result being larger than 0xffffffff
just do crc32 = ~crc32 & 0xffffffff
after the final calculation.
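The chaining described above can be sketched in Python as follows. This is a bitwise software model under the assumption of the CRC-32C polynomial used by the x86 instruction, not the article's code; `crc32c_update` and `crc32c_blocks` are hypothetical names, with `crc32c_update` standing in for one hardware-accelerated call over a block of at most 1024 bytes.

```python
CRC32C_POLY = 0x82F63B78  # CRC-32C (Castagnoli) polynomial, bit-reflected


def crc32c_update(crc, block):
    """Feed one block into the running CRC register and return the new value.

    Stand-in for the per-block hardware routine (assumption): it performs
    no initial or final inversion, it only advances the register.
    """
    for byte in block:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc


def crc32c_blocks(data, block_size=1024):
    """CRC-32C of data of any length, processed block by block."""
    crc = 0xFFFFFFFF  # initial value, set once before the first block
    for start in range(0, len(data), block_size):
        # The output of one call is the initial CRC of the next call.
        crc = crc32c_update(crc, data[start:start + block_size])
    # Invert once after the final block; the mask keeps the result in 32 bits.
    return ~crc & 0xFFFFFFFF


message = bytes(range(256)) * 20  # 5120 bytes, more than one block
checksum = crc32c_blocks(message)
```

The result does not depend on the block size, since only the running CRC crosses block boundaries; the inversion is applied once, after the last block.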

- OK, so it's a matter of passing the previous CRC to the next call. No problem with that! – pythonic Apr 26 '12 at 13:10
- This is simpler than the technique Pavan describes, but of course if you do it this way then you can't parallelize the different chunks; they have to be processed sequentially. That said, I personally haven't ever felt a need to parallelize a checksum calculation, one core should be enough for anyone ;-) – Steve Jessop Apr 26 '12 at 13:12