CRC32 of repeated sequence of bytes

Question

I read Mark Adler's explanation (here and here) of how crc32_combine uses a math trick to compute the effect of feeding one zero bit into the CRC32 state machine in O(log(n)) time, with a 32x32 matrix and exponentiation-by-squaring.

Is there a trick for efficiently calculating the effect of feeding in a sequence of n bytes repeated m times?

Let x be the CRC32 bit vector. We can compute a matrix M for the effects of applying 8n zero bits (O(log(n)) time), and a bit vector B = CRC32(the n bytes) (O(n) time). We can then calculate the new x as x' = Mx + B, and then repeat this step m times (O(m) time). This results in a total of O(n+m) time, but is there a better algorithm? The O(n) term seems unavoidable, since we'd have to read the n bytes as input, but the O(m) part seems like it could be optimized.

May be homogeneous coordinates do the trick. Build 33x33 matrix: ((M, byte),(0000001)) and try to exponentiate it. — Vovanium, Dec 08 '19 at 17:57
I updated my answer after verifying it would work using test code. — rcgldr, Dec 12 '19 at 01:39

rcgldr · Answer 1 · 2020-08-13T16:00:21.020

There is an alternative method using carryless mutliply, such as pclmulqdq (using xmm registers) on X86 processors (since 2010), which should be quicker than working with a m x m or larger binary matrix, and instead working with a 2 x 2 integer matrix using carryless multiplies modulo the 33 bit polynomial. Even if the method described below uses a software based carryless multiply, it still may be faster than working with a m x m or larger binary matrix.

Define + and - to mean XOR, · to mean carryless multiply, / to mean borrowless divide, % to mean borrowless modulo. Define p = 33 bit polynomial. Then

(a · b) % p

can be implemented using 3 carryless multiplies and one xor. Define the "inverse" of the polynomial as ip = (2^64)/p. Then (a · b) % p is implemented as

(a · b) % p = (a · b) - (p · ((a · b · ip)>>64))

For pclmulqdq, the >>64 is done by specifying the upper 64 bits of a 128 bit xmm register. The rest of the operations specify the lower 64 bits and produce a 128 bit product, of which only the lower 64 bits are used except for that one case of >>64.

Let B = CRC(n byte sequence). Let C1 = 2^{n8}%p. This can be done by exponentiation by squaring of (2^8)%p to get 2^{n8}%p in O(log(n)) time. For m = 2:

CRC = (B · (C1 + 1)) % p

For m = 3:

CRC = (B · (C1^2 + C1 + 1)) % p

For m = 4:

CRC = (B · (C1^3 + C1^2 + C1 + 1)) % p

The time complexity of the second step can be reduced to O(log(m)) using exponentiation by squaring of this 2x2 matrix (using carryless multiplies % p):

M = | C1 1 |
    |  0 1 |

CRC = (M^m)[0][1] · B) % p

With a X86 with pclmulqdq (xmm registers), a fast assembly program can calculate CRC at over 15 giga-bytes per second (on most recent processors), so trying to optimize repeating strings may not be that beneficial.

Vovanium · Answer 2 · 2019-12-15T00:22:04.953

If i didn't miss something, case with non-zero bytes is solved through exponentiation-by-squaring as well.

Say, you have a formula:

x' = Mx + B,

where x and B are m-dimensional vector, and M is a m×m matrix. If we define matrix N of size (m+1)×(m+1) and vectors y, y' of size (m+1), such that:

     ||   M   B ||       || x ||
 N = ||         ||   y = ||   ||
     || 0...0 1 ||       || 1 ||

then we can represent the formula above with single matrix multiplication:

           ||   M   B || || x ||   || M×x + B×1 ||   || M×x + B ||   || x' ||
y' = Ny =  || 0...0 1 || || 1 || = || 0×x + 1×1 || = ||     1   || = || 1  ||

Matrix N can be exponentiated as usual.

This is known as homogeneous coordinates is projective geometry. I do not know well projective geometry but they're used widely in 2d and 3d graphics.

I finally implemented this and it works (https://github.com/theonlypwner/crc32). I will probably try implementing the other answer in the future. — Victor, Aug 29 '23 at 00:05

CRC32 of repeated sequence of bytes

2 Answers2