There is an alternative method using carryless mutliply, such as pclmulqdq (using xmm registers) on X86 processors (since 2010), which should be quicker than working with a m x m or larger binary matrix, and instead working with a 2 x 2 integer matrix using carryless multiplies modulo the 33 bit polynomial. Even if the method described below uses a software based carryless multiply, it still may be faster than working with a m x m or larger binary matrix.
Define + and - to mean XOR, · to mean carryless multiply, / to mean borrowless divide, % to mean borrowless modulo. Define p = 33 bit polynomial. Then
(a · b) % p
can be implemented using 3 carryless multiplies and one xor. Define the "inverse" of the polynomial as ip = (2^64)/p. Then (a · b) % p is implemented as
(a · b) % p = (a · b) - (p · ((a · b · ip)>>64))
For pclmulqdq, the >>64 is done by specifying the upper 64 bits of a 128 bit xmm register. The rest of the operations specify the lower 64 bits and produce a 128 bit product, of which only the lower 64 bits are used except for that one case of >>64.
Let B = CRC(n byte sequence). Let C1 = 2^{n8}%p. This can be done by exponentiation by squaring of (2^8)%p to get 2^{n8}%p in O(log(n)) time. For m = 2:
CRC = (B · (C1 + 1)) % p
For m = 3:
CRC = (B · (C1^2 + C1 + 1)) % p
For m = 4:
CRC = (B · (C1^3 + C1^2 + C1 + 1)) % p
The time complexity of the second step can be reduced to O(log(m)) using exponentiation by squaring of this 2x2 matrix (using carryless multiplies % p):
M = | C1 1 |
| 0 1 |
CRC = (M^m)[0][1] · B) % p
With a X86 with pclmulqdq (xmm registers), a fast assembly program can calculate CRC at over 15 giga-bytes per second (on most recent processors), so trying to optimize repeating strings may not be that beneficial.