3

I am calculating CRC on a large chunk of data every cycle in hardware (64B per cycle). In order to parallelize the CRC calculation, I want to calculate the CRC for small data chunks and then XOR them in parallel.

Approach:

  1. We divide the data into small chunks (64B data divided into 8 chunks of 8B each).
  2. Then we calculate CRC's for all the chunks individually (8 CRC's in parallel for 8B chunks).
  3. Finally calculate the CRC for padded data. This answer points out that the CRC for padded data is obtained by multiplying the old CRC with x^n.

Hence, I am calculating the CRC for a small chunk of data, then multiply it with CRC of 0x1 shifted by 'i' times as shown below.

In short, I am trying to accomplish below: CRC approach

For example: CRC-8 on this site:

Input Data=(0x05 0x07) CRC=0x54
Step-1: Data=0x5 CRC=0x1B
Step-2: Data=0x7 CRC=0x15
Step-3: Data=(0x1 0x0) CRC=0x15
Step-4: Multiply step-1 CRC and step-3 CRC with primitive polynomial 0x7. So, I calculate (0x1B).(0x15) = (0x1 0xC7) mod 0x7.
Step-5: Calculate CRC Data=(0x1 0xC7) CRC=0x4E (I assume this is same as (0x1 0xC7) mod 0x7)
Step-6: XOR the result to get the final CRC. 0x4E^0x15=0x5B

As we can see, the result in step-6 is not the correct result.

Can someone help me how to calculate the CRC for padded data? Or where am I going wrong in the above example?

sharvil111
  • 4,301
  • 1
  • 14
  • 29
  • What is the format of the messages? Are all messages 63 bytes of data and 1 byte of CRC? – rcgldr Jan 20 '22 at 19:31
  • No, it is variable length, anything from 1 to 64 byte. – sharvil111 Jan 22 '22 at 15:42
  • Does the 64 bytes include the CRC byte? If so, then the smallest message size would be 2 bytes, 1 data byte, 1 CRC byte. Does the hardware need to generate the CRC byte as well as check it? How is the data being transmitted / received? Usually hardware for serial transfer of data can calculate CRC on the fly 1 bit at a time using a [linear feedback shift register](https://en.wikipedia.org/wiki/Linear-feedback_shift_register). – rcgldr Jan 22 '22 at 18:36
  • Agree, the logic at its core is a kind of LFSR, processing 64B in a single cycle. The data may or may not contain CRC depending on the command configuration (check/replace/update/remove/add etc. commands). Hence, for the worst case, the data length can be 1-64B per cycle. – sharvil111 Jan 23 '22 at 04:21
  • Why is the hardware processing an entire message at one time? How is the data being read | received? If data length is 1 to 64 bytes, then for a 64 byte message where is the CRC byte stored? – rcgldr Jan 23 '22 at 09:39
  • The flit length is 1-64B data bytes that is processed by HW at a time. The entire msg can be of any indefinite length. HW processes 64B max in one cycle and stores residue for the next flit. If data length is 64B, and HW has to append the CRC, then it shifts the EOP (end of packet) to next cycle and adds CRC over there. – sharvil111 Jan 23 '22 at 10:18
  • 1
    In that case, the HW can process up to 64B at at time, just generating 16 bit folded values (1 multiply per byte) until the end of a message is received, and at that point convert the final 16 bit folded value into a CRC as shown in my answer in the CRC(V) part. A slight adjustment will be needed to handle a final flit with an odd number of bytes. Is the hardware also going to be used to generate the CRC to append to the data? – rcgldr Jan 23 '22 at 11:19
  • That is exactly I am doing. For the final flit, I replace the unused data bytes with zeros, which results in 0 in multiplication and doesn't affect the output. :-) – sharvil111 Jan 23 '22 at 11:34
  • Trailing zeroes end up "padding" the CRC, which you don't want. Say the final flit has just one byte, then the 16 bit folded value needs to be shifted: upper 8 bits folded 2 bytes right, lower 8 bits folded 1 byte right, then xor'ed with the one data byte (data byte xor'ed to lower 8 bits of folded value), to create the final 16 bit folded value V. Then calculate CRC(V) as shown in my answer. The point is to fold the prior 16 bit folded value so it "lines" up with the last data|crc byte in the flit buffer. – rcgldr Jan 23 '22 at 11:47
  • Sorry for the confusion, you are right. I mean I didn't consider the MSB of the data when length <64B. Somehow I got confused with some other thread. – sharvil111 Jan 23 '22 at 11:51
  • If you really intend on using the folding method, I can add example code to my answer in a day or two. Please let me know. – rcgldr Jan 23 '22 at 12:05
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/241329/discussion-between-sharvil111-and-rcgldr). – sharvil111 Jan 23 '22 at 12:51

2 Answers2

3

Rather than calculate and then adjust multiple CRC's, bytes of data can be carryless multiplied to form a set of 16 bit "folded" products, which are then xor'ed and a single modulo operation performed on the xor'ed "folded" products. An optimized modulo operation uses two carryless multiples, so it's avoided until all folded products have been generated and xor'ed together. A carryless multiply uses XOR instead of ADD and a borrowless divide uses XOR instead of SUB. Intel has a pdf file about this using the XMM instruction PCLMULQDQ (carryless multiply), where 16 bytes are read at a time, split into two 8 byte groups, with each group folded into a 16 byte product, and the two 16 byte products are xor'ed to form a single 16 byte product. Using 8 XMM registers to hold folding products, 128 bytes at time are processed. (256 bytes at at time in the case of AVX512 and ZMM registers).

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf

Assume your hardware can implement a carryless multiply that takes two 8 bit operands and produces a 16 bit (technically 15 bit) product.

Let message = M = 31 32 33 34 35 36 37 38. In this case CRC(M) = C7

pre-calculated constants (all values shown in hex):

2^38%107 = DF    cycles forwards 0x38 bits
2^30%107 = 29    cycles forwards 0x30 bits
2^28%107 = 62    cycles forwards 0x28 bits
2^20%107 = 16    cycles forwards 0x20 bits
2^18%107 = 6B    cycles forwards 0x18 bits
2^10%107 = 15    cycles forwards 0x10 bits
2^08%107 = 07    cycles forwards 0x08 bits
2^00%107 = 01    cycles forwards 0x00 bits

16 bit folded (cycled forward) products (can be calculated in parallel):    
31·DF = 16CF
32·29 = 07E2
33·62 = 0AC6
34·16 = 03F8
35·6B = 0A17
36·15 = 038E
37·07 = 0085
38·01 = 0038
        ----
V =     1137    the xor of the 8 folded products
CRC(V) = 113700 % 107 = C7

To avoid having to use borrowless divide for the modulo operation, CRC(V) can be computed using carryless multiply. For example

V = FFFE
CRC(V) = FFFE00 % 107 = 23.

Implementation, again all values in hex (hex 10 = decimal 16), ⊕ is XOR.

input:
V = FFFE
constants:
P = 107                  polynomial
I = 2^10 / 107 = 107     "inverse" of polynomial
                         by coincidence, it's the same value
2^10 % 107 = 15          for folding right 16 bits
fold the upper 8 bits of FFFE00 16 bits to the right:
U = FF·15 ⊕ FE00 = 0CF3 ⊕ FE00 = F2F3  (check: F2F3%107 = 23 = CRC)
Q = ((U>>8)·I)>>8 = (F2·107)>>8 = ...
to avoid a 9 bit operand, split up 107 = 100 ⊕ 7 
Q = ((F2·100) ⊕ (F2·07))>>8 = ((F2<<8) ⊕ (F2·07))>>8 = (F200 ⊕ 02DE)>>8 = F0DE>>8 = F0
X = Q·P = F0·107 = F0·100 ⊕ F0·07 = F0<<8 ⊕ F0·07 = F000 ⊕ 02D0 = F2D0
CRC = U ⊕ X = F2F3 ⊕ F2D0 = 23

Since the CRC is 8 bits, there's no need for the upper 8 bits in the last two steps, but it doesn't help that much for the overall calculation.

X = (Q·(P&FF))&FF = (F0·07)&FF = D0
CRC = (U&FF) ⊕ X = F3 ⊕ D0 = 23

Example program to generate 2^0x10 / 0x107 and powers of 2 % 0x107:

#include <stdio.h>

typedef unsigned char  uint8_t;
typedef unsigned short uint16_t;

#define poly 0x107

uint16_t geninv(void)           /* generate 2^16 / 9 bit poly */
{
uint16_t q = 0x0000u;           /* quotient */
uint16_t d = 0x0001u;           /* initial dividend = 2^0 */
    for(int i = 0; i < 16; i++){
        d <<= 1;
        q <<= 1;
        if(d&0x0100){           /* if bit 8 set */
            q |= 1;             /*   q |= 1 */
            d ^= poly;          /*   d ^= poly */
        }
    }
    return q;                   /* return inverse */
}

uint8_t powmodpoly(int n)       /* generate 2^n % 9 bit poly */
{
uint16_t d = 0x0001u;           /* initial dividend = 2^0 */
    for(int i = 0; i < n; i++){
        d <<= 1;                /* shift dvnd left */
        if(d&0x0100){           /* if bit 8 set */
            d ^= poly;          /*   d ^= poly */
        }
    }
    return (uint8_t)d;          /* return remainder */
}

int main()
{
    printf("%04x\n", geninv());
    printf("%02x %02x %02x %02x %02x %02x %02x %02x %02x %02x\n",
           powmodpoly(0x00), powmodpoly(0x08), powmodpoly(0x10), powmodpoly(0x18),
           powmodpoly(0x20), powmodpoly(0x28), powmodpoly(0x30), powmodpoly(0x38),
           powmodpoly(0x40), powmodpoly(0x48));
    printf("%02x\n", powmodpoly(0x77));  /* 0xd9, cycles crc backwards 8 bits */
    return 0;
}

Long hand example for 2^0x10 / 0x107.

                            100000111    quotient
                  -------------------
divisor 100000111 | 10000000000000000    dividend
                    100000111
                    ---------
                          111000000
                          100000111
                          ---------
                           110001110
                           100000111
                           ---------
                            100010010
                            100000111
                            ---------
                                10101    remainder

I don't know how many registers you can have in your hardware design, but assume there are five 16 bit registers used to hold folded values, and either two or eight 8 bit registers (depending on how parallel the folding is done). Then following the Intel paper, you fold values for all 64 bytes, 8 bytes at a time, and only need one modulo operation. Register size, fold# = 16 bits, reg# = 8 bits. Note that powers of 2 modulo poly are pre-calculated constants.

foldv = prior buffer's folding value, equivalent to folded msg[-2 -1]
reg0 = foldv>>8
reg1 = foldv&0xFF
foldv  = reg0·((2^0x18)%poly)    advance by 3 bytes
foldv ^= reg1·((2^0x10)%poly)    advance by 2 bytes
fold0 = msg[0 1] ^ foldv         handling 2 bytes at a time
fold1 = msg[2 3]
fold2 = msg[4 5]
fold3 = msg[6 7]
for(i = 8; i < 56; i += 8){
    reg0  = fold0>>8
    reg1  = fold0&ff
    fold0  = reg0·((2^0x48)%poly)    advance by 9 bytes
    fold0 ^= reg1·((2^0x40)%poly)    advance by 8 bytes
    fold0 ^= msg[i+0 i+1]
    reg2  = fold1>>8                 if not parallel, reg0
    reg3  = fold1&ff                  and reg1
    fold1  = reg2·((2^0x48)%poly)    advance by 9 bytes
    fold1 ^= reg3·((2^0x40)%poly)    advance by 8 bytes
    fold1 ^= msg[i+2 i+3]
    ...
    fold3 ^= msg[i+6 i+7]
}
reg0  = fold0>>8
reg1  = fold0&ff
fold0  = reg0·((2^0x38)%poly)    advance by 7 bytes
fold0 ^= reg1·((2^0x30)%poly)    advance by 6 bytes
reg2  = fold1>>8                 if not parallel, reg0
reg3  = fold1&ff                  and reg1
fold1  = reg2·((2^0x28)%poly)    advance by 5 bytes
fold1 ^= reg3·((2^0x20)%poly)    advance by 4 bytes
fold2 ...                        advance by 3 2 bytes
fold3 ...                        advance by 1 0 bytes
foldv = fold0^fold1^fold2^fold3

Say the final buffer has 5 bytes:

foldv = prior folding value, equivalent to folded msg[-2 -1]
reg0 = foldv>>8
reg1 = foldv&0xFF
foldv  = reg0·((2^0x30)%poly)    advance by 6 bytes
foldv ^= reg1·((2^0x28)%poly)    advance by 5 bytes
fold0 = msg[0 1] ^ foldv
reg0  = fold0>>8
reg1  = fold0&ff
fold0  = reg0·((2^0x20)%poly)    advance by 4 bytes
fold0 ^= reg1·((2^0x18)%poly)    advance by 3 bytes
fold1 = msg[2 3]
reg2  = fold1>>8
reg3  = fold1&ff
fold1  = reg0·((2^0x10)%poly)    advance by 2 bytes
fold1 ^= reg1·((2^0x08)%poly)    advance by 1 bytes
fold2 = msg[4]                   just one byte loaded
fold3 = 0
foldv = fold0^fold1^fold2^fold3
now use the method above to calculate CRC(foldv)
rcgldr
  • 27,407
  • 3
  • 36
  • 61
  • Thanks a lot for the detailed response. How do we calculate the value of (2^64)%poly? Is it the same as CRC(2^64) since we are calculating the remainder? In the above example, I did CRC(0x1 0x0) (for (2^8)%107)) but I got 0x15 instead of 0x7 in your case. My register can be 8B wide, and I'm planning to have the precalculated CRC's in a ROM. – sharvil111 Jan 16 '22 at 00:47
  • Thanks a lot for all the details. From the given code, I understood the implementation of carryless multiplication. I can implement the required functionality with this approach. Thanks again! – sharvil111 Jan 16 '22 at 08:52
  • Thanks for the pseudo code. Two questions, (1) `uint16_t d = 0x8000u;` is because we want to find division of 2^10 (hex). For 2^8, 2^18, 2^20 (hex), we need wider vector right? (2) I simulated the code as it is and it is giving division of 2^10 (hex) as 0x107 as opposed to 0x15. Is there any issue in my understanding here? – sharvil111 Jan 18 '22 at 01:49
  • 1
    @sharvil111 - I updated my answer to simplify the code used to generate the constants. Although the "inverse" is 0x107, it is split up into 0x100 and 0x07, and the 0x100 part is handled by shifting left 8 bits, so only the 0x07 part is used for a carryless multiply. This means that geninv() could just return an 8 bit value, knowing that the 0x100 part is implied. In my repository for [crc64f](https://github.com/jeffareid/crc/tree/master/crc64f), rk07 (inverse) and rk08 (polynomial) have an implied 2^64 bit, which is handled in the barrett reduction step. – rcgldr Jan 18 '22 at 15:36
  • 1
    @sharvil111 - (2) - the quotient is 0x107, the remainder is 0x15. – rcgldr Jan 18 '22 at 17:03
  • Agree! I was able to generate the constants for the lookup! Thanks a lot for the help. :-) – sharvil111 Jan 20 '22 at 01:23
  • @sharvil111 - if you implement multiple multipliers, a multiply by constant only needs xor gates. – rcgldr Jan 20 '22 at 09:22
  • If generating a CRC, set the last byte of a message to 0, calculate the CRC as shown above, then multiply `(crc · 0xd9) % 107` to cycle it backwards 8 bits and store it in the last byte. Since 0x107 is not prime, 0x107 = 0xfd · 0x03 (carryless multiply), `2^0x7f%0x107` = 0x01. So logarithms are modulo 0x7f and not modulo 0xff. This is why cycling backwards 8 bits uses a constant of `2^0x77%0x107` (=0xd9) instead of `2^0xf7%0x107` – rcgldr Jan 21 '22 at 21:52
  • Apologies to be a nitpicker, I was trying to do an experiment with the earlier approach as a tryout which didn't work.. can you let me know where in the dot productwas I wrong? `Data=(0xFF 0x7), precalculated 2^8 % 107 = 7, CRC(0xFF) = 0xF3 CRC(0x07) = 0x15 (0xF3 · 0x7) = (0x2D9) CRC(FF07) = 0x2D9 xor 0x15= 0x2cc` which exceeds 8bits of buffer length and is not the correct CRC. What am I missing if the dot product exceeds 8 bit here? – sharvil111 Jan 23 '22 at 06:19
  • 1
    @sharvil111 - You missed a modulo step: CRC(0xFF07) = 0xFF0700%107 = C2. CRC(0xFF) = 0xFF00%107 = 0xF3. CRC(0x07) = 0x0700%107 = 0x15. Then to adjust the F3, use (F3 · 0x07)%0x107 = 0x2D9 %107 = 0xD7. 0xD7 xor 0x15 = 0xC2. This earlier method uses 1 multiply and 3 modulos (each modulo uses 2 multiplies). Using the later method in my answer reduces this to 2 multiplies and 1 modulo. 0xFF · 2^0x10%107 = 0xFF · 0x15 = 0x0CF3. 0x07 · 2^0x8%107 = 0x07 · 0x07 = 0x0015. 0xCF3 xor 0x0015 = 0x0CE6. 0x0CE6 % 107 = 0xC2. – rcgldr Jan 23 '22 at 09:30
  • Thanks again! Agree, I'm keeping the latter method in practice.Can you kindly point me to the reference `each modulo uses 2 multiplies` – sharvil111 Jan 23 '22 at 11:30
  • Look at the CRC(V) part of my answer. Note I was using · to mean carryless multiply, not a dot product. The modulo operation starts with the 16 bit value U. To calculate U%0x107, Q = U / 0x107, implemented by multiplying by the inverse as shown in my answer. X = Q * 0x107. CRC = the lower 8 bits of (U xor X). – rcgldr Jan 23 '22 at 11:39
1

As shown in your diagram, you need to calculate the CRC of 0x05 0x00, (A,0), and the CRC of 0x00 0x07, (0,B), and then exclusive-or those together. Calculating on the site you linked, you get 0x41 and 0x15 respectively. Exclusive-or those together, and, voila, you get 0x54, the CRC of 0x05 0x07.

There is a shortcut for (0,B), since for this CRC, the CRC of a string of zeros is zero. You can calculate the CRC of just 0x07 and get the same result as for 0x00 0x07, which is 0x15.

See crcany for how to combine CRCs in general. crcany will generate C code to compute any specified CRC, including code to combine CRCs. It employs a technique that applies n zeros to a CRC in O(log(n)) time instead of O(n) time.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
  • Thank you for the inputs! Can I split (0x5 0x0) further into smaller chunks? Calculating CRC for 0x5 individually and multiplying the result with the CRC of (0x1 0x0) or something else (CRC of individual multiplicands). This is because my actual data is 64B and I want to process smaller chunks (8B) in parallel (in this example calculating CRC for (0x5 0x0) still keeps CRC module data width as 16 bits). – sharvil111 Jan 15 '22 at 16:42
  • 1
    See update to answer. – Mark Adler Jan 15 '22 at 17:41