17

I use this trivial function to calculate the CRC checksum of a given file:

long i, j = 0;
int k = 0;
uint crc = 0xFFFFFFFF;
FileInfo file_info = new FileInfo(file);
byte[] file_buffer = new byte[32768];

FileStream file_stream = new FileStream(@file, FileMode.Open);
while ((i = file_stream.Read(file_buffer, 0, file_buffer.Count())) > 0)
{
    for (j = 0; j < i; j++)
    {
        uint before = crc;
        k = (int)((crc ^ file_buffer[j]) & 0x000000FFL);
        uint after = (uint)((crc >> 8) & 0x00FFFFFFL) ^ crc32_table[k];
        crc = after;
        uint test = (uint)((crc << 8) & 0x00FFFFFFL) ^ crc32_table[k];
        MessageBox.Show((~crc).ToString("X"));
    }
}
file_stream.Close();
return ~crc;

My question is this: Say I have a large file, of say 100MB. Is there any link between a CRC-32 calculation of the first 50MB and the last 50MB and the CRC-32 calculation of the 100MB file?

The reason I'm asking, is I have some very large files (~10GB give or take) which take some time to generate, but while they're being generated, most parts remain static, however, parts in the middle (known point) and right at the start (header, also known part/length). Calculating a CRC-32 checksum of a 10GB file takes quite some time, so I was wondering if there was any way to do it in chunks?

Nayuki
  • 17,911
  • 6
  • 53
  • 80
Mik
  • 199
  • 2
  • 5
  • 4
    Yes, it is possible. Just try understand the code and you will see why. – leppie Jun 27 '11 at 11:16
  • 8
    Thanks. Would you mind elaborating a bit? I've tried using the previous crc value as: private uint crc(string file, uint previous_value = 0xFFFFFFFF), but I get these results: `a: 158094AD b: 68CD9474 ab: CD530E90 b2: 42A6F4F3`, where b2 is crc with a basevalue of a's crcvalue. Sorry! My bad. I accidentally used crc, instead of negating it back (~crc). Works. Thanks a lot, leppie :) – Mik Jun 27 '11 at 11:26
  • Cool! Glad to help you 'see' the answer :) – leppie Jun 27 '11 at 11:41
  • 7
    could you please post your last comment as an answer and accept it? That way the question stops showing up as unanswered. – Peter G. Jun 30 '11 at 09:12
  • Normally you can not parallelize a CRC calculation. However, if you always use the same calculation, you can simply calculate separate CRC values over several smaller chunks of the file (using a seed value of crc=0xFFFFFFFF for each chunk), and then XOR all of the individual CRCs together. This will give you a reasonably good checksum. But it will not be a traditional CRC, so other tools will not be able to calculate it. – Mark Lakata Jul 11 '11 at 19:36
  • Googling for parallel crc32: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.75.387&rep=rep1&type=pdf – Vinicius Kamakura Jul 11 '11 at 22:07
  • 1
    @Mik Sumit your revised code and accept it as the answer. – Nahydrin Jul 11 '11 at 22:07

4 Answers4

7

Yes. See crc32_combine_() in zlib for an example.

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
Mark Adler
  • 101,978
  • 13
  • 118
  • 158
4

This equation is key:

CRC(a XOR b) == CRC(a) XOR CRC(b)

Suppose you want to compute the CRC of the following message:

"Always desire to learn something useful."

There exist functions to compute the CRC as:

crc_join(crc_part1("Always desire to lea"),
         crc_part2("rn something useful."))

If crc_part1 and crc_part2 zeros pad (\0) their arguments as shown, crc_join is just XOR.

crc_part1 = crc("Always desire to lea\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0")
crc_part2 = crc("\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0rn something useful.")

The trailing zeros can be accounted for in crc_part1 with lookup tables. The leading zeros can be ignored in crc_part2.

References:

  1. High-speed Parallel Architecture for Software-based CRC
    Youngju. Do, Sung-Rok. Yoon, Taekyu. Kim, Kwang Eui. Pyun and Sin-Chong. Park
  2. https://en.wikipedia.org/wiki/Cyclic_redundancy_check
matter
  • 57
  • 1
  • 2
  • 2
    crc("WORD WORD ") != crc ("WORD ") XOR crc ("WORD ") and that's what really matters. That's why I find your approach somehow misleading. – Pat Jul 16 '20 at 16:40
4

It is indeed possible to parallelize the CRC-32 computation, but the details are messy and I'd need to spend about a day to come up with the code.

Let's look at the basic CRC algorithm, where there is no negation and no bit reversal.

For the string of bytes that you want to compute the CRC of, let's call it the message. The basic idea is that you treat the message as a polynomial in GF(2), and you compute its remainder modulo the CRC polynomial.

The basic CRC algorithm is additive/linear. If you have two messages a and b of the same length, then CRC(a XOR b) = CRC(a) XOR CRC(b).

Furthermore, if you pad the message on the right side with n zeros, the new CRC will be the old CRC times x^n mod the CRC polynomial.

 

With all that said, the only way to solve your problem is to really understand the mathematics behind the CRC algorithm and write your own custom code. Here's a long but very complete explanation of CRCs: http://www.ross.net/crc/download/crc_v3.txt

Nayuki
  • 17,911
  • 6
  • 53
  • 80
  • Thank you so much for providing such a detailed and great introduction to the concept. – Mik Jul 12 '11 at 14:43
  • In another post, I explained exactly how CRC-32 is different from polynomial division (namely, the XOR, padding, and reversal): http://stackoverflow.com/questions/5047494/python-crc32-woes/6672302#6672302 – Nayuki Jul 12 '11 at 23:14
  • I wondered If I do multiple CRCs in different network layers what would be the undetected error probably over the whole? and is doing 32bit on packet size, and doing 64bit on chunk size (contains multiple packets) at the same time helpful for detecting errors? – Arash Oct 05 '20 at 04:46
2

I know this is an old question but this is what I use for "chunking" CRC calculations in case this helps anyone:

public static class Crc32Checksum
{
    private static readonly uint[] Table = GenerateTable();

    /// <summary>
    /// Calculates a CRC32 value for the data given.
    /// </summary>
    /// <param name="data">Data contents</param>
    /// <param name="offset">Byte offset to start reading</param>
    /// <param name="count">Number of bytes to read</param>
    /// <returns>The computed CRC32 value.</returns>
    public static int Calculate(byte[] data, int offset, int count) 
        => Calculate(0, data, offset, count);

    /// <summary>
    /// Calculates a new CRC32 value given additional data for the current CRC value.
    /// </summary>
    /// <param name="currentCrc">The current CRC value to start with</param>
    /// <param name="data">Additional data contents</param>
    /// <param name="offset">Byte offset to start reading</param>
    /// <param name="count">Number of bytes to read</param>
    /// <returns>The computed CRC32 value.</returns>
    public static int Calculate(int currentCrc, byte[] data, int offset, int count)
    {
        unchecked
        {
            uint crc = ~(uint)currentCrc;

            for (int i = offset, end = offset + count; i < end; i++)
                crc = (crc >> 8) ^ Table[(crc ^ data[i]) & 0xFF];

            return (int)~crc;
        }
    }

    private static uint[] GenerateTable()
    {
        unchecked
        {
            var table = new uint[256];

            const uint poly = 0xEDB88320;

            for (uint i = 0; i < table.Length; i++)
            {
                uint crc = i;

                for (int j = 8; j > 0; j--)
                {
                    if ((crc & 1) == 1)
                        crc = (crc >> 1) ^ poly;
                    else
                        crc >>= 1;
                }

                table[i] = crc;
            }

            return table;
        }
    }
}
Mike Marynowski
  • 3,156
  • 22
  • 32