49

I've seen 8-bit, 16-bit, and 32-bit CRCs.

At what point do I need to jump to a wider CRC?

My gut reaction is that it is based on the data length:

  1. 1-100 bytes: 8-bit CRC
  2. 101 - 1000 bytes: 16-bit CRC
  3. 1001 - ??? bytes: 32-bit CRC

EDIT: Looking at the Wikipedia page about CRC and Lott's answer, here's what we have:

<64 bytes: 8-bit CRC

<16K bytes: 16-bit CRC

<512M bytes: 32-bit CRC

Robert Deml

7 Answers

47

It's not a research topic. It's really well understood: http://en.wikipedia.org/wiki/Cyclic_redundancy_check

The math is pretty simple. An 8-bit CRC boils all messages down to one of 256 values. If your message is more than a few bytes long, the probability that multiple messages share the same hash value climbs higher and higher.

A 16-bit CRC, similarly, gives you one of 65,536 available hash values. What are the odds of any two messages ending up with the same one of these values?

A 32-bit CRC gives you about 4 billion available hash values.
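To make the counting argument concrete, here is a minimal sketch. It assumes a plain bitwise CRC-8 with polynomial 0x07, a zero initial value and no final XOR; the `crc8` helper is my own illustration, not a library function. It runs every possible two-byte message through the CRC and counts how many messages land on each check value.

```python
from collections import Counter


def crc8(data: bytes, poly: int = 0x07) -> int:
    """Illustrative bitwise CRC-8: MSB first, zero init, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


# Count how many of the 65,536 two-byte messages map to each CRC value.
counts = Counter(crc8(bytes([a, b])) for a in range(256) for b in range(256))

# Distinct CRC values, then the smallest and largest bucket sizes.
print(len(counts), min(counts.values()), max(counts.values()))
# prints: 256 256 256
```

With this setup every one of the 256 check values is shared by exactly 256 different two-byte messages, so collisions already exist as soon as the message is longer than the CRC itself.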

From the Wikipedia article: "maximal total blocklength is equal to 2**r − 1". That's in bits. You don't need to do much research to see that 2**9 - 1 is 511 bits, or about 64 bytes. Using CRC-8, multiple messages longer than 64 bytes will have the same CRC checksum value.

S.Lott
  • 2
    This is accurate and helpful if the CRC is being used to detect changes to a file. However, if it's being used as a digest to detect duplicates among files, then it's more complicated. In specific, the birthday paradox requires us to factor in how many distinct values we expect to have. – Steven Sudit Feb 23 '10 at 22:17
  • 1
    @Steven Sudit: Correct. Sadly the question is too vague to determine anything about the use of the CRC. – S.Lott Feb 24 '10 at 03:46
  • 2
    I think that *any* message longer than the CRC width (r-1, and not 2^r-1) will have multiple messages mapped to the same checksum. IOW, any message more than a byte long will have overlapping CRC8 mappings. I think (one of) the challenge(s) is to design the mapping such that the distribution of message strings over the hashes is uniform. – ysap Apr 14 '16 at 09:51
22

The effectiveness of a CRC is dependent on multiple factors. You not only need to select the SIZE of the CRC but also the GENERATING POLYNOMIAL to use. There are complicated and non-intuitive trade-offs depending on:

  • The expected bit error rate of the channel.
  • Whether the errors tend to occur in bursts or tend to be spread out (bursts are common).
  • The length of the data to be protected - maximum length, minimum length and distribution.

The paper Cyclic Redundancy Code Polynomial Selection For Embedded Networks, by Philip Koopman and Tridib Chakravarty, published in the proceedings of the 2004 International Conference on Dependable Systems and Networks, gives a very good overview and makes several recommendations. It also provides a bibliography for further understanding.

http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

Mary Ann Mojica
7

The choice of CRC length versus file size is mainly relevant in cases where one is more likely to have an input which differs from the "correct" input by three or fewer bits than to have one which is massively different. Given two inputs which are massively different, the probability of a false match will be about 1/256 with most forms of 8-bit check value (including CRC), 1/65536 with most forms of 16-bit check value (including CRC), etc. The advantage of CRC comes from its treatment of inputs which are very similar.

With an 8-bit CRC whose polynomial generates two periods of length 128, the fraction of single, double, or triple bit errors in a packet shorter than that which go undetected won't be 1/256--it will be zero. Likewise with a 16-bit CRC of period 32768, using packets of 32768 bits or less.

If packets are longer than the CRC period, however, then a double-bit error will go undetected if the distance between the erroneous bits is a multiple of the CRC period. While that might not seem like a terribly likely scenario, a CRC8 will be somewhat worse at catching double-bit errors in long packets than at catching "packet is totally scrambled" errors. If double-bit errors are the second most common failure mode (after single-bit errors), that would be bad. If anything that corrupts some data is likely to corrupt a lot of it, however, the inferior behavior of CRCs with double-bit errors may be a non-issue.
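The double-bit blind spot can be reproduced in a few lines. This is only a sketch under assumptions of my own choosing: a bitwise CRC-8 with polynomial 0x07, zero initial value and no final XOR. The helpers `crc8`, `period` and `flip_bit` are hypothetical names for illustration. `period` finds the smallest n for which x^n leaves remainder 1 modulo the generator, and the demo then flips two bits exactly that far apart.

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Illustrative bitwise CRC-8: MSB first, zero init, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def period(poly: int = 0x07) -> int:
    """Smallest n > 0 such that x**n mod g(x) == 1 for the 8-bit generator."""
    state, n = 1, 0
    while True:
        # Multiply the current residue by x and reduce modulo the generator.
        if state & 0x80:
            state = ((state << 1) ^ poly) & 0xFF
        else:
            state = (state << 1) & 0xFF
        n += 1
        if state == 1:
            return n


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (bit 0 = MSB of byte 0)."""
    out = bytearray(data)
    out[bit_index // 8] ^= 0x80 >> (bit_index % 8)
    return bytes(out)


msg = bytes(range(32))  # 256-bit packet, longer than the CRC period
n = period()

# Two flipped bits exactly one period apart: the CRC does not change.
bad = flip_bit(flip_bit(msg, 5), 5 + n)
print(n, crc8(bad) == crc8(msg))

# Same error, but one bit closer together: the CRC does change.
near = flip_bit(flip_bit(msg, 5), 5 + n - 1)
print(crc8(near) == crc8(msg))
```

The first comparison comes out equal even though the packet was corrupted, while the second error, with a spacing that is not a multiple of the period, is caught.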

supercat
4

I think the size of the CRC has more to do with how unique a CRC you need than with the size of the input data. This is related to the particular usage and the number of items on which you're calculating a CRC.
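As a rough way to put numbers on "how unique", the usual birthday approximation can be sketched like this. It assumes check values behave as if they were uniformly distributed, which is an idealization, and `collision_probability` is a hypothetical helper name.

```python
import math


def collision_probability(n_items: int, crc_bits: int) -> float:
    """Approximate chance that at least two of n_items share a check value.

    Uses the birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)) and
    assumes uniformly distributed check values.
    """
    pairs = n_items * (n_items - 1) / 2
    return -math.expm1(-pairs / 2 ** crc_bits)


for bits in (8, 16, 32):
    for n_items in (100, 10_000, 1_000_000):
        p = collision_probability(n_items, bits)
        print(f"{bits:2d}-bit CRC, {n_items:>9,} items: {p:.6f}")
```

Under that assumption, a 32-bit check value reaches roughly even odds of at least one collision at around 77,000 items, regardless of how large each item is.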

Samuel Neff
3

The CRC should be chosen specifically for the length of the messages; it is not just a question of the size of the CRC: http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

starblue
2

Here is a nice "real world" evaluation of CRC-N http://www.backplane.com/matt/crc64.html

I use CRC-32 and file-size comparison and have NEVER, in the billions of files checked, run into a matching CRC-32 and File-Size collision. But I know a few exist, when not purposely forced to exist. (Hacked tricks/exploits)

When doing comparison, you should ALSO be checking "data-sizes". You will rarely have a collision of the same data-size, with a matching CRC, within the correct sizes.

Purposely manipulated data, meant to fake a match, is usually produced by adding extra data until the CRC matches a target. However, that results in a data-size that no longer matches. Attempting to brute-force, or cycle through random or sequential data, of the same exact size would leave a very narrow collision rate.

You can also have collisions within the data-size, just by the generic limits of the formulas used, and constraints of using bits/bytes and base-ten systems, which depends on floating-point values, which get truncated and clipped.

The point at which you would want to think about going larger is when you start to see many collisions which cannot be "confirmed" as "originals" (when they both have the same data-size and, when tested backwards, they also have a matching CRC: reverse-byte, reverse-bit, or bit-offset).

In any event, it should NEVER be used as the ONLY form of comparison, just for a quick form of comparison, for indexing.
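A minimal sketch of that "index first, confirm afterwards" approach, using Python's standard `zlib.crc32`. The names `fingerprint` and `find_duplicates`, and the in-memory `blobs` dictionary standing in for files, are assumptions made for illustration.

```python
import zlib
from collections import defaultdict


def fingerprint(data: bytes) -> tuple:
    """Cheap index key: (size in bytes, CRC-32 of the content)."""
    return len(data), zlib.crc32(data)


def find_duplicates(blobs: dict) -> list:
    """Return pairs of names whose contents are byte-for-byte identical."""
    buckets = defaultdict(list)
    for name, data in blobs.items():
        buckets[fingerprint(data)].append(name)

    duplicates = []
    for names in buckets.values():
        # A shared fingerprint only nominates candidates;
        # the full comparison is what confirms a real duplicate.
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                if blobs[first] == blobs[second]:
                    duplicates.append((first, second))
    return duplicates


blobs = {"a.txt": b"hello", "b.txt": b"hello", "c.txt": b"world"}
print(find_duplicates(blobs))
# prints: [('a.txt', 'b.txt')]
```

The CRC and size only decide which bucket to look in, so a rare CRC-32 collision costs one extra byte comparison instead of producing a wrong answer.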

You can use a CRC-8 to index the whole internet, and divide everything into one of N categories. You WANT those collisions. Now, with those pre-sorted, you only have to check one of N directories, looking for "file-size", or "reverse-CRC", or whatever other comparison you can do to that smaller data-set, fast...

Doing a CRC-32 forwards and backwards on the same blob of data is more reliable than using CRC-64 in just one direction. (Or an MD5, for that matter.)

JD_Mortal
  • By doing a CRC-32 forwards and backwards, do you mean doing the CRC two times on a file? – Arash Jul 06 '20 at 06:19
  • Yes, @Arash it seems he means a file. An advantage of CRC32 or MD5 is that they can be calculated as the data passes. Reversing the data means you have to store it all buffered until you go back through the bits in reverse order. MD5 is more calculation intensive. It is designed more for signing a message than for checking for errors, because with a CRC it is easier to contrive a set of data that will match a particular check value. – Ted Shaneyfelt Oct 17 '21 at 20:36
  • It is possible to calculate such a "reverse CRC" in the forward direction. You don't need to buffer. – Mark Adler Apr 17 '22 at 05:19
0

You can detect a single bit error with a CRC in any size packet. Detecting double bit errors or correcting single bit errors is limited by the number of distinct values the CRC can take: 256 for 8 bits, 65,536 for 16 bits, and 2^n in general. In practice, though, CRCs actually take on fewer distinct values for single bit errors. For example, the 0x5935 polynomial, which I call the 'Y5' polynomial, only takes on up to 256 different values before they repeat going back farther, but on the other hand it is able to correct double bit errors over that distance, which is 30 bytes plus 2 bytes for errors in the CRC itself.

The number of bits you can correct with forward error correction is also limited by the Hamming distance of the polynomial. For example, if the Hamming distance is three, you have to flip three bits to change from a set of bits that represents one valid message with matching CRC to another valid message with its own matching CRC. If that is the case, you can correct one bit with confidence. If the Hamming distance were 5, you could correct two bits. But when correcting multiple bits, you are effectively indexing multiple positions, so you need twice as many bits to represent the indexes of two corrected bits rather than one.

With forward error correction, you calculate the CRC on a packet and CRC together, and get a residual value. A good message with zero errors will always have the expected residual value (zero unless there's a nonzero initial value for the CRC register), and each bit position of error has a unique residual value, so use it to identify the position. If you ever get a CRC result with that residual, you know which bit (or bits) to flip to correct the error.
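Here is a small sketch of that residual lookup for single-bit errors. It is not the 'Y5' setup described above: it assumes a plain CRC-8 with polynomial 0x07, zero initial value and no final XOR, an 11-byte codeword (10 data bytes plus the CRC byte), and helper names of my own (`crc8`, `build_syndrome_table`, `correct_single_bit`).

```python
def crc8(data: bytes, poly: int = 0x07) -> int:
    """Illustrative bitwise CRC-8: MSB first, zero init, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


def build_syndrome_table(codeword_len: int) -> dict:
    """Map the residual of each single-bit error to its bit position."""
    table = {}
    for bit in range(codeword_len * 8):
        error = bytearray(codeword_len)
        error[bit // 8] = 0x80 >> (bit % 8)
        table[crc8(bytes(error))] = bit
    return table


def correct_single_bit(codeword: bytes, table: dict) -> bytes:
    """Return the codeword with at most one flipped bit repaired."""
    residual = crc8(codeword)  # CRC over data + CRC byte together
    if residual == 0:
        return codeword  # expected residual: nothing to fix
    bit = table.get(residual)
    if bit is None:
        raise ValueError("residual does not match any single-bit error")
    fixed = bytearray(codeword)
    fixed[bit // 8] ^= 0x80 >> (bit % 8)
    return bytes(fixed)


message = b"0123456789"
codeword = message + bytes([crc8(message)])
table = build_syndrome_table(len(codeword))

damaged = bytearray(codeword)
damaged[3] ^= 0x10  # flip one bit somewhere in the data
print(correct_single_bit(bytes(damaged), table) == codeword)
# prints: True
```

This only works while every single-bit position has its own residual, which limits the codeword length for a given polynomial. An error of two or more bits can still be mis-corrected or rejected by this sketch.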

Ted Shaneyfelt