147

When is it appropriate to use CRC for error detection versus more modern hashing functions such as MD5 or SHA1? Is the former easier to implement on embedded hardware?

Gili
  • 86,244
  • 97
  • 390
  • 689

14 Answers

132

CRC works fine for detecting random errors in data that might occur, for example, from network interference, line noise, distortion, etc.

CRC is computationally much less complex than MD5 or SHA1. Using a hash function like MD5 is probably overkill for random error detection. However, using CRC for any kind of security check would be much less secure than a more complex hashing function such as MD5.

And yes, CRC is much easier to implement on embedded hardware; you can even get off-the-shelf packaged IC solutions for it.
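
A minimal bit-by-bit CRC-32 sketch (in PHP, since the other answers here use PHP for examples; assuming the standard reflected polynomial 0xEDB88320 with 0xFFFFFFFF init/xor-out, as used by Ethernet and zlib). The whole algorithm is a shift-and-XOR loop, which is why it ports so easily to C on a microcontroller or to a handful of logic gates:

// Illustrative sketch: bit-by-bit CRC-32 (reflected, poly 0xEDB88320, init/xorout 0xFFFFFFFF)
function crc32_bitwise(string $data): int
{
    $crc = 0xFFFFFFFF;
    for ($i = 0, $n = strlen($data); $i < $n; $i++) {
        $crc ^= ord($data[$i]);
        for ($bit = 0; $bit < 8; $bit++) {
            // Shift out one bit; XOR in the polynomial when the shifted-out bit was 1.
            $crc = ($crc & 1) ? (($crc >> 1) ^ 0xEDB88320) : ($crc >> 1);
        }
    }
    return ($crc ^ 0xFFFFFFFF) & 0xFFFFFFFF;
}

var_dump(crc32_bitwise('123456789') === crc32('123456789')); // bool(true)

In practice you would use a 256-entry lookup table (or the CRC peripheral many MCUs provide) instead of the inner bit loop, but the table is generated from exactly this logic.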

Update Yes, this answer is old. Please don't use SHA1 or MD5 for security purposes ;)

defines
  • 10,229
  • 4
  • 40
  • 56
  • Is there a way of getting a 32-bit MD5 or SHA1 hash? As you said, 128 bits for error detection sounds a bit overkill. Why don't low-level communication protocols use MD5 or SHA1? Is it because the hash size is too big relative to the packet size? – Gili Jun 15 '09 at 16:57
  • 1
    @gili: you can always just xor the dwords together to get a single resulting dword. – Blindy Jun 17 '09 at 09:09
  • 29
    To reduce any long hash to 32 bits, just take the first 32 bits. – orip May 24 '10 at 20:44
  • 2
    If security is your goal then you should never ever use `MD5`; `SHA-1` should also be avoided; some variant of `SHA-2` is recommended. – Peter Mar 17 '17 at 10:26
  • @defines: The computational complexity of MD5 is O(n) according to this thread (https://stackoverflow.com/questions/43625569/time-complexity-of-md5). What is the computational complexity of CRC? Given that it must look at all the data, how could it be better than O(n)? I am just asking since you write that MD5 was more complex than CRC, and I wasn't sure if you meant computational complexity. – avgJoe Jan 18 '21 at 20:02
  • 1
    Big-O notation is only part of the picture. There is no such thing as O(2N) because the constant multiples have no meaning in big-O notation. So even though it is a linear-complexity algorithm, there are more (constant-numbered) steps per iteration. Perhaps complexity was also not the right word :) https://en.wikipedia.org/wiki/Big_O_notation#Multiplication_by_a_constant – defines Jan 19 '21 at 21:33
  • CRCs are not just "fine" for detecting random errors. They are better at detecting random errors than random hashes of the same length. At short lengths, they are so much better than you'd be crazy to use a random hash. I just wrote this up in [another answer](/a/72580031). – benrg Jun 10 '22 at 21:37
  • I'm not sure what you mean by random hashes, but CRC is of fixed length and is essentially a hashing function. If the error exists in the check value, the CRC will be invalid anyway. A higher-length function will definitely be "better" (more effective at error detection) but also probably overkill, requiring more instructions. – defines Jun 12 '22 at 02:22
39

CRC is designed to catch unintentional changes in the data. That is, it's good for detecting unintentional errors, but it is useless as a way of making sure data was not maliciously tampered with.

Also see this.

Socowi
  • 25,550
  • 3
  • 32
  • 54
Liran Orevi
  • 4,755
  • 7
  • 47
  • 64
  • Most important part from the link in this answer: "(...) even a 2048-bit CRC would be cryptographically much less secure than a 128-bit MD5" – Marc.2377 May 04 '16 at 21:52
  • 4
    While the answer is still correct, MD5 and SHA1 are at the same level of security nowadays. In other words, only good for detecting unintentional errors. – Piskvor left the building Feb 24 '17 at 09:17
23

I found a study that shows how inappropriate CRC hashes are for hash tables. It also explains the actual characteristics of the algorithm. The study also includes evaluation of other hash algorithms and is a good reference to keep.

UPDATE

It seems the site is down. The internet archive has a copy though.

UPDATE 2

Oh dear. It turns out the study may have been faulty in its conclusions on CRC for use as a hash. Thanks @minexew for the link.

Andre Luus
  • 3,692
  • 3
  • 33
  • 46
  • Link is broken. Maybe you can write the explanation yourself? If not the answer is useless. – ceving Jan 20 '14 at 14:54
  • Okay, I'll include the conclusion in my answer. – Andre Luus Jan 21 '14 at 06:47
  • Weird, according to the benchmark [here](http://programmers.stackexchange.com/a/145633/161451), CRC actually does pretty well in terms of speed and number of collisions. – ostrokach Sep 15 '15 at 10:41
  • Very interesting indeed. I had to look over the study I linked to again, but if I had to guess it must be because of the different testing implementations. If I had to make a decision, I'd go for the advice from the study, it appears to be more scientifically sound. – Andre Luus Sep 15 '15 at 14:41
  • In my experience hashing millions of URLs, CRC64 collided 8 times and MD5 collided 5. Obviously MD5 was better, but CRC64 was a great and much faster and simpler hash. – J. Dimeo Jun 12 '17 at 04:09
  • CRC as a hash is merely a side effect of its decent error-detection properties. Unless you're expecting burst or single bit errors, any non-crypto hash can be much faster than CRC. – bryc Apr 16 '18 at 14:05
  • CRC can show artifacts for very short message sizes, which would be the case when using it for hash tables. That said, I use them for hash tables, but not blindly. – ilgitano Aug 15 '18 at 05:41
  • Bret's study has been debunked, see: https://github.com/Michaelangel007/crc32/blob/master/README.md#crc32-hashing-confusion – minexew Jun 26 '21 at 17:07
  • Well - how about that @minexew. I'll update my answer. – Andre Luus Jul 05 '21 at 16:19
20

I ran every line of this PHP code in a 1,000,000-iteration loop. Results are in the comments (#).

hash('crc32', 'The quick brown fox jumped over the lazy dog.');#  750ms   8 chars
hash('crc32b','The quick brown fox jumped over the lazy dog.');#  700ms   8 chars
hash('md5',   'The quick brown fox jumped over the lazy dog.');#  770ms  32 chars
hash('sha1',  'The quick brown fox jumped over the lazy dog.');#  880ms  40 chars
hash('sha256','The quick brown fox jumped over the lazy dog.');# 1490ms  64 chars
hash('sha384','The quick brown fox jumped over the lazy dog.');# 1830ms  96 chars
hash('sha512','The quick brown fox jumped over the lazy dog.');# 1870ms 128 chars
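
For reference, a sketch of the kind of timing harness that could produce numbers like those above (the exact loop isn't shown in the answer, so treat the details as an assumption):

// Hypothetical timing harness; the original loop isn't shown in the answer.
$msg = 'The quick brown fox jumped over the lazy dog.';
foreach (['crc32', 'crc32b', 'md5', 'sha1', 'sha256', 'sha384', 'sha512'] as $algo) {
    $start = microtime(true);
    for ($i = 0; $i < 1000000; $i++) {
        $digest = hash($algo, $msg);
    }
    $elapsed = (microtime(true) - $start) * 1000; // milliseconds
    printf("%-6s %7.0f ms  %d chars\n", $algo, $elapsed, strlen($digest));
}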

My conclusion:

  • Use "crc32b" when you need http://en.wikipedia.org/wiki/Cyclic_redundancy_check and you do not care about security.
  • Use "sha256" (or higher) when you need added security layer.

  • Do not use "md5" or "sha1" because they have:

    1. some security issues when you care about security
    2. longer hash string and are slower than "crc32b" when all you need is CRC
Martin
  • 1,312
  • 1
  • 15
  • 18
  • 1
    Not really. **echo hash('crc32', 'The quick brown fox jumped over the lazy dog.');** echoes "413a86af", which is an 8-character string. Btw, it is a 32-bit number stored in hex format. For example, "sha256" produces a 256-bit hash, again stored as hex, which gives a 64-character string. – Martin Aug 06 '12 at 12:28
  • 50
    These results are very deceiving. When these hashing algorithms are applied to a large data set ([War and Peace](http://en.wikipedia.org/wiki/War_and_Peace) instead of `"The quick brown fox jumped over the lazy dog."`) you would see how much faster CRC is than MD5. – ubiquibacon Aug 07 '12 at 14:50
  • 2
    There is an intermediate case (duplicate checking in libraries) where MD5/Sha1 are the correct solution: they don't need to handle the case where there's an adversary carefully crafting the vanishingly unlikely hash collision, but they do need to handle accidental collisions. So: Detecting bit errors and corruption: CRC32 Detecting collisions in libraries: MD5/SHA1 Adversarial applications: Sha256 and above. Of course, if you have a library with billions of entries, then you will probably need to increase your hash bits, too. – Dewi Morgan May 24 '14 at 23:29
  • PHP? On an ARM platform, embedded code at 16 MHz, a CRC32 of 46 bytes takes maybe 12 microseconds. That has hardware assist. Even hardware-assisted AES would be several hundred times slower. An unassisted lookup-table CRC should still come in around 50 microseconds. – ilgitano Aug 15 '18 at 05:48
13

It all depends on your requirements and expectations.

Here is a quick summary of the differences between these hash algorithms:

CRC (CRC-8/16/32/64)

  • is not a cryptographic hashing algorithm (it uses a linear function based on cyclic redundancy checks)
  • produces an 8-, 16-, 32- or 64-bit check value (the generator polynomials are 9, 17, 33 or 65 bits long)
  • not intended to be used for cryptographic purposes, since it makes no cryptographic guarantees,
  • unsuitable for use in digital signatures, because it's easily reversible [2006],
  • should not be used for encryption purposes,
  • different strings can generate collisions,
  • invented in 1961 and used in Ethernet and many other standards,

MD5

  • is a cryptographic hash algorithm,
  • producing a 128-bit (16-byte) hash value (a 32-digit hexadecimal number)
  • it is a cryptographic hash, but is considered deprecated if you worry about security,
  • there are known strings which have the same MD5 hash value
  • can be used for cryptographic purposes (though this is no longer recommended),

SHA-1

  • is a cryptographic hash algorithm,

  • produces a 160-bit (20-byte) hash value known as a message digest

  • it is a cryptographic hash and since 2005 it's no longer considered secure,

  • can be used for cryptographic purposes,

  • an example of a SHA-1 collision has been found (the 2017 SHAttered attack),

  • first published in 1993 (as SHA-0), then 1995 as SHA-1,

  • series: SHA-0, SHA-1, SHA-2, SHA-3,

    In summary, using SHA-1 is no longer considered secure against well-funded opponents: in 2005, cryptanalysts found attacks on SHA-1 which suggest it may not be secure enough for ongoing use [Schneier]. U.S. NIST advised that federal agencies should stop using SHA-1 for applications which require collision resistance and must use SHA-2 after 2010 [NIST].

Therefore, if you're looking for a simple and quick solution for checking the integrity of files (against corruption), or for some simple caching purposes where performance matters, you can consider CRC-32; for general hashing you may consider MD5. However, if you're developing a professional application (which should be secure and consistent), then to avoid collision risks use SHA-2 and above (such as SHA-3).
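
As a rough illustration of that advice (the file name and expected digests below are placeholders, not from the original answer), a file check with hash_file():

// Illustrative only: file name and expected digests are placeholders.
$file = 'backup.tar.gz';

// Cheap corruption check; no guarantees against deliberate tampering:
$expectedCrc = 'a1b2c3d4';
if (hash_file('crc32b', $file) !== $expectedCrc) {
    echo "File is corrupted.\n";
}

// Integrity check that also resists tampering, provided the expected
// digest itself comes from a trusted source:
$expectedSha256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855';
if (!hash_equals($expectedSha256, hash_file('sha256', $file))) {
    echo "File failed SHA-256 verification.\n";
}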

Performance

Some simple benchmark test in PHP:

# Testing static text.

$ time php -r 'for ($i=0;$i<1000000;$i++) crc32("foo");'
real    0m0.845s
user    0m0.830s
sys     0m0.008s

$ time php -r 'for ($i=0;$i<1000000;$i++) md5("foo");'
real    0m1.103s
user    0m1.089s
sys     0m0.009s

$ time php -r 'for ($i=0;$i<1000000;$i++) sha1("foo");'
real    0m1.132s
user    0m1.116s
sys   0m0.010s

# Testing random number. 

$ time php -r 'for ($i=0;$i<1000000;$i++) crc32(rand(0,$i));'
real    0m1.754s
user    0m1.735s
sys     0m0.012s

$ time php -r 'for ($i=0;$i<1000000;$i++) md5(rand(0,$i));'
real    0m2.065s
user    0m2.042s
sys     0m0.015s

$ time php -r 'for ($i=0;$i<1000000;$i++) sha1(rand(0,$i));'
real    0m2.050s
user    0m2.021s
sys     0m0.015s

Related:

Community
  • 1
  • 1
kenorb
  • 155,785
  • 88
  • 678
  • 743
  • 1
    The benchmarks are useless because the strings are so short; the run time is dominated by unrelated overhead. You failed to explain the crucial fact that CRCs are better at detecting uncommon, random errors than random hashes are: see [this answer](/a/72580031). – benrg Jun 10 '22 at 21:42
11

For CRC information on implementation, speed and reliability see A painless guide to CRC error detection algorithms. It has everything on CRCs.

Unless somebody is going to try to modify your data maliciously and hide the change, a CRC is sufficient. Just use a "good" (standard) polynomial.

Gerhard
  • 6,850
  • 8
  • 51
  • 81
  • 1
    CRCs are not just "sufficient"; they're far superior to random hashes at short lengths for certain types of errors. I just wrote [another answer](/a/72580031) about this. – benrg Jun 10 '22 at 21:39
8

You do not say what it is that you are trying to protect.

A CRC is often used in embedded systems as a check against accidental data corruption, as opposed to preventing malicious system modification. An example of a place where a CRC can be useful is validating an EPROM image during system initialisation to guard against firmware corruption. The system bootloader will calculate the CRC for the application code and compare it with the stored value before allowing the code to run. This protects against the possibility of accidental program corruption or a failed download.

A CRC can also be used in a similar manner to protect configuration data stored in FLASH or EEPROM. If the CRC is incorrect then the data can be flagged as invalid and a default or backup data set used. The CRC may be invalid due to device failure or if the user removed power during an update of the configuration data store.
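
A rough sketch of that store-then-validate pattern, written in PHP for consistency with the other answers here (on a real MCU this would be C reading flash or EEPROM, and the record layout below is made up for illustration):

// Hypothetical layout: 4-byte little-endian CRC-32 followed by the config payload.
function storeConfig(string $payload): string
{
    return pack('V', crc32($payload)) . $payload;
}

function loadConfig(string $blob): ?string
{
    if (strlen($blob) < 4) {
        return null;                                  // too short to contain a CRC
    }
    ['crc' => $stored] = unpack('Vcrc', $blob);
    $payload = substr($blob, 4);
    // Reject the record if the stored CRC doesn't match; caller falls back to defaults.
    return crc32($payload) === $stored ? $payload : null;
}

$record = storeConfig('{"baud":115200,"retries":3}');
var_dump(loadConfig($record));                                          // the payload
var_dump(loadConfig(substr($record, 0, 4) . 'X' . substr($record, 5))); // NULL: corrupted record rejected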

There have been comments that a hash provides a greater probability of detecting corruption than a CRC when multiple bits are in error. This is true, and the decision on whether or not to use a 16- or 32-bit CRC will hinge upon the safety consequences of a corrupted data block being used and whether you can justify the 1 in 2^16 or 2^32 chance of a data block being incorrectly declared valid.

Many devices have a built-in CRC generator for standard algorithms. The MSP430F5x series from Texas Instruments has a hardware implementation of the CRC-CCITT standard.

uɐɪ
  • 2,540
  • 1
  • 20
  • 23
  • "This is true" - no, the exact opposite is true. See [this answer](/a/72580031). – benrg Jun 10 '22 at 21:45
  • @benrg Doesn't your answer only apply to when you truncate a hash? They never said that truncating hashes is a good idea. – TheTechRobo the Nerd Aug 18 '23 at 20:43
  • 1
    @TheTechRobotheNerd True, but it hardly seems fair to compare a short CRC to a long cryptographic hash, if that's what they're doing. A CRC always outperforms a cryptographic hash of the same length. Also, they explicitly said that a CRC-16 or CRC-32 has a 1 in 2^16 or 2^32 chance of missing an error, which is incorrect for typical error distributions. – benrg Aug 18 '23 at 22:17
  • @benrg Yeah, that makes sense. – TheTechRobo the Nerd Aug 18 '23 at 23:50
5

Let's start with the basics.

In cryptography, a hashing algorithm converts many bits to fewer bits through a digest operation. Hashes are used to confirm the integrity of messages and files.

All hashing algorithms generate collisions: a collision is when several different many-bit inputs produce the same shorter output. The cryptographic strength of a hashing algorithm is defined by how hard it is to predict or engineer the output for a given input; if an attacker could do so, they could construct a file whose hash matches a legitimate file and compromise the assumed integrity of the system. The difference between CRC32 and MD5 is that MD5 generates a larger hash that is much harder to predict.

When you want to implement message integrity - meaning the message hasn't been tampered with in transit - the inability to predict collisions is an important property. A 32-bit hash can distinguish at most about 4 billion different messages or files; if you have 4 billion and one files, you are guaranteed at least one collision, and a terabyte-scale corpus can produce many more. If I'm an attacker and I can predict what that 32-bit hash is going to be, I can construct an infected file that collides with the target file, i.e. has the same hash.

Additionally, if I'm doing 10 Mbps transmission then the chance of a packet getting corrupted in just the right way to bypass CRC32 and continue along to the destination and execute is very low. Let's say at 10 Mbps I get 10 errors per second. If I ramp that up to 1 Gbps, now I'm getting 1,000 errors per second, and at 1 exabit per second I have an error rate of 1,000,000,000 errors per second. Say we have a collision rate of 1 in 1,000,000 transmission errors, meaning one in a million transmission errors results in the corrupt data getting through undetected. At 10 Mbps I'd get bad data through every 100,000 seconds, or about once a day. At 1 Gbps it'd happen about every 1,000 seconds, or roughly every 17 minutes. At 1 exabit per second, we're talking roughly a thousand times a second.

If you pop open Wireshark you'll see your typical Ethernet frame carries a CRC32 (the frame check sequence), and your IP and TCP headers each carry a 16-bit checksum, and that's in addition to what the higher layer protocols may do; e.g. IPsec might use MD5 or SHA for integrity checking in addition to the above. There are several layers of error checking in typical network communications, and they STILL goof now and again at sub-10 Mbps speeds.

Cyclic Redundancy Check (CRC) has several common versions and several uncommon ones, but it is generally designed just to tell when a message or file has been damaged in transit (multiple bits flipping). CRC32 by itself is not a very good fingerprint by today's standards in large enterprise environments because of the collision rate: the average user's hard drive can have upwards of 100k files, and file shares at a company can have tens of millions. The ratio of hash space to the number of files is just too low. CRC32 is computationally cheap to compute, whereas MD5 isn't.
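
To put a rough number on that collision-rate concern (my own back-of-the-envelope sketch using the birthday approximation, not part of the original answer):

// Expected number of colliding pairs among $n files with a $bits-bit fingerprint
// is roughly n(n-1)/2 divided by 2^bits (birthday approximation).
function expectedCollisions(int $n, int $bits): float
{
    return $n * ($n - 1) / 2 / (2 ** $bits);
}

printf("100k files, CRC32: %.2f expected collisions\n", expectedCollisions(100000, 32));
printf("10M files,  CRC32: %.0f expected collisions\n", expectedCollisions(10000000, 32));
printf("10M files,  MD5:   %.2e expected collisions\n", expectedCollisions(10000000, 128));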

MD5 was designed to stop intentional use of collisions to make a malicious file look benign. It's considered insecure because the hashspace has been sufficiently mapped to enable some attacks to occur, and some collisions are predictable. SHA1 and SHA2 are the new kids on the block.

For file verification, MD5 is starting to be used by a lot of vendors because you can hash multigigabyte or multiterabyte files quickly with it, and stack that on top of the general OS use and support of CRC32s. Do not be surprised if, within the next decade, filesystems start using MD5 for error checking.

SensorSmith
  • 1,129
  • 1
  • 12
  • 25
bobinator
  • 51
  • 1
  • 1
5

I came across a smart use of CRC recently. The author of the jdupe file duplicate identification and removal tool (the same author as the popular EXIF tool jhead) uses it during the first pass through the files. A CRC is computed on the first 32K of each file to mark files that appear to be the same; the files must also have the same size. These files are added to a list of files on which to do a full binary comparison. It speeds up checking large media files.
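
A rough reconstruction of that pre-pass idea (my own sketch, not jdupe's actual code): bucket files by size plus the CRC-32 of their first 32 KiB, and only byte-compare files that land in the same bucket:

// Sketch of the idea, not jdupe's actual code; the glob pattern is just an example.
function duplicateCandidates(array $paths): array
{
    $buckets = [];
    foreach ($paths as $path) {
        $head = file_get_contents($path, false, null, 0, 32 * 1024); // first 32 KiB
        $key  = filesize($path) . ':' . hash('crc32b', $head);
        $buckets[$key][] = $path;
    }
    // Only buckets with 2+ members need the expensive full binary comparison.
    return array_filter($buckets, fn($group) => count($group) > 1);
}

foreach (duplicateCandidates(glob('/media/photos/*.jpg')) as $group) {
    // ... full byte-for-byte comparison within $group goes here ...
    echo implode(' <-> ', $group), "\n";
}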

John Wright
  • 2,418
  • 4
  • 29
  • 34
  • One problem with that approach is when run on a file which contains an embedded CRC32 within it, the resulting CRC may be independent of the data in the file (since if the data changes, the CRC32 will be changed so as to cancel out the difference). Munging the data in some simple way before computing the CRC32 would avoid that problem. – supercat Mar 29 '11 at 18:56
  • 2
    @supercat - I really don't believe that this is actually an issue. If a file contains a crc32 header which is the crc32 of the rest of the file, then when the file is updated each bit in the crc32 header will have approximately a 50% chance of being different. The changes in the header should follow a fairly random distribution. I fail to see how this is going to result in the CRC32(header + data) always being the same, or in any way not dependent on the data portion of the file. – teratorn May 28 '11 at 12:54
  • @teratorn: I've seen a number of files which have a CRC32 at the end, computed in such a way that the CRC32 of the entire file, computed using some particular seed constant, will always be some other constant value. This is pretty common with things like binary code images. If the Acme 1000 DVD player uses fixed-size code images for firmware upgrades, and expects every code image to have a certain CRC32, then a routine which computes the CRC32's of various files would be unable to distinguish different code images for the Acme 1000. – supercat May 28 '11 at 23:18
  • The point of the CRC in that case is to quickly identify that the files are different. If the CRC comes back the same, you now have to do an expensive binary compare, so an embedded CRC doesn't break the algorithm. It could happen that some files end up being binary compared because the CRC first pass says they MIGHT be the same, but there are unlikely to be many of those, and you can avoid it by using a custom polynomial. – ilgitano Aug 15 '18 at 06:01
5

Short CRCs are much better than pseudorandom hashes of the same length at detecting random errors.

This question has accumulated a large number of answers, most unnecessary, over the years, but none has yet pointed out this crucial fact. You should never use a short random hash (such as truncated MD5 or SHA-1) to catch occasional flipped bits, even if you can afford the computational cost, because the false-negative rate will be high.

Here's a worked example. Say your messages are 12 octets (96 bits) of payload plus 1 octet for error detection. Also suppose that each bit has an independent 1-in-10,000 chance of being corrupted (flipped) in transit. Note that this means roughly 1% of packets will have at least one flipped bit, roughly 0.01% of packets will have at least 2 flipped bits, and so on.

If the error-detection bits are a pseudorandom hash (such as MD5 or SHA-1 truncated to 8 bits), then corruption confined to the check bits will always be detected, while corruption not confined to those bits will be detected around 255/256 of the time. In all, roughly (12/13)×(1/256) ≈ 0.36% of all corrupted packets will evade detection.

If the error-detection bits are a simple checksum (sum of the other bytes mod 256), then all single-bit-flip errors (99% of the total) will be detected, and of the remaining 1%, better than 7/8 will be detected. Less than 0.13% of corrupted packets will be missed. So even a simple checksum outperforms a random hash.

If the error-detection bits are a CRC-8 with an appropriately chosen polynomial (such as CRC-8-CCITT), then all errors of 1, 2, or 3 flipped bits will be detected, and roughly 127/128 of other errors will be detected. Less than 0.00000002% of corrupted packets will be missed.
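
To make this concrete, here is a small sketch in PHP (assuming the 13-octet packets above and the polynomial 0x07, the ATM HEC polynomial often labelled CRC-8-CCITT): it exhaustively flips every 1- and 2-bit pattern in a codeword and confirms the CRC catches all of them; the 3-bit case holds as well but takes longer to enumerate.

// Bit-by-bit CRC-8, polynomial x^8+x^2+x+1 (0x07), init 0, no final XOR.
function crc8(string $data): int
{
    $crc = 0;
    foreach (str_split($data) as $byte) {
        $crc ^= ord($byte);
        for ($b = 0; $b < 8; $b++) {
            $crc = ($crc & 0x80) ? ((($crc << 1) ^ 0x07) & 0xFF) : (($crc << 1) & 0xFF);
        }
    }
    return $crc;
}

// With init 0 and no final XOR, a valid codeword (payload . CRC byte) re-checks to 0.
$payload  = random_bytes(12);                 // 96-bit payload, as in the example
$codeword = $payload . chr(crc8($payload));
assert(crc8($codeword) === 0);

$bits   = 8 * strlen($codeword);              // 104 bits
$missed = 0;
for ($i = 0; $i < $bits; $i++) {
    for ($j = $i; $j < $bits; $j++) {         // $j === $i covers the single-bit flips
        $bad = $codeword;
        $bad[intdiv($i, 8)] = chr(ord($bad[intdiv($i, 8)]) ^ (1 << ($i % 8)));
        if ($j !== $i) {
            $bad[intdiv($j, 8)] = chr(ord($bad[intdiv($j, 8)]) ^ (1 << ($j % 8)));
        }
        if (crc8($bad) === 0) {               // the error went undetected
            $missed++;
        }
    }
}
echo "Undetected 1- and 2-bit errors: $missed\n"; // prints 0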

CRCs are not just used because they're fast to compute (although they are—especially in hardware) but also because they're really really good at detecting certain types of errors. Even if you're working with hardware that can compute a truncated MD5 faster than it can compute a CRC-8, you should probably still use the CRC.


If you have far more space for the checksum – 128 bits, say – then the situation is different. A CRC-128 still has a theoretical advantage over a 128-bit random hash, but the false negative rate of the random hash (about 2^−128) is already so low that it may as well be taken to be zero; there is no real benefit to making it any lower. If you can afford to use an MD5 hash in this situation then you may as well use it.

If you're trying to detect maliciously introduced errors then things become much more complicated. It's necessary in that situation to use some sort of cryptographic hash (not a CRC) but it's far from sufficient. If you really need to design a protocol that's safe against malicious interference then you should ask about it at the Cryptography Stack Exchange. Don't assume that using a modern hash like SHA-3 or BLAKE2 is enough to keep you safe. It likely isn't.

benrg
  • 1,395
  • 11
  • 13
5

CRC32 is faster and the hash is only 32 bits long.

Use it when you just want a quick and light checksum. CRC is used in Ethernet.

If you need more reliability it's preferable to use a modern hashing function.

François
  • 935
  • 1
  • 7
  • 11
5

Only use CRC if computation resources are very tight (i.e. some embedded environments) or you need to store/transport many output values and space/bandwidth is tight (as CRCs are usually 32 bits where an MD5 output is 128 bits, SHA-1 is 160 bits, and other SHA variants are up to 512 bits).

Never use CRC for security checks as a CRC is very easy to "fake".

Even for accidental error detection (rather than malicious change detection) hashes are better than a simple CRC. Partly because of the simple way a CRC is calculated (and partly because CRC values are usually shorter than common hash outputs so have a much smaller range of possible values) it is much more likely that, in a situation where there are two or more errors, one error will mask another so you end up with the same CRC despite two errors.

In short: unless you have reason not to use a decent hash algorithm, avoid simple CRCs.

David Spillett
  • 1,401
  • 11
  • 21
  • 1
    CRC will catch all accidental data changes if you are using a proper polynomial. 1/2^32 changes are missed if exactly the right multiple bits are changed. – Gerhard Jun 17 '09 at 09:03
  • 1
    And with a proper polynomial it will also catch all errors of certain common classes, e.g. burst errors. – erikkallen Jun 17 '09 at 09:35
  • I'd agree with your answer except the question is about embedded systems. Performance of a cryptographic algorithm can be problematic on smaller embedded systems. – Craig McQueen Jun 29 '09 at 11:55
  • 2
    Would absolutely disagree with that. CRC error polynomials are carefully chosen so that they can provably detect 1-, 2-, 3-, 5-bit and burst errors up to something like 11 bits in some cases. A cryptographic hash is purely statistical, so you have to use large digest values. 8-32 bits is unrealistic for a cryptographic hash digest as well as pointlessly expensive in CPU cycles and gates. Definitely not an answer to take on board if you work on embedded systems. The only time NOT to use a CRC is if you have to deal with an intelligent adversary scenario. – ilgitano Aug 15 '18 at 05:55
4

CRC32 is way faster and sometimes has hardware support (e.g. on Nehalem processors). Really, the only time you'd use it is if you're interfacing with hardware, or if you're really tight on performance.

Ana Betts
  • 73,868
  • 16
  • 141
  • 209
0

CRC code is simpler and faster.

What do you need it for?

Macarse
  • 91,829
  • 44
  • 175
  • 230