
Can CRC32 be used as a hash function? Any drawbacks to this approach? Any tradeoffs?

BenMorel
Pradyot
  • This already seems to have been asked: http://stackoverflow.com/questions/2694740/can-one-construct-a-good-hash-function-using-crc32c-as-a-base?rq=1 – Pradyot Jun 08 '12 at 18:15
  • That depends on what you want to use the hash for. – Gumbo Jun 08 '12 at 18:15
  • For some subset of the set hash, yes. However, it's not a block code, it's a stream code. For very small blocks it's quicker to use a table. – starbolin Jun 08 '12 at 18:23
  • See also: [How is a CRC32 checksum calculated?](https://stackoverflow.com/questions/2587766/how-is-a-crc32-checksum-calculated), and [hash function for string](https://stackoverflow.com/q/7666509/4561887) – Gabriel Staples Feb 22 '22 at 16:51

3 Answers


CRC32 works very well as a hash algorithm. The whole point of a CRC is to hash a stream of bytes with as few collisions as possible. That said, there are a few points to consider:

  • CRCs are not secure. For secure hashing you need a much more computationally expensive algorithm.

  • Different CRC flavors exist with different properties. Make sure you use the right algorithm, e.g. the CRC32C polynomial (0x11EDC6F41), which is a good general-purpose choice.

  • As a hashing speed/quality trade-off, the x86 CRC32 instruction is tough to beat. However, this instruction doesn't exist in older CPUs, so beware of portability problems.

---- EDIT ----

Mark Adler provided a link to a useful article on hash evaluation by Bret Mulvey. Using the source code provided in the article, I ran the "bucket test" for both CRC32C and Jenkins96. These tables show the probability that a truly uniform distribution would look worse than the measured result by chance alone, so higher numbers are better. The author considered 0.05 or lower to be weak and 0.01 or lower to be very weak. I'm trusting the author on all of this and am just reporting results.

I placed an * by all the instances where CRC32C performed better than Jenkins96. By this simple tally, CRC32C was a more uniform hash than Jenkins96 in 54 of 96 cases. Especially if you can use the x86 CRC32 instruction, the speed/quality trade-off is excellent.

CRC32C (0x1EDC6F41)

       Uniform keys        Text keys         Sparse keys

Bits  Lower    Upper     Lower    Upper     Lower    Upper
 1    0.671   *0.671    *1.000    0.120    *0.572   *0.572
 2   *0.706   *0.165    *0.729   *0.919     0.277    0.440
 3   *0.878   *0.879    *0.556    0.362    *0.535   *0.542
 4    0.573    0.332     0.433    0.462    *0.855    0.393
 5    0.023   *0.681     0.470    0.907     0.266    0.059
 6   *0.145   *0.523     0.354   *0.172    *0.336    0.588
 7    0.424    0.722     0.172   *0.736     0.184   *0.842
 8   *0.767    0.507    *0.533    0.437     0.337    0.321
 9    0.480    0.725    *0.753   *0.807    *0.618    0.025
10   *0.719    0.161    *0.970   *0.740    *0.789    0.344
11   *0.610    0.225    *0.849   *0.814    *0.854   *0.003
12   *0.979   *0.239    *0.709    0.786     0.171   *0.865
13   *0.515    0.395     0.192    0.600     0.869   *0.238
14    0.089   *0.609     0.055   *0.414    *0.286   *0.398
15   *0.372   *0.719    *0.944    0.100    *0.852   *0.300
16    0.015   *0.946    *0.467    0.459     0.372   *0.793

And for Jenkins96, which the author of the article considered to be an excellent hash:

Jenkins96

      Uniform keys         Text keys         Sparse keys

Bits  Lower    Upper     Lower    Upper     Lower    Upper
 1    0.888    0.572     0.090    0.322     0.090    0.203
 2    0.198    0.027     0.505    0.447     0.729    0.825
 3    0.444    0.510     0.360    0.444     0.467    0.540
 4    0.974    0.783     0.724    0.971     0.439    0.902
 5    0.308    0.383     0.686    0.940     0.424    0.119
 6    0.138    0.505     0.907    0.103     0.300    0.891
 7    0.710    0.956     0.202    0.407     0.792    0.506
 8    0.031    0.552     0.229    0.573     0.407    0.688
 9    0.682    0.990     0.276    0.075     0.269    0.543
10    0.382    0.933     0.038    0.559     0.746    0.511
11    0.043    0.918     0.101    0.290     0.584    0.822
12    0.895    0.036     0.207    0.966     0.486    0.533
13    0.290    0.872     0.902    0.934     0.877    0.155
14    0.859    0.568     0.428    0.027     0.136    0.265
15    0.290    0.420     0.915    0.465     0.532    0.059
16    0.155    0.922     0.036    0.577     0.545    0.336

---- EDIT ---- Fixed out-of-date links and minor cleanup.

srking
  • @Mark, the author did not use the CRC32C polynomial. CRC32C works just fine as a hash for bucketizing strings of bytes in his test program. – srking Jun 10 '12 at 22:36
  • Good research! +1. However I still don't think that even with a crc32 instruction, it will beat hash algorithms designed for the purpose of (non-cryptographic) hashing. You can find some more advanced hash algorithm development and testing here: http://code.google.com/p/smhasher/ . – Mark Adler Jun 10 '12 at 23:19
  • @Mark, thanks for another good link. I don't know about the hash quality, but the author quotes ~29 clocks per 16 bytes for MurmurHash3, vs. the CRC32C instruction at ~6 clocks per 16 bytes. I'm sticking to my story that the CRC32C instruction is the sweet spot. – srking Jun 10 '12 at 23:39
  • That is darned fast, and may work just fine, depending on the application for the hash. All CRCs, regardless of the polynomial, dramatically fail the avalanche hash test. See this page from the first link I provided: http://home.comcast.net/~bretm/hash/8.html . – Mark Adler Jun 11 '12 at 05:09
  • Just as a side note, Bret Mulvey moved that site some months ago to: http://bretmulvey.com/hash/ – Nico Erfurth Aug 27 '14 at 10:28
  • The Intel CRC32 instruction has a latency of 3 and a throughput of 1, and can take 8 bytes at a time (if I understood the documentation right). That means you can get 8 bytes per cycle (if you have a sufficiently long input). For smaller inputs: 8 bytes = 4 cycles (3 + 1), 16 bytes = 5 cycles (3 + 1 + 1), 24 bytes = 6 (3 + 3), and so on. – RubenLaguna Jul 02 '15 at 13:00
  • After researching a little bit more: the 3-cycle latency means that if the next CRC32 instruction depends on the output of the first, it has to wait. That's why you often see the input split into 3 chunks, with all chunks processed in parallel, so that those 3 instructions don't depend on each other and the pipeline can be kept full. – RubenLaguna Jul 02 '15 at 21:51
  • SMHasher now has a fork on Github: https://github.com/rurban/smhasher Interestingly, CRC is mentioned in the section with problematic hash functions with this note: "insecure, 100% bias, collisions, distrib". I'm not sure what that means :) – Massimiliano May 16 '16 at 18:37
  • @NicoErfurth: Thanks from the future, the Web Archive didn't snag the page before it disappeared. – i336_ May 04 '17 at 10:50
  • Still no. Both CRC-32 and CRC-32C fail the avalanche test dramatically. – Mark Adler Nov 18 '18 at 17:03
  • The page has moved again and now lives at https://papa.bretmulvey.com/post/124027987928/hash-functions – Lorenz Sep 30 '20 at 11:35
  • I can confirm that crc32 (including crc32c) is rather bad in terms of collisions: I have just done some tests of my hash table on a 150k-word English dictionary with 300k capacity, and crc32 gave on average 1140 collisions per hash, whereas FNV-1a gave only 1.5. – abel1502 Apr 10 '21 at 19:14
  • @srking, please check my edit and add/update the links. – Gabriel Staples Feb 22 '22 at 16:49
  • @GabrielStaples Done, I think. – srking Feb 22 '22 at 22:13

I don't know why Mark Adler said that "crc32 poorly distributes the input bits to hash". No single bit of the crc32 hash is simply a copy of an input bit: every bit of the hash is a linear combination of the input bits. Secondly, a CRC always maps the same number of distinct input sequences to each hash value. For example, among 1000-bit messages, exactly 2^(1000-32) sequences produce any given crc32 value, no more, no less.
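The linearity claim above is easy to check empirically. Using the ordinary CRC-32 from Python's zlib (CRC32C behaves the same way): for equal-length messages, crc(a xor b) equals crc(a) xor crc(b) xor crc(zeros), where the crc(zeros) term absorbs the fixed init/final-xor constants:

```python
import os
import zlib

a = os.urandom(64)
b = os.urandom(64)
zeros = bytes(64)  # all-zero message of the same length
x = bytes(p ^ q for p, q in zip(a, b))

# CRC is affine over GF(2): every output bit is a linear
# combination of the input bits plus a length-dependent constant.
lhs = zlib.crc32(x)
rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(zeros)
assert lhs == rhs
```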

If you do not need the security properties, a CRC can serve as a hash perfectly well.

Actually, if you need a longer hash (a CRC-256, say), I think other non-secure hash functions may be simpler to implement than a CRC.

Heng Tang
  • I believe he said that because CRC fails statistical randomness tests – a good hash should be uniformly distributed across the code range, with no bias toward certain bits. – bryc Dec 05 '17 at 12:56

CRC32 maps each byte to a 32-bit integer before accumulating it into the hash with xor. That means each byte directly affects only 8 of the 32 bits in your hash. Of course CRC32 does shifting too, but that only sweeps the problem under the rug: it will distribute keys unevenly, with heavy clustering in some regions. Such a hash may appear to work fine until you hit one of those regions, and suddenly your O(1) hash table turns into an O(n) one.

CRC32 was designed for detecting damaged files, not for hashing. And as Mark mentioned, it won't protect your files from deliberate modification, since attackers can still modify them at will by inserting a properly crafted 32-bit value after the change.
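That forgeability follows from the same linear structure. One quick way to see it (a sketch with Python's zlib CRC-32, not the full patch construction; the helper name is my own): append a message's own CRC, and the checksum of the whole collapses to a constant residue no matter what the message was:

```python
import struct
import zlib

def crc_of_self_checked(msg: bytes) -> int:
    # Append the message's CRC-32 (little-endian, as it is usually
    # stored) and take the CRC of message-plus-checksum.
    return zlib.crc32(msg + struct.pack("<I", zlib.crc32(msg)))

# The result is the same fixed residue for every message -- the
# algebraic structure an attacker exploits when crafting a 32-bit
# patch to restore a chosen CRC after modifying a file.
assert crc_of_self_checked(b"hello") == crc_of_self_checked(b"a much longer, different message")
```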

Nash Gold