62

I'm building a system which needs to be able to find if blobs of bytes have been updated. Rather than storing the whole blob (they can be up to 5MBs), I'm thinking I should compute a checksum of it, store this and compute the same checksum a little bit later, to see whether the blob has been updated.

The goal is to minimize the following (in that order) :

  • size of the checksum
  • time to compute
  • likeliness of collisions (2 identical checksums happening even if the content has been modified).

It is acceptable for our system to have collision not more than 1/1,000,000. The concern is not security, but simply update/error detection, so rare collisions are ok. (Which is why I put it last in the things to minimize).

Also, we cannot modify the blobs of text ourselves.

Of course, md5, crc or sha1 come to mind, and if I wanted a quick solution, I'd go for it. However, more than a quick solution, I'm looking for what could be a comparison of different methods as well as the pros and cons.

Pang
  • 9,564
  • 146
  • 81
  • 122
Julien Genestoux
  • 31,046
  • 20
  • 66
  • 93
  • What is your concern, here? Are you simply checking to see whether your data blobs have changed since some earlier time, or are you trying to detect a malicious alteration? – dajames Nov 20 '10 at 14:22
  • Just trying to see if there has been any update in them. – Julien Genestoux Nov 20 '10 at 14:24
  • 1
    If you're not concerned about the possibility of malicious alteration but just want to track changes, and if (as you say elsewhere) you can live with an accidental collision probablility of one in a million, then go with CRC -- it's faster than MD5 or SHA and the chance of collisions is *well* within your spec. – dajames Nov 20 '10 at 15:33
  • 1
    I would recommend CRC-64. It is much much faster than cryptographic hashes and should meet your requirements for collision probability. – President James K. Polk Nov 20 '10 at 17:14

2 Answers2

32

I suggest you have a look to this SO page, CRC vs MD5/SHA1.
Speed and collisions are discussed in this other thread.
And as always Wikipedia is your friend.

If I had to choose, there is an important question to answer: do you want that in any case there is no collision - or, at least, that the probability is so low that it is close to the chance that the Moon collides with Earth within the next 5 minutes?

If yes, choose the SHA family.
In your case I would change the way the update check is being done.
For instance, an incremental number could be associated with the blob, and be sent instead of the hash, the request for update would be required if the number is different on the other side. The collision probability in this case goes from ~10^-18 to ~0 (basically 0 + bug probability )...

Edit following comments

Found this algorithm, Adler-32, which is good for long messages (MB) with a CRC of 32 bits, i.e. about ~1/10^9 (MD5 is 128 bits long).
It is fast to calculate.
Adler-32. There is some come sample (link) at the bottom.

Mark Adler
  • 101,978
  • 13
  • 118
  • 158
Déjà vu
  • 28,223
  • 6
  • 72
  • 100
  • I don't mind very rare collisions. On top of my head, something like 1/1,000,000 seems low enough (we'll be comparing the blobs on average every 15minutes, so that's one collision every 28k years. Also, I don't control the blobs of text, so I can't alter them myself. – Julien Genestoux Nov 20 '10 at 14:27
  • 1
    In this case you better go for MD5, faster than SHA but more collision-prone (probability being close to your requirement). – Déjà vu Nov 20 '10 at 14:33
  • but MD5 is 32bits, which is quite big and collision probabilty is much lower than 1/1,000,000... so I don't think it's a good candidate! We can do better! – Julien Genestoux Nov 20 '10 at 14:35
  • Ha, yes I meant Bytes :( Adler-32 loks interesting. – Julien Genestoux Nov 20 '10 at 15:33
  • 1
    note that when people say Alder-32 is good for long messages that means it is problematic for short messages. see this discussion about file system corruption issues where alder-32 is singled out as a weakness https://blog.acolyer.org/2017/03/08/redundancy-does-not-imply-fault-tolerance-analysis-of-distributed-storage-reactions-to-single-errors-and-corruptions/ – simbo1905 Mar 10 '17 at 15:20
  • Just a note on the computational cost of MD5 vs SHA1; on modern hardware with OpenSSL, SHA1 is often faster to compute. See this answer to another question that addresses this: https://stackoverflow.com/a/15324732/2781687 – Jason Mar 13 '23 at 10:43
2

Blake2 is the fastest hash function you can use and that is mainly adopted:

BLAKE2 is not only faster than the other good hash functions, it is even faster than MD5 or SHA-1 Source

Winner of SHA-3 contest was Keccak algorithm but is not yet has a popular implementation is not adopted by default in GNU/Linux distributions. Instead, Blake2 that was a SHA-3 contest candidate is faster than Keccak and is part of GNU coreutils. So on you GNU/Linux distribution you can use b2sum to use Blake2 hash algorithm.

noraj
  • 3,964
  • 1
  • 30
  • 38