0

Possible Duplicate:
Probability of SHA1 collisions

Let's say I'm trying to identify duplicate files in a file system. Would it be safe to say that if the files' SHA1 checksums match, that they're identical? Should I also look through their contents if they match?

I've read that the theoretical complexity of attack is 2^51 hash function calls. I've also read on SO that "For SHA1, which outputs 160 bits, the birthday attack reduces the complexity to 2^80. This should be safe for 30 years or more." Should I still double check to make sure the file contents match? I jast want to make sure my assignment won't produce an erroneous output when it's run under a test script.

Community
  • 1
  • 1
Rose Perrone
  • 61,572
  • 58
  • 208
  • 243
  • well the odds are good that they are the same but if it was me then i would double check. why not include a timestamp too so you can doubly be sure. – nathan hayfield Oct 31 '12 at 21:45
  • 1
    See http://stackoverflow.com/questions/1867191/probability-of-sha1-collisions – paddy Oct 31 '12 at 21:47
  • 2
    If this is for an assignment, you can reasonably assume that a test script is designed to break the most naive program. I would take Frank's advice on adding extra bytes to the hash. If that is not allowed within the scope of the assignment, then I guess don't worry about it. The probability is ridiculous. 10^42 can sometimes fool you into thinking it's a small number until you put it in words: one tredecillion == one million trillion trillion trillion : how many *atoms* are estimated to exist in the observable universe? A bit less than the square of that number. Wrap your head around that! – paddy Oct 31 '12 at 21:59
  • I recommend switching to SHA-2 (SHA-256 specifically). There is no known way even for a powerful attacker to generate SHA-2 collisions. For SHA-1 accidental collisions can be neglected, deliberate/malicious collisions are possible. – CodesInChaos Nov 02 '12 at 12:26

1 Answers1

2

There's a 1 in 2^160 chance that two given messages have the same hash (since SHA-1 produces a 160-bit hash).

Even if you have a million entries in your fileSystem, that's still a 1 in 10^42 chance that a new entry will share the same hash.

SHA-1 has proved to be fairly good, so I don't think you need to worry about collisions at all. If you need more you can add some quality attributes like a timestamp, filesize ..

Frank
  • 16,476
  • 7
  • 38
  • 51
  • Birthday paradox fail. If you have a million entries, there's 1000000^2-1 opportunities for a collision, not 1000000 - making the odds of a collision a probability of about 1 in 10^36, not 10^42. – Nick Johnson Nov 01 '12 at 10:59
  • To address the birthday paradox : Even with a Million entries... no worries... – Frank Nov 01 '12 at 18:37
  • A million isn't that many. All I was pointing out is that your math is wrong. – Nick Johnson Nov 02 '12 at 09:07