
I'm writing a Python program to find duplicate files. Calculating MD5 and comparing file size is not 100% foolproof, since it is possible for two different files to have the same file size and MD5 (the collision probability is on the order of 1 in 2^128).

I was wondering: if I add another hash such as SHA1 (160-bit) or CRC32 (32-bit) to the mix, would that greatly increase the ability to identify unique files, i.e. comparing both the MD5 and SHA1 of a file for uniqueness? Is SHA1 or CRC32 preferable for this secondary check?

If so, how do I calculate both MD5 and SHA1/CRC32 simultaneously while iterating through 1 MB blocks of a very large file, so I avoid reading the large file twice? This is what I have for MD5:

import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read in 1 MB chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Currently my code, after checking for identical MD5s, would run filecmp.cmp on both files as the final check (sketched below). This is obviously resource-intensive and not efficient.
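
That final check is essentially the following (a minimal sketch; the two paths come from rows that already matched on size and MD5):

import filecmp

def definitely_identical(path_a, path_b):
    # shallow=False forces a byte-by-byte comparison instead of only comparing os.stat() info.
    return filecmp.cmp(path_a, path_b, shallow=False)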

I'm using SQLite to store these hashes. I assume it is slower than Python lists, but it doesn't run out of memory when processing billions of files (see the sketch below).
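
For context, the hash store looks roughly like this (a simplified sketch; the table and column names are just illustrative):

import sqlite3

conn = sqlite3.connect("hashes.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER, md5 TEXT)"
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_md5 ON files (md5)")

def record_file(path, size, digest):
    # Store (or refresh) one file's size and MD5 digest.
    conn.execute(
        "INSERT OR REPLACE INTO files (path, size, md5) VALUES (?, ?, ?)",
        (path, size, digest),
    )

def duplicate_candidates():
    # Paths sharing an MD5 are candidates for the final byte-by-byte check.
    return conn.execute(
        "SELECT md5, GROUP_CONCAT(path) FROM files GROUP BY md5 HAVING COUNT(*) > 1"
    ).fetchall()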

ebann
  • 2^128 is misleading, but the actual probability of a collision is small enough that you don't need to worry; MD5 is fine (for non-sensitive applications): https://stackoverflow.com/questions/4032209/is-md5-still-good-enough-to-uniquely-identify-files / https://crypto.stackexchange.com/questions/12677/strength-of-md5-in-finding-duplicate-files/12679 – Alex K. Jun 28 '17 at 13:36
  • Thanks for the comment. A "small enough" probability won't be enough for someone with 500+ GB of photo images to sort through and eliminate duplicates. Wondering if SHA1/CRC32 is enough to "guarantee" that. Perhaps filecmp.cmp is the only fool-proof way... – ebann Jun 28 '17 at 13:40
  • Using two hash functions should give you enough certainty to sleep peacefully ;) – Maciek Jun 28 '17 at 13:45

1 Answer


You already have the hardest part done. You just have to feed the chunks you read to another hasher:

import hashlib

def calculate_hashes(fname):
    hash_md5 = hashlib.md5()
    hash_sha1 = hashlib.sha1()
    with open(fname, "rb") as f:
        # Feed each 1 MB chunk to both hashers so the file is read only once.
        for chunk in iter(lambda: f.read(2 ** 20), b""):
            hash_md5.update(chunk)
            hash_sha1.update(chunk)
    return hash_md5.hexdigest(), hash_sha1.hexdigest()
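
For example, the pair of digests can then be used directly as a grouping key (a minimal usage sketch; candidate_paths is a hypothetical list of same-size files):

from collections import defaultdict

candidate_paths = ["a.jpg", "b.jpg"]  # hypothetical: files that already matched on size

groups = defaultdict(list)
for path in candidate_paths:
    groups[calculate_hashes(path)].append(path)

# Any key with more than one path is a likely duplicate set.
possible_duplicates = {key: paths for key, paths in groups.items() if len(paths) > 1}
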
Maciek
  • Thanks for the code snippet! Will implement it and time the results to see how it compares with using filecmp.cmp. Should be much faster... – ebann Jun 28 '17 at 13:42