I need a smart copy function for reliable and fast file copying and linking. The files are very large (from a few gigabytes to over 200 GB) and spread over many folders, and people may rename files and maybe folders during the day. I therefore want to use hashes to detect whether I've already copied a file, possibly under a different name, and only create a link in that case.

I'm completely new to hashing, and this is the function I'm currently using to hash:

import hashlib

def calculate_sha256(file_path, chunk_size=2 ** 10):
    '''
    Calculate the SHA-256 for a given file.

    @param file_path: The file path including the file name.
    @param chunk_size: The chunk size to allow reading of large files.
    @return: SHA-256 sum for the given file.
    '''
    sha256 = hashlib.sha256()
    with open(file_path, mode="rb") as f:
        # Read and hash the whole file chunk by chunk until EoF.
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()

This takes one minute for a 3GB file, so in the end, the process might be very slow for a 16TB HD.

Now my idea is to use some additional knowledge about the files' internal structure to speed things up: I know they contain a small header, then a lot of measurement data, and I know they contain real-time timestamps, so I'm quite sure that the chance that, let's say, the first 16MB of two files are identical, is very low (for that to happen, two files would need to be created at exactly the same time under exactly the same environmental conditions). So my conclusion is that it should be enough to hash only the first X MB of each file.
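
Roughly, the partial hash I have in mind would look something like this (only a sketch; the function name and the 16 MB default are placeholders):

import hashlib

def calculate_partial_sha256(file_path, max_bytes=16 * 2 ** 20, chunk_size=2 ** 20):
    '''
    Hash only the first max_bytes of a file (or the whole file if it is shorter).
    '''
    sha256 = hashlib.sha256()
    remaining = max_bytes
    with open(file_path, mode="rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # the file is shorter than max_bytes
            sha256.update(chunk)
            remaining -= len(chunk)
    return sha256.hexdigest()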

It works on my example data, but as I'm inexperienced, I just wanted to ask whether there is something I'm not aware of (a hidden danger, or a better way to do it).

Thank you very much!

Blutkoete

  • http://codereview.stackexchange.com/ – vaultah Jun 24 '14 at 14:43
  • 1
    You either need to do the math and see how likely it is that you have an unintended collision, or you need to somehow guarantee that different files *always have* a different header. In the latter case, you can *for sure* checksum only the header. In the former case, you need to decide on your own if the likelihood for collision is something you can live with or not. It is difficult to help without knowing your data. – Dr. Jan-Philip Gehrcke Jun 24 '14 at 14:47
  • 1
    You might avoid reinventing the wheel and use [rsync](http://en.wikipedia.org/wiki/Rsync). – John Kugelman Jun 24 '14 at 14:51
  • I'll switch to md5 then and incorporate information from the link you provided to improve my code, thank you very much. The code in the answers to that question looks so strikingly similar to my result that I'm wondering whether I read it some time ago and simply forgot about it. Checking the header should be ok - when I think about it, I'll just increase the amount of data the hash is calculated upon if I get a collision, until either (1) the hashes start to differ or (2) the files end. That should be the solution :). – Blutkoete Jun 24 '14 at 14:57
  • If I recall correctly, rsync won't create a link if I have the same file in different folders. – Blutkoete Jun 24 '14 at 15:09

1 Answer

You can get the MD5 hash of large files by breaking them into small byte chunks.

Also, calculating MD5 hashes is significantly faster than calculating SHA-256 hashes, so MD5 should be favored for performance in any application that doesn't rely on the hash for security purposes.
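
For example, a streaming computation along these lines would work (just a sketch; the chunk size and function name are arbitrary):

import hashlib

def md5_of_file(file_path, chunk_size=2 ** 20):
    """Compute the MD5 of a file by reading it in fixed-size chunks."""
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
    return md5.hexdigest()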

Alex W
  • I took all the comments and answers and the provided links into account, and now I'm using md5 and hashing the first 16MB. If I spot two files with the same hash, I recalculate their hashes for 32MB, then for 64MB, then ..., until either the hashes start to differ, one file reports EoF but the other doesn't (considering those two cases "not equal"), or both files report EoF and the hashes are the same (considering the files equal), roughly as sketched below. Thank you all! – Blutkoete Jun 25 '14 at 07:35
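
A sketch of that escalating comparison (the helper names, the MD5 prefix hashing and the 16 MB starting size are illustrative only):

import hashlib

def hash_prefix(file_path, num_bytes, chunk_size=2 ** 20):
    """MD5 of the first num_bytes of a file. Returns (hexdigest, reached_eof)."""
    md5 = hashlib.md5()
    remaining = num_bytes
    with open(file_path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                return md5.hexdigest(), True  # file ended before num_bytes
            md5.update(chunk)
            remaining -= len(chunk)
        # Peek one byte to see whether the file ends exactly at num_bytes.
        at_eof = not f.read(1)
    return md5.hexdigest(), at_eof

def files_probably_equal(path_a, path_b, start_bytes=16 * 2 ** 20):
    """Compare two files by hashing growing prefixes, doubling the size on each match."""
    num_bytes = start_bytes
    while True:
        hash_a, eof_a = hash_prefix(path_a, num_bytes)
        hash_b, eof_b = hash_prefix(path_b, num_bytes)
        if hash_a != hash_b:
            return False  # prefixes differ -> not equal
        if eof_a != eof_b:
            return False  # one file ended, the other did not -> not equal
        if eof_a and eof_b:
            return True   # both ended with identical hashes -> equal
        num_bytes *= 2    # same prefix hash, neither file ended -> read more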