There is some good discussion on Stack Overflow of computing hashes of files in Python, including files too large to fit in memory (1, 2, 3). The answers end up with solutions that look something like this (lightly edited from #3):
    import hashlib

    def md5_file(path, size):
        # Hash the file in blocks of `size` bytes so the whole file
        # never has to be held in memory at once.
        m = hashlib.md5()
        with open(path, 'rb') as f:
            b = f.read(size)
            while len(b) > 0:
                m.update(b)
                b = f.read(size)
        return m.digest()
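For reference, I call it along these lines; the path is just a placeholder, and 64 KiB is the block size suggested in #3:

    digest = md5_file('/path/to/some/large/file', 64 * 1024)  # placeholder path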
If you don't need your hash function to be cryptographically secure (which I don't), then there's pyfasthash (seemingly a.k.a. pyhash), as discussed here. Unfortunately, pyfasthash's hash classes lack the update method used above. I haven't had much luck figuring out what else to do; the Python/C code mixture is beyond me. For now I'm just reading the whole file into memory like this:
    with open(path, 'rb') as afile:
        return hasher(afile.read())
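Put into a function, it's just a small wrapper like this sketch (the function name is mine, and hasher stands for whichever pyfasthash hasher object I've constructed):

    def fasthash_file_whole(path, hasher):
        # Read the whole file and hash it in a single call, since there's
        # no update()-style interface to feed it one block at a time.
        with open(path, 'rb') as afile:
            return hasher(afile.read())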
The disadvantages of this approach are:
- The file has to fit in memory
- It's slower. According to #3, you want to read the file in chunks small enough to avoid page swaps (about 64 KiB on the poster's machine), rather than all at once.
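To make the second point concrete, here is the kind of block-at-a-time loop that works with md5 above. I'm using zlib.crc32 from the standard library purely as an example of a non-cryptographic checksum that supports running updates; the function name and the 64 KiB default are my own choices:

    import zlib

    def crc32_file(path, size=64 * 1024):
        # Running CRC-32 over the file, one block at a time, so the
        # whole file never has to sit in memory.
        crc = 0
        with open(path, 'rb') as f:
            b = f.read(size)
            while len(b) > 0:
                crc = zlib.crc32(b, crc)
                b = f.read(size)
        return crc

This is the style of loop I can't reproduce with pyfasthash, since its hashers only take the complete input in one call.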
Is there any way I can calculate hashes of my files more quickly?