When we want to compute the hash of a big file in Python with Python's hashlib, we can process the data in chunks of 1024 bytes, like this:
    import hashlib

    m = hashlib.md5()
    chunksize = 1024
    with open("large.txt", 'rb') as f:
        while True:
            chunk = f.read(chunksize)
            if not chunk:
                break
            m.update(chunk)
    print(m.hexdigest())
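The chunked loop above can be wrapped into a small reusable helper, for instance like this sketch (the name `hash_file` is made up here; `iter()` with a `b""` sentinel replaces the explicit `while`/`break`):

```python
import hashlib

def hash_file(path, algorithm="md5", chunksize=1024):
    # Hypothetical helper: stream a file through any hashlib algorithm
    # without loading it entirely into memory.
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() calls f.read(chunksize) until it returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunksize), b""):
            h.update(chunk)
    return h.hexdigest()
```

This keeps the chunking logic in one place, so callers only deal with a path and an algorithm name.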
or we can simply skip the chunking and read the whole file at once, like this:
    import hashlib

    sha256 = hashlib.sha256()
    with open("large.txt", 'rb') as f:
        sha256.update(f.read())  # reads the whole file into memory at once
    print(sha256.hexdigest())
Finding an optimal implementation can be tricky and would require some performance testing and tuning (1024-byte chunks? 4 KB? 64 KB? etc.), as detailed in "Hashing file in Python 3?" or "Getting a hash string for a very large file".
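One way to compare candidate chunk sizes is a quick micro-benchmark such as the following sketch (the file name, file size, and candidate sizes are arbitrary; the resulting digest is independent of the chunk size, only the throughput differs):

```python
import hashlib
import os
import time

def time_chunksize(path, chunksize):
    # Time one full SHA-256 pass over the file with the given chunk size.
    h = hashlib.sha256()
    start = time.perf_counter()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunksize), b""):
            h.update(chunk)
    return time.perf_counter() - start, h.hexdigest()

if __name__ == "__main__":
    path = "bench.bin"
    with open(path, "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024))  # 4 MB of random test data
    for size in (1024, 4096, 65536):
        elapsed, _ = time_chunksize(path, size)
        print(f"{size:>6} bytes/chunk: {elapsed:.4f} s")
    os.remove(path)
```

In practice, larger chunks (tens of KB) tend to reduce Python-level loop overhead, but the sweet spot depends on the machine and filesystem, hence the need to measure.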
Question: Is there a cross-platform, ready-to-use function to compute an MD5 or SHA256 hash of a big file with Python? (so that we don't need to reinvent the wheel, or worry about the optimal chunk size, etc.)
Something like:
    import hashlib

    # get the result without having to think about chunks, etc.
    hashlib.file_sha256('bigfile.txt')
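For what it's worth, recent Python versions do ship something close to this: `hashlib.file_digest(fileobj, digest)` was added in Python 3.11. It takes an open binary file object rather than a path, so a thin wrapper (a sketch, with a manual chunked fallback for older versions; the name `file_sha256` is hypothetical) gives the one-liner feel:

```python
import hashlib

def file_sha256(path):
    # Hypothetical wrapper: prefer hashlib.file_digest (Python 3.11+),
    # fall back to a manual chunked read on older versions.
    with open(path, "rb") as f:
        if hasattr(hashlib, "file_digest"):
            return hashlib.file_digest(f, "sha256").hexdigest()
        h = hashlib.sha256()
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
        return h.hexdigest()
```

`file_digest` handles the buffering internally, so on 3.11+ there is no need to pick a chunk size at all.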