
I download very large (2–300 GB) files from the internet. The files come with digests (MD5, SHA-256, ...).

I use a Python script:

import hashlib
import requests

hd = requests.get('URL', stream=True)
md5 = hashlib.md5()
with open('Out.File', 'wb') as out_file:
    for chunk in hd.iter_content(chunk_size=256 * 1024):
        out_file.write(chunk)
        md5.update(chunk)

The download periodically crashes. When that happens, I keep the partial file and continue the download later, using a `Range` header (`{"Range": "bytes=XXX-"}`) in the `get` call.

Can I save the state of the MD5 object and resume the computation when the download continues?

mkrieger1
TLaczy
  • Are you looking for something like the [pickle](https://docs.python.org/3/library/pickle.html) package? – 0x5453 Oct 25 '21 at 15:15
  • it may be a good attack route to also handle the crash in your program, does it raise some Exception? – ti7 Oct 25 '21 at 15:17
  • 2
    @0x5453 :( I tried pickle: `TypeError: cannot pickle '_hashlib.HASH' object` –  Oct 25 '21 at 15:19
  • 1
    this may be what you're looking for https://stackoverflow.com/questions/5865824/hash-algorithm-for-dynamic-growing-streaming-data (linked from https://stackoverflow.com/questions/2130892/persisting-hashlib-state ) – ti7 Oct 25 '21 at 15:29
  • I like the linked question from @ti7 but different algorithms will have different state. You could delay the hash until after you've downloaded the file (or do it inline until failure then fall back to post download). – tdelaney Oct 25 '21 at 15:36
  • The advantage to hashing the final product on disk is that it protects you from bugs in your own implementation of the downloader. Suppose your chunking got messed up somewhere and what made it to disk is different than what was hashed. – tdelaney Oct 25 '21 at 15:52
  • maybe first download full file and later create `md5` and use `md5.update()` - this way you don't have to save it. – furas Oct 26 '21 at 06:48
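A sketch of the workaround the comments converge on, under the assumption that persisting the `hashlib` object itself is off the table (it cannot be pickled): on resume, rehash the bytes already on disk into a fresh MD5 object, then keep updating that same object while appending the remaining bytes fetched with a `Range` request. The function and variable names (`rehash_partial`, `resume_download`), the timeout, and the chunk size here are illustrative choices, not from the original script.

```python
import hashlib
import os
import requests

CHUNK = 256 * 1024  # 256 KiB, matching the chunk size in the question

def rehash_partial(path, chunk_size=CHUNK):
    """Feed the bytes already on disk into a fresh MD5; return (md5, offset)."""
    md5 = hashlib.md5()
    offset = 0
    if os.path.exists(path):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
                offset += len(chunk)
    return md5, offset

def resume_download(url, path, chunk_size=CHUNK):
    """Continue a partial download and the MD5 computation together."""
    md5, offset = rehash_partial(path, chunk_size)
    # Ask the server for the remaining bytes only (requires Range support).
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, stream=True, headers=headers, timeout=60) as r:
        r.raise_for_status()
        with open(path, "ab") as out_file:
            for chunk in r.iter_content(chunk_size=chunk_size):
                out_file.write(chunk)
                md5.update(chunk)
    return md5.hexdigest()
```

Rehashing a few hundred GB from local disk takes minutes rather than the hours a re-download would, and (as tdelaney notes above) hashing what actually reached disk also guards against chunking bugs in the downloader itself.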

0 Answers