
I download very large (2–300 GB) files from the internet. The files come with digests (MD5, SHA-256, ...).

I use a Python script:

import hashlib
import requests

hd = requests.get('URL', stream=True)
md5 = hashlib.md5()
with open('Out.File', 'wb') as out_file:
    for chunk in hd.iter_content(chunk_size=256 * 1024):
        out_file.write(chunk)
        md5.update(chunk)

The download periodically crashes. When that happens, I keep the partial file and continue the download later, using a `Range` header (`{"Range": "bytes=XXX-"}`) in the `get` call.

Can I save the state of the MD5 object and resume the computation when the download continues?

mkrieger1
TLaczy
  • Are you looking for something like the [pickle](https://docs.python.org/3/library/pickle.html) package? – 0x5453 Oct 25 '21 at 15:15
  • it may be a good attack route to also handle the crash in your program, does it raise some Exception? – ti7 Oct 25 '21 at 15:17
  • 2
    @0x5453 :( I tried pickle: `TypeError: cannot pickle '_hashlib.HASH' object` –  Oct 25 '21 at 15:19
  • 1
    this may be what you're looking for https://stackoverflow.com/questions/5865824/hash-algorithm-for-dynamic-growing-streaming-data (linked from https://stackoverflow.com/questions/2130892/persisting-hashlib-state ) – ti7 Oct 25 '21 at 15:29
  • I like the linked question from @ti7 but different algorithms will have different state. You could delay the hash until after you've downloaded the file (or do it inline until failure then fall back to post download). – tdelaney Oct 25 '21 at 15:36
  • The advantage to hashing the final product on disk is that it protects you from bugs in your own implementation of the downloader. Suppose your chunking got messed up somewhere and what made it to disk is different than what was hashed. – tdelaney Oct 25 '21 at 15:52
  • maybe first download full file and later create `md5` and use `md5.update()` - this way you don't have to save it. – furas Oct 26 '21 at 06:48
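A sketch of the workaround the comments converge on, under the assumption that persisting the `hashlib` object itself is off the table (it cannot be pickled): on resume, rehash the bytes already on disk into a fresh MD5 object, then keep updating that same object while appending the remaining bytes fetched with a `Range` request. The function and variable names (`rehash_partial`, `resume_download`), the timeout, and the chunk size here are illustrative choices, not from the original script.

```python
import hashlib
import os
import requests

CHUNK = 256 * 1024  # 256 KiB, matching the chunk size in the question

def rehash_partial(path, chunk_size=CHUNK):
    """Feed the bytes already on disk into a fresh MD5; return (md5, offset)."""
    md5 = hashlib.md5()
    offset = 0
    if os.path.exists(path):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
                offset += len(chunk)
    return md5, offset

def resume_download(url, path, chunk_size=CHUNK):
    """Continue a partial download and the MD5 computation together."""
    md5, offset = rehash_partial(path, chunk_size)
    # Ask the server for the remaining bytes only (requires Range support).
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    with requests.get(url, stream=True, headers=headers, timeout=60) as r:
        r.raise_for_status()
        with open(path, "ab") as out_file:
            for chunk in r.iter_content(chunk_size=chunk_size):
                out_file.write(chunk)
                md5.update(chunk)
    return md5.hexdigest()
```

Rehashing a few hundred GB from local disk takes minutes rather than the hours a re-download would, and (as tdelaney notes above) hashing what actually reached disk also guards against chunking bugs in the downloader itself.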

0 Answers