
That question has already been asked and answered a number of times on this site, but for some reason no one has come up with a relatively simple (in my opinion), more concise and probably more elegant solution. Perhaps that's because the solution is actually bad, and that's what I'm trying to figure out: if it's bad, I'd like to know how and why. One of the most popular answers was this:

import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Read fixed-size 4 KiB chunks until f.read() returns the b"" sentinel.
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

It's understandable - we don't want to load the whole file into memory, so we read it in chunks with the help of an iterator and a lambda function. Nice and simple. But presumably we could do this in an even simpler way by defining the md5sum function as follows:

def md5sum(fname):
    md5 = hashlib.md5()
    with open(fname, 'rb') as f:
        # Relies on the file object's own iteration to produce the chunks.
        for chunk in f:
            md5.update(chunk)
    return md5.hexdigest()

Conveniently, iterating over an open file handle gives us a sequence of its lines, and since we open the file with the 'b' flag in open(fname, 'rb'), each of those lines is a bytes object. What's wrong with doing that?
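
For illustration, here's a tiny sketch (the file name is just a throwaway example) showing that iterating in 'rb' mode yields the "lines" as bytes objects split on b'\n':

# 'demo.bin' is a placeholder name used only for this demonstration.
with open('demo.bin', 'wb') as f:
    f.write(b'abc\ndef\nghi')

with open('demo.bin', 'rb') as f:
    for chunk in f:
        print(repr(chunk))   # b'abc\n', then b'def\n', then b'ghi'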

weeCoder
  • It's probably a matter of which kind of file you're dealing with, in particular whether it's actually an ASCII file or a binary one. The original version gives more control over the chunk size, while your version is at the mercy of finding line feeds here and there. Also, to handle large files I'd not use just 4K of data at a time, but at least 100K, just to make sure there is no significant overhead in the "chunking". I had that experience once with the zip module, and 100K is nothing today. – Dr. V Nov 04 '16 at 08:31
  • @Dr.V I see, well I agree with almost everything you say, but as far as I can see, it works well for all sorts of files. – weeCoder Nov 04 '16 at 08:45
  • @weeCoder Try to create a huge file that does *not* contain the `\x0a` byte and see... the `for chunk in f` degrades to `chunk = f.read()`, reading the whole file into memory. – Bakuriu Nov 04 '16 at 08:54
  • @PM2Ring Oh thanks, that pretty much answers my question. If you think it would be of any use to others, you should post it as an answer. – weeCoder Nov 04 '16 at 08:54

1 Answer


What Dr. V said in the comments is correct.

Using `for chunk in f:` operates on chunks that end in b'\n' == b'\x0A'. That makes the chunk size very small for text files, and totally unpredictable for typical binary files: a binary file may not contain any 0A bytes. When that happens, `for chunk in f:` simply reads the whole file into a single chunk.
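
You can see the degradation with a small experiment (a rough sketch; the file name and size are arbitrary):

import os

# A file full of null bytes contains no b'\n', so line-based iteration
# returns the entire file as one "chunk".
with open('no_newlines.bin', 'wb') as f:
    f.write(b'\x00' * (16 * 1024 * 1024))    # 16 MiB, no 0x0A bytes anywhere

with open('no_newlines.bin', 'rb') as f:
    chunk_sizes = [len(chunk) for chunk in f]

print(chunk_sizes)    # [16777216] - the whole file arrived as a single chunk
os.remove('no_newlines.bin')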

That 4K chunk size should be OK, but you could try a chunk size of 64K or 128K to see if that improves the speed. In simple data-copying tests (using dd) I've found little benefit in using larger chunk sizes; bear in mind that modern OSes are good at file buffering and caching. OTOH, I'm running a rather old 32-bit single-core machine.
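
If you want to measure the effect of the chunk size yourself, something along these lines will do (a rough sketch; 'bigfile.bin' is a placeholder path, and the first pass also warms the OS file cache, so repeat the loop for a fairer comparison):

import hashlib
import time

def md5_chunked(fname, chunk_size):
    """MD5-hash fname, reading chunk_size bytes at a time."""
    h = hashlib.md5()
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Placeholder file name; point it at any large file you have lying around.
for size in (4 * 1024, 64 * 1024, 128 * 1024, 1024 * 1024):
    start = time.perf_counter()
    md5_chunked('bigfile.bin', size)
    print(f'{size // 1024:>5} KiB chunks: {time.perf_counter() - start:.3f} s')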

On the topic of hashing large files, you may be interested in a program I wrote that uses the OpenSSL crypto library to compute the SHA256 hash of large files. The notable feature of this program is that it's resumable: you can stop it at any time, and when you restart it, it will continue the hashing process.

And here's one that uses hashlib to compute the MD5 and SHA256 hashes of a file simultaneously.
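
In outline, that just means feeding each chunk to both hash objects during a single pass over the file, something like this (a minimal sketch, not the linked program itself):

import hashlib

def md5_and_sha256(fname, chunk_size=128 * 1024):
    """Compute the MD5 and SHA256 digests of fname in one pass."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            # The same chunk updates both hashes, so the file is read only once.
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()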

PM 2Ring