
My current approach is this:

import hashlib

def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        # Read in chunks of 1024 * block_size bytes (64 KiB for md5) until EOF.
        for block in iter(lambda: f.read(1024 * func.block_size), b''):
            func.update(block)
    return func.hexdigest()

It takes about 3.5 seconds to calculate the MD5 sum of an 842 MB ISO file on an i5 @ 1.7 GHz. I have tried different methods of reading the file, but all of them yield slower results. Is there, perhaps, a faster solution?

EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hashing functions supported by hashlib is 64 bytes (except for 'sha384' and 'sha512', whose default block_size is 128). The read size is therefore still the same (65536 bytes).
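For reference, a quick sketch to check these values (block_size is reported in bytes):

import hashlib

# Print each algorithm's internal block size and the resulting read size.
for name in ('md5', 'sha1', 'sha256', 'sha384', 'sha512'):
    h = hashlib.new(name)
    print(name, h.block_size, 1024 * h.block_size)
# md5/sha1/sha256 -> 64 and 65536; sha384/sha512 -> 128 and 131072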

EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(

EDIT(3): Apparently Windows was using the disk at over 80% when I ran the function again. It really does take 3.5 seconds. Phew.

Another solution, slightly faster (by about 0.5 s), is to use os.open() and os.read():

import hashlib
import os

def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # os.O_BINARY is Windows-only; chunks are 2048 * block_size bytes (128 KiB for md5).
    f = os.open(path, (os.O_RDWR | os.O_BINARY))
    for block in iter(lambda: os.read(f, 2048*func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()

Note that these results are not final.
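One way to time a single run, for reference (a minimal sketch; the ISO path below is just a placeholder):

import time

start = time.perf_counter()
digest = get_hash(r'C:\isos\example.iso', 'md5')  # placeholder path
print(digest, '{:.3f} s'.format(time.perf_counter() - start))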

Deneb
  • How fast is it to calculate the MD5 of this file using the `md5sum` tool? – LutzHorn Mar 29 '14 at 16:57
  • @LutzHorn Since I'm not using a GNU/Linux distribution at the moment, I used GNU's 32-bit md5sum for Windows; it takes 8.5257151 seconds. – Deneb Mar 29 '14 at 17:32
  • So Python is not that bad :) – Mar 29 '14 at 17:40
  • Try using `os.open()` if you aren't using it already. – martineau Mar 29 '14 at 17:52
  • @martineau Can you elaborate? – Deneb Mar 29 '14 at 18:02
  • You have to hunt a little bit to find it in the [os module's](http://docs.python.org/2/library/os.html#os.open) documentation, but it's a lower-level version of the built-in `open()` function -- which returns a "file object" that sounds like some sort of wrapper -- so using `os.open()` might incur less overhead. – martineau Mar 29 '14 at 18:53
  • P.S. You'll also need to use `os.read()`. – martineau Mar 29 '14 at 18:56
  • Try block_size * 32768, which is 2MB. – greggo Mar 29 '14 at 20:28
  • @greggo It doesn't change much. – Deneb Mar 29 '14 at 22:10
  • There is no point in using hashfunc.block_size at all, it's a meaningless value that only exists as part of the APIs for legacy reasons. Just loop reading whatever size is efficient to read from disk for the purposes of your code and pass it to the hash function. As long as you read more than ~64KiB at a time you are unlikely to notice any measurable difference. – gps Mar 30 '14 at 07:10
  • @gps Actually there are noticeable differences. Increase or decrease the block_size substantially and your function will run either a few seconds faster or slower. – Deneb Mar 30 '14 at 17:33
  • My point was that the block_size attribute of the hash functions is entirely useless. You should not write code that uses it. Modifying it will do nothing. The only thing that matters is modifying an I/O buffer size. That has nothing to do with the hash functions internal block size. – gps Apr 04 '14 at 05:11
  • @gps, I know that. The thing is, I'm not modifying the block_size attribute of the hash function. I'm modifying the I/O buffer size by using the default block_size value (multiplied by some other value) as a parameter. – Deneb Apr 04 '14 at 11:54
  • My point is that the hash function "block_size" attribute is meaningless and shouldn't even be used for any purpose. Just pick an IO buffer size, don't attempt to base it off of block_size. – gps Apr 17 '14 at 17:15
  • @Deneb What processing time are you aiming at? It seems like the current processing time is close to what is technically possible. Optimizing without clear measurements (which you do have) and a target result (which I don't see here) can soon become an endless waste of time. – Jan Vlcinsky Apr 18 '14 at 16:07
  • @martineau The `open` function is being called once so replacing it with `os.open` will have literally zero effect. – Jeyekomon Sep 29 '22 at 12:55
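
A sketch of what gps suggests in the comments above -- pick a fixed I/O buffer size and ignore block_size entirely (the function name and the 128 KiB buffer are just illustrative choices):

import hashlib

def get_hash_fixed_buffer(path, hash_type='md5', buf_size=128 * 1024):
    # The I/O buffer size is chosen independently of the hash's internal block_size.
    func = hashlib.new(hash_type)
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(buf_size), b''):
            func.update(block)
    return func.hexdigest()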

1 Answer


Using an 874 MiB random data file, which required 2 seconds with the openssl md5 tool, I was able to improve speed as follows.

  • Using your first method required 21 seconds.
  • Reading the entire file (21 seconds) into a buffer and then updating required 2 seconds.
  • Using the following function with a buffer size of 8096 required 17 seconds.
  • Using the following function with a buffer size of 32767 required 11 seconds.
  • Using the following function with a buffer size of 65536 required 8 seconds.
  • Using the following function with a buffer size of 131072 required 8 seconds.
  • Using the following function with a buffer size of 1048576 required 12 seconds.
import hashlib
import time

def md5_speedcheck(path, size):
    pts = time.process_time()
    ats = time.time()
    m = hashlib.md5()
    with open(path, 'rb') as f:
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    print("{0:.3f} s".format(time.process_time() - pts))  # processor time
    print("{0:.3f} s".format(time.time() - ats))          # wall-clock ("human") time

Human time is what I noted above, whereas processor time is about the same for all of these; the difference is spent blocked on I/O.

The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
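
To reproduce the timings above, the function can be driven over the same buffer sizes, roughly like this (the path is just a placeholder):

for size in (8096, 32767, 65536, 131072, 1048576):
    print("buffer size:", size)
    md5_speedcheck('/path/to/random.dat', size)  # placeholder path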

Lance Helsten