
This is in reference to "Get MD5 hash of big files in Python" and "Hashlib in Windows and Linux".

In the answers to both of these questions, it is advised to pass larger chunks of data to md5.update() to improve performance.

All testing I have done appears to indicate that use of smaller chunks gives the best performance.

Consider the following code:

import hashlib
import timeit

largetestfile = 'largetestfile'  # placeholder: path to a large test file (700 MB - 1.2 GB)

def test(factor):
    filehash = hashlib.md5()
    # chunk size is a multiple of the md5 block size (64 bytes)
    blk_size_to_read = filehash.block_size * (2**factor)
    with open(largetestfile, 'rb') as f:
        read_data = f.read(blk_size_to_read)
        filehash.update(read_data)
    filehash.digest()

if __name__ == '__main__':
    for ctr in xrange(0, 12):
        funcstr = "test({})".format(str(ctr))
        timetaken = timeit.timeit(funcstr, setup="from __main__ import test", number = 5000)
        print "Factor: {} Time: {}".format(str(ctr), str(timetaken))

All tests I have done indicate that the best performance is achieved when using factor 0 or 1 (that is, 64 or 128 bytes).

Any reason why I am seeing different results from those indicated in the questions quoted?

I have tried binary and plain-text files with sizes ranging from 700 MB to 1.2 GB, and am using Python 2.7.3 on Ubuntu 12.04.

Secondary question: am I using timeit the way it should be?


1 Answer


Found the error! I was reading just one chunk and then doing nothing else, so only the first chunk was ever hashed!

Changed

with open(largetestfile, 'rb') as f:
    read_data = f.read(blk_size_to_read)
    filehash.update(read_data)

to

with open(largetestfile, 'rb') as f:
    while True:
        read_data = f.read(blk_size_to_read)
        if not read_data:
            break
        filehash.update(read_data)

to fix the issue.
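Putting the fix together with the code from the question, the corrected test() reads as follows (largetestfile is still just a placeholder path for a large test file):

import hashlib

largetestfile = 'largetestfile'  # placeholder: path to a large test file

def test(factor):
    filehash = hashlib.md5()
    blk_size_to_read = filehash.block_size * (2**factor)
    with open(largetestfile, 'rb') as f:
        # keep reading until read() returns an empty string (end of file)
        while True:
            read_data = f.read(blk_size_to_read)
            if not read_data:
                break
            filehash.update(read_data)
    filehash.digest()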

UPDATE:

I ran a slightly modified version of the program above to establish the optimal buffer size to use when calling update() incrementally to hash a given file. I also wanted to establish whether there is any benefit (other than memory constraints) to incremental hashing rather than calculating the hash of the file in one go.

I created 20 files (with random data) for this, with file sizes starting at 4096 bytes and going up to 2.1 GB. The md5 hash of each file was calculated using buffer sizes starting at 2**6 bytes (64 bytes, the block size) up to 2**20 bytes. Each of these was run 100 times using timeit and the shortest execution time was recorded. The execution time for hashing the entire file in one go was also recorded.
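The full script is linked in the notes at the end; a stripped-down sketch of the timing approach looks roughly like this (the function names here are illustrative, not the ones in the linked script, and only one of the test files is shown):

import hashlib
import timeit

def hash_chunked(path, buf_size):
    # Incremental hashing: feed the file to update() buf_size bytes at a time.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

def hash_whole(path):
    # Hash the entire file contents in one go (single read, single update via the constructor).
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

if __name__ == '__main__':
    testfile = 'file15.txt'  # one of the generated random-data files (4194304 bytes)
    for exp in range(6, 21):  # buffer sizes from 2**6 (64 bytes) up to 2**20 bytes
        buf_size = 2 ** exp
        stmt = "hash_chunked('{}', {})".format(testfile, buf_size)
        times = timeit.repeat(stmt, setup="from __main__ import hash_chunked",
                              repeat=100, number=1)
        print "Buffer: {:>7}  Best: {:.7f}".format(buf_size, min(times))
    whole = timeit.repeat("hash_whole('{}')".format(testfile),
                          setup="from __main__ import hash_whole",
                          repeat=100, number=1)
    print "Whole file   Best: {:.7f}".format(min(whole))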

The results are as follows...

FileName           Filesize       Chunksize      Chunked Time   Complete Time       %diff
file5.txt                 4096           4096      0.0014789      0.0014701         -0.60%
file6.txt                 8192         524288      0.0021310      0.0021060         -1.19%
file7.txt                16384          16384      0.0033200      0.0033162         -0.12%
file8.txt                32768          65536      0.0061381      0.0057440         -6.86%
file9.txt                65536          65536      0.0106990      0.0112500          4.90%
file10.txt              131072         131072      0.0203800      0.0206621          1.37%
file11.txt              262144         524288      0.0396681      0.0401120          1.11%
file12.txt              524288        1048576      0.0780780      0.0787551          0.86%
file13.txt             1048576        1048576      0.1552539      0.1564729          0.78%
file14.txt             2097152         262144      0.3101590      0.3167789          2.09%
file15.txt             4194304          65536      0.6295781      0.6477270          2.80%
file16.txt             8388608         524288      1.2633710      1.3030031          3.04%
file17.txt            16777216         524288      2.5265670      2.5925691          2.55%
file18.txt            33554432          65536      5.0558681      5.8452392         13.50%
file19.txt            67108864          65536     10.1133211     11.6993010         13.56%
file20.txt           134217728         524288     20.2226040     23.3923230         13.55%
file21.txt           268435456          65536     40.4060180     46.6972852         13.47%
file22.txt           536870912          65536     80.9403431     93.4165111         13.36%
file23.txt          1073741824         524288    161.8108051    187.1303582         13.53%
file24.txt          2147483648          65536    323.4812710    374.3899529         13.60%

Chunked Time is the execution time when the file is broken into chunks and hashed incrementally; Complete Time is the execution time when the entire file is hashed in one go. %diff is the percentage difference between Chunked Time and Complete Time. Sizes are in bytes and times are in seconds.

Observations:

  1. For smaller files the optimal chunk size is nearly always equal to the file size, and there appears to be no advantage in adopting either approach.
  2. For larger files (33554432 (2**25) bytes and above), there appears to be a considerable performance benefit (less time) in using the incremental approach rather than hashing the entire file in one go.
  3. For larger files the best chunk/buffer size is 65536 (2**16) bytes (see the sketch after this list).
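In practice, observation 3 boils down to a helper along these lines (a sketch only; the function name is illustrative):

import hashlib

def md5_of_file(path, buf_size=65536):
    # 65536 (2**16) byte chunks: larger buffers showed no further gain in the tests above.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()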

Notes: Python 2.7.3; Ubuntu 12.04, 64-bit; 8 GB of RAM. The code used for this is available here: http://pastebin.com/VxH7bL2X

  • For curiosity's sake, can you tell us what you found to be the optimal chunk size? – 2rs2ts Jul 19 '13 at 19:38
  • The performance will asymptotically increase towards the maximum theoretical thruput of your system running the md5 code as your chunk size increases. By the time you're buffering 1MiB any increase in speed has long since become irrelevant. If you want to pick an arbitrary buffer size I suggest 128k. This is true of all hash functions. – gps Jul 19 '13 at 20:30
  • @2rs2ts the optimal size is 65536 bytes. See update to my answer above. – Verma Jul 21 '13 at 12:57
  • this is awesome - thank you Verma for writing it. I've made a Python 3 version of this that folks are welcome to use if they want to benchmark on newer systems. https://pastebin.com/KCQSTfhV – Peter Gaultney May 09 '23 at 23:35
  • In my own tests, I would agree with @gps that the time-to-hash-per-byte does not go down appreciably past 65536 bytes regardless of how large the file is. – Peter Gaultney May 10 '23 at 00:04