Found the error! I was reading just one chunk and then doing nothing with the rest of the file!
Changed
with open(largetestfile, 'rb') as f:
    read_data = f.read(blk_size_to_read)
    filehash.update(read_data)
to
with open(testfile, 'rb') as f:
    while True:
        read_data = f.read(blk_size_to_read)
        if not read_data:
            break
        filehash.update(read_data)
to fix the issue.
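For completeness, a self-contained version of the corrected pattern might look like the sketch below (the function name hash_file and its default buffer size are my own choices for illustration; 65536 bytes is the buffer size that came out best in the measurements further down):

import hashlib

def hash_file(path, blk_size_to_read=65536):
    """Return the MD5 hex digest of a file, hashing it in fixed-size chunks."""
    filehash = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            read_data = f.read(blk_size_to_read)
            if not read_data:      # an empty read means EOF
                break
            filehash.update(read_data)
    return filehash.hexdigest()

print(hash_file('testfile'))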
UPDATE:
I ran a slightly modified version of the program above to establish the best buffer size to use when hashing a file incrementally with update(), and to see whether incremental hashing offers any benefit (other than lower memory use) over calculating the hash of the file in one go.
I created 20 files of random data, with sizes ranging from 4096 bytes up to 2.1 GB. The MD5 hash of each file was calculated with buffer sizes from 2**6 bytes (64 bytes, the MD5 block size) up to 2**20 bytes. Each combination was run 100 times with timeit and the shortest execution time recorded. The execution time for hashing the entire file in one go was also recorded.
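The actual script is at the pastebin link at the end of this answer; as a rough illustration of the measurement approach (the helper names, the example file name and the reduced repeat count here are mine, not taken from the original script):

import hashlib
import timeit

def hash_chunked(path, buf_size):
    # Incremental hashing: read fixed-size chunks and feed them to update().
    h = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            h.update(data)
    return h.hexdigest()

def hash_whole(path):
    # Hash the whole file in one go (reads the entire file into memory).
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

path = 'file20.txt'   # one of the generated test files

# Time the chunked approach for each buffer size from 2**6 to 2**20 bytes,
# keeping the shortest run (the original used 100 runs per combination).
for buf_size in (2 ** n for n in range(6, 21)):
    t = min(timeit.repeat(lambda: hash_chunked(path, buf_size),
                          repeat=3, number=1))
    print('chunk %7d: %.7f s' % (buf_size, t))

# Time hashing the entire file in one go for comparison.
t = min(timeit.repeat(lambda: hash_whole(path), repeat=3, number=1))
print('whole file : %.7f s' % t)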
The results are as follows...
FileName      Filesize  Chunksize  Chunked Time  Complete Time   %diff
file5.txt         4096       4096     0.0014789      0.0014701  -0.60%
file6.txt         8192     524288     0.0021310      0.0021060  -1.19%
file7.txt        16384      16384     0.0033200      0.0033162  -0.12%
file8.txt        32768      65536     0.0061381      0.0057440  -6.86%
file9.txt        65536      65536     0.0106990      0.0112500   4.90%
file10.txt      131072     131072     0.0203800      0.0206621   1.37%
file11.txt      262144     524288     0.0396681      0.0401120   1.11%
file12.txt      524288    1048576     0.0780780      0.0787551   0.86%
file13.txt     1048576    1048576     0.1552539      0.1564729   0.78%
file14.txt     2097152     262144     0.3101590      0.3167789   2.09%
file15.txt     4194304      65536     0.6295781      0.6477270   2.80%
file16.txt     8388608     524288     1.2633710      1.3030031   3.04%
file17.txt    16777216     524288     2.5265670      2.5925691   2.55%
file18.txt    33554432      65536     5.0558681      5.8452392  13.50%
file19.txt    67108864      65536    10.1133211     11.6993010  13.56%
file20.txt   134217728     524288    20.2226040     23.3923230  13.55%
file21.txt   268435456      65536    40.4060180     46.6972852  13.47%
file22.txt   536870912      65536    80.9403431     93.4165111  13.36%
file23.txt  1073741824     524288   161.8108051    187.1303582  13.53%
file24.txt  2147483648      65536   323.4812710    374.3899529  13.60%
The Chunked Time is the execution time when the file is broken into chunks and hashed incrementally; the Complete Time is the execution time when the entire file is hashed in one go. The %diff is the percentage difference between Chunked Time and Complete Time (from the figures, (Complete Time - Chunked Time) / Complete Time * 100, so a positive value means the chunked approach was faster).
Observations:
- For smaller files the best chunk size is nearly always equal to the file size, and there appears to be no advantage in adopting either approach.
- For larger files (33554432 (2**25) bytes and above), there appears to be a considerable performance benefit (less time) in using the incremental approach rather than hashing the entire file in one go.
- For larger files the best chunk/buffer size is 65536 (2**16) bytes.
Notes: Python 2.7.3; Ubuntu 12.06 64-bit; 8 GB of RAM
The code used for this is available here... http://pastebin.com/VxH7bL2X