
I'm using the following code to get an MD5 hash for several files with a total size of approximately 1 GB:

import hashlib

md5 = hashlib.md5()
with open(filename, 'rb') as f:
    # Read the file in chunks so the whole file never sits in memory at once.
    for chunk in iter(lambda: f.read(128 * md5.block_size), b''):
        md5.update(chunk)
fileHash = md5.hexdigest()

For me it runs pretty fast, taking about 3 seconds to complete. But unfortunately for my users (who have old PCs), this method is very slow, and from my observations it may take about 4 minutes for some users to get all of the file hashes. This is a very annoying process for them, but at the same time I think this is the simplest and fastest way possible - am I right?

Would it be possible to speed up the hash-collecting process somehow?

Lucas

1 Answer


I have a fairly weak laptop as well, and I just tried it - I can MD5 one GB in about four seconds too. For it to take several minutes, I suspect the bottleneck is not the calculation but reading the file from the hard disk. Try reading 1 MB blocks, i.e., f.read(2**20). That should need far fewer reads and increase the overall reading speed.
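
For illustration, here is the question's loop with the larger buffer - a minimal sketch, assuming the same filename variable as in the question:

import hashlib

md5 = hashlib.md5()
with open(filename, 'rb') as f:
    # 2**20 bytes = 1 MB per read - far fewer disk reads than the
    # original 128 * md5.block_size (only 8 KB) chunks.
    for chunk in iter(lambda: f.read(2**20), b''):
        md5.update(chunk)
fileHash = md5.hexdigest()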

Stefan Pochmann
  • Btw, I have tried it on my HDD; the first run took about 12 seconds to complete, while the second run took only about 2.3 seconds. Could this be cached somewhere? – Lucas May 12 '15 at 01:44
  • The 2.3 seconds was almost certainly thanks to the file being in RAM. I doubt you have an HDD that can read 434 MB/s. I'm not sure that even exists. Not sure about the 12 seconds. You could try rebooting your PC or filling the RAM up with other stuff before trying the test again. – Stefan Pochmann May 12 '15 at 01:47
  • Can you tell me which HDD you have? – Stefan Pochmann May 12 '15 at 01:48
  • It's Seagate ST2000DM001. – Lucas May 12 '15 at 01:52
  • Hmm, that one can apparently read 200+ MB/s sequentially, so the 12 seconds sound reasonable if it does that with small blocks as well. Was that with the 128*md5.block_size btw, or with the suggested 1M blocks? – Stefan Pochmann May 12 '15 at 01:58
  • It was 128*md5.block_size for the first three times; then I changed it to the suggested 2**20. – Lucas May 12 '15 at 02:01
  • Not necessarily. Did you wait for the system to load completely, or was it maybe still busy getting ready? Mine settles down about a minute after logging in. Also, had you accessed some of your files before your earlier 12 seconds test, so *part* of the 1 GB might've been in cache? – Stefan Pochmann May 12 '15 at 02:51
  • You can ignore the previous (removed) comment; it was a false result (because I made a typo: 2*20 instead of 2**20). So, compared to the 12 seconds earlier, now (with the 1 MB blocks) it takes 7 seconds. That's better, but still not that fast. Do you think I can increase it to 2 MB blocks, or would that be too much? Is there anything else I can do? – Lucas May 12 '15 at 03:27
  • Yes, 2 MB is alright as well. Your program will take around 10 MB anyway, and you surely have plenty more RAM. But I doubt it'll help much. Your original size, 128*md5.block_size, at least on my PC, is only 8 KB. There's a big difference (or rather factor) between 8 KB and 1 MB, but not so much between 1 MB and 2 MB. – Stefan Pochmann May 12 '15 at 03:35
  • Yeah, like I said, I can't get below 7 seconds on my HDD. That's still a lot. Curious if there is anything left I can do, other than that. – Lucas May 12 '15 at 05:45
  • I wouldn't call 7 seconds a lot; you're reading (and processing) a lot of data from the HDD, after all. And it's nowhere near the 4 minutes. Btw, any chance those users had the funny idea of doing it from a USB stick? That could also explain the slowness. One more thing you could try is `readinto`. I've updated [an answer of mine](http://stackoverflow.com/a/30064243/1672429) with that now; see the sketch after this comment thread. – Stefan Pochmann May 12 '15 at 14:18
  • It's worth mentioning that it's 7 seconds for me, while users are complaining about a few minutes. There is no chance those users are doing it from a USB stick. My app is a file-patcher type app, so basically all it does is compare the hashes on the local (user) machine and the remote server and download a fresh file if there is a difference. – Lucas May 12 '15 at 14:40
  • Yeah yeah, I understood that the 7 seconds was for you, not those users, and I'm looking forward to hearing what difference the larger buffer makes for them. – Stefan Pochmann May 12 '15 at 15:17
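
For reference, here is a minimal sketch of the `readinto` variant suggested in the comments. This is an illustration under assumptions - md5_file is a hypothetical helper name, and the code is not necessarily identical to the linked answer. Pre-allocating one buffer lets every read fill the same memory instead of creating a new bytes object per chunk:

import hashlib

def md5_file(filename, blocksize=2**20):
    md5 = hashlib.md5()
    buf = bytearray(blocksize)   # one reusable 1 MB buffer
    view = memoryview(buf)       # lets us slice without copying
    with open(filename, 'rb') as f:
        while True:
            n = f.readinto(buf)  # fills buf in place, returns byte count
            if not n:            # 0 bytes read means end of file
                break
            # Hash only the bytes actually read; the final block is
            # usually shorter than the buffer.
            md5.update(view[:n])
    return md5.hexdigest()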