5

I am computing the MD5 of several files using Python's hashlib:

filehash = hashlib.md5(file)
print "FILE HASH: " + filehash.hexdigest()

However, when I go to the terminal and run

md5 file

the result does not match what my Python script outputs. Does anyone know why?

peterh
balgan

4 Answers

22

hashlib.md5() takes the contents of the file, not its name.

See http://docs.python.org/library/hashlib.html

You need to open the file, and read its contents before hashing it.

import hashlib

# filename: path of the file to hash
with open(filename, 'rb') as f:
    m = hashlib.md5()
    while True:
        # Don't read the entire file at once...
        data = f.read(10240)
        if len(data) == 0:
            break
        m.update(data)
print(m.hexdigest())
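The same loop can be wrapped in a small reusable helper. This is only a sketch: the name `md5_of_file` and its `chunk_size` parameter are made up for illustration and are not part of hashlib.

```python
import hashlib


def md5_of_file(path, chunk_size=10240):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    m = hashlib.md5()
    # Binary mode, so the bytes hashed are exactly the bytes on disk.
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read() until it returns b''.
        for data in iter(lambda: f.read(chunk_size), b''):
            m.update(data)
    return m.hexdigest()
```

Calling `md5_of_file('test.py')` should then give the same hex string that a command-line MD5 tool reports for that file.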
Douglas Leeder
  • @Douglas Leeder: Is there something magic about the 10240 bytes read in `f.read(10240)`? Any idea on the optimum size? – dawg Dec 27 '10 at 06:18
  • 10K is just an arbitrary value. I'm not sure what would be the fastest size, or what the best trade-off between space and speed would be. I suspect that 10K is enough that the overhead of read calls isn't too significant. – Douglas Leeder Dec 29 '10 at 09:20
  • @drewk So a magic number, but not magically selected - just arbitrary. If it matters to you, then you can do a micro-benchmark, but I doubt it'll make much difference vs. disc speed. – Douglas Leeder Dec 29 '10 at 09:21
  • @Douglas Leeder: I have done a *little* testing. MD5 works on 64-byte blocks, and I don't think `m.update(data)` works well with a chunk that is not a multiple of the block size, so I tested multiples of 128. My disc has 4096-byte sectors, so it probably makes sense to use a multiple of both the sector size and 128. I used `f.read(128*4096)`, or 512k, as a buffer. That works a little faster. I tried much smaller sizes than that, like just `128*5` and `128*10`, and those were a lot slower. I would say that the 10K size you used (`128*80`, by the way) was the best speed vs. size tradeoff possible. Bigger was about 5% or 10% faster for me. – dawg Dec 29 '10 at 15:11
  • To calculate the md5, you can see this answer: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python/40961519#40961519 – Laurent LAPORTE Dec 04 '16 at 17:48
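For anyone who wants to repeat the chunk-size comparison from the comments, here is a rough timing sketch. The helper name `time_md5`, the 8 MiB test file and the list of sizes are assumptions chosen for illustration. The file will most likely be served from the OS cache, so this mostly measures read-call and hashing overhead rather than disc speed. The digest itself does not depend on the chunk size.

```python
import hashlib
import os
import tempfile
import time


def time_md5(path, chunk_size):
    """Hash `path` in chunks of `chunk_size` bytes; return (digest, seconds)."""
    m = hashlib.md5()
    start = time.time()
    with open(path, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            m.update(data)
    return m.hexdigest(), time.time() - start


# Throwaway 8 MiB file of random bytes (size chosen arbitrarily).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(os.urandom(8 * 1024 * 1024))

results = {}
# Sizes mentioned in the comments: 128*5, 128*10, 10K and 128*4096.
for size in (128 * 5, 128 * 10, 10240, 128 * 4096):
    digest, seconds = time_md5(path, size)
    results[size] = digest
    print("%7d bytes: %.4f s" % (size, seconds))

os.remove(path)
```

Timings will vary by machine and Python version, so treat any single run as indicative only.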
6
$ md5 test.py
MD5 (test.py) = 04523172fa400cb2d45652d818103ac3
$ python
Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> s = open('test.py','rb').read()
>>> hashlib.md5(s).hexdigest()
'04523172fa400cb2d45652d818103ac3'
telliott99
3

Try this

filehash = hashlib.md5(open('filename','rb').read())
print "FILE HASH: " + filehash.hexdigest()
MattH
1

What is file? It should be equal to open(filename, 'rb').read(). Is it?
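To see why this matters, the sketch below hashes a file's name and then its contents. It assumes that `file` in the question held the file's name, which the question does not actually say. The temporary file and the variable names are made up for the example.

```python
import hashlib
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
fd, filename = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'hello world')

# MD5 of the file *name* (encoded to bytes), not of the file itself.
name_hash = hashlib.md5(filename.encode('utf-8')).hexdigest()

# MD5 of the file *contents*, which is what a command-line md5 tool reports.
with open(filename, 'rb') as f:
    content_hash = hashlib.md5(f.read()).hexdigest()

print("name hash:    " + name_hash)
print("content hash: " + content_hash)

os.remove(filename)
```

The two digests differ because they are computed over different bytes; only the second one is comparable to the terminal output.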

SilentGhost