Python MD5 not matching md5 in terminal

Question

I am getting the MD5 of several files using this Python code:

filehash = hashlib.md5(file)
print "FILE HASH: " + filehash.hexdigest()

However, when I go to the terminal and run

md5 file

the result does not match what my Python script outputs. Does anyone know why?

Douglas Leeder · Accepted Answer · 2010-02-09T13:32:26.077

22

hashlib.md5() takes the contents of the file, not its name.

See http://docs.python.org/library/hashlib.html

You need to open the file and read its contents before hashing it.

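import hashlib

# filename is the path of the file to hash
m = hashlib.md5()
with open(filename, 'rb') as f:  # binary mode: hash the raw bytes
    while True:
        # Don't read the entire file at once...
        data = f.read(10240)
        if len(data) == 0:
            break
        m.update(data)
print(m.hexdigest())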
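
As a minimal sketch of the point above, the chunked loop can be wrapped in a helper and compared with hashing the file's name. The helper name `md5_of_file`, the temporary file, and its contents are assumptions made for illustration. The code is written for Python 3, where `hashlib.md5()` accepts only bytes, so the name has to be encoded explicitly before it can be hashed at all.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=10240):
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Throwaway file with made-up contents, for illustration only.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world\n")
    demo_path = tmp.name

# Hashing the path text is effectively what the question's code did.
name_hash = hashlib.md5(demo_path.encode("utf-8")).hexdigest()
# Hashing the file's bytes is what the md5 command-line tool does.
content_hash = md5_of_file(demo_path)
os.remove(demo_path)

print("hash of name:     " + name_hash)
print("hash of contents: " + content_hash)
```

The two printed values differ, which is the same kind of mismatch described in the question.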

edited Feb 09 '10 at 13:32

answered Feb 09 '10 at 13:25


@Douglas Leeder: Is there something magic about the 10240 bytes read in `f.read(10240)`? Any idea on the optimum size? – dawg Dec 27 '10 at 06:18
10K is just an arbitrary value. I'm not sure what the fastest size would be, or what the best trade-off between memory use and speed would be. I suspect that 10K is large enough that the overhead of the read calls isn't too significant. – Douglas Leeder Dec 29 '10 at 09:20
@drewk So a magic number, but not magically selected, just arbitrary. If it matters to you, then you can do a micro-benchmark, but I doubt it'll make much difference compared with disc speed. – Douglas Leeder Dec 29 '10 at 09:21
@Douglas Leeder: I have done a *little* testing. MD5 processes input in 64-byte blocks (128 is a multiple of that), and I don't think `m.update(data)` works well with a chunk that is not a multiple of 128. My disc has 4096-byte sectors, so it probably makes sense to use a multiple of both the sector size and 128. I used `f.read(128*4096)`, or 512K, as a buffer, and it works a little faster. I tried much smaller sizes, like 128*5 and 128*10, and they were a lot slower. I would say that the 10K size you used (128*80, by the way) was the best speed vs. size tradeoff possible. Bigger was about 5% or 10% faster for me. – dawg Dec 29 '10 at 15:11
To calculate the md5, you can see this answer: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python/40961519#40961519 – Laurent LAPORTE Dec 04 '16 at 17:48
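
The chunk-size discussion in the comments above can be checked with a rough micro-benchmark. This is only a sketch: the helper name `md5_chunked`, the 4 MB random sample file, and the three chunk sizes are assumptions chosen for illustration, and the timings will vary with disc speed and OS caching. It also prints `block_size`, the internal block size that hashlib reports for MD5.

```python
import hashlib
import os
import tempfile
import time


def md5_chunked(path, chunk_size):
    """Hash a file's contents, reading chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Internal block size of the algorithm, as reported by hashlib.
print("MD5 block size in bytes: %d" % hashlib.md5().block_size)

# Sample file of random bytes (size chosen arbitrarily for this sketch).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    sample_path = tmp.name

results = {}
for size in (128 * 5, 10240, 128 * 4096):
    start = time.perf_counter()
    results[size] = md5_chunked(sample_path, size)
    elapsed = time.perf_counter() - start
    print("chunk %7d bytes: %.4f s  %s" % (size, elapsed, results[size]))

os.remove(sample_path)
```

Whatever the timings turn out to be, every chunk size yields the same digest, so the choice affects only speed and memory use.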

score 6 · Answer 2 · answered Feb 09 '10 at 13:30

$ md5 test.py
MD5 (test.py) = 04523172fa400cb2d45652d818103ac3
$ python
Python 2.6.1 (r261:67515, Jul  7 2009, 23:51:51) 
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> s = open('test.py','rb').read()
>>> hashlib.md5(s).hexdigest()
'04523172fa400cb2d45652d818103ac3'
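
To make the comparison in the session above easier to eyeball, a script can print its result in the same `MD5 (name) = digest` layout that the terminal command shows. This is a sketch under assumptions: the function name `bsd_md5_line` and the temporary demo file are hypothetical, and it reads the whole file at once, as the session above does, which is fine for small files.

```python
import hashlib
import os
import tempfile


def bsd_md5_line(path):
    """Format a file's MD5 like the terminal output shown above."""
    with open(path, "rb") as f:  # binary mode, so the raw bytes are hashed
        digest = hashlib.md5(f.read()).hexdigest()
    return "MD5 (%s) = %s" % (path, digest)


# Throwaway file with made-up contents, for illustration only.
with tempfile.NamedTemporaryFile(delete=False, suffix=".py") as tmp:
    tmp.write(b"print('hello')\n")
    demo_path = tmp.name

line = bsd_md5_line(demo_path)
os.remove(demo_path)
print(line)
```

If the printed line and the terminal output for the same file differ, the script is most likely hashing something other than the file's bytes.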

score 3 · Answer 3 · answered Feb 09 '10 at 13:27

Try this:

import hashlib
with open('filename', 'rb') as f:
    filehash = hashlib.md5(f.read())
print("FILE HASH: " + filehash.hexdigest())

MattH


score 1 · Answer 4 · answered Feb 09 '10 at 13:25

What is `file`? It should be equal to `open(filename, 'rb').read()`. Is it?

SilentGhost

