
I'm trying to verify the data consistency of some files, but the MD5 hashes my script computes keep coming out different. When I run md5sum on the same files, the hashes are equal:

import hashlib
import os
import sys

def hash_file_content(path):
    try:
        if not os.path.exists(path):
            raise IOError, "File does not exist"
        encode = hashlib.md5(path).hexdigest()
        return encode
    except Exception, e:
        print e

def main():
    hash1 = hash_file_content("./downloads/sample_file_1")
    hash2 = hash_file_content("./samples/sample_file_1")

    print hash1, hash2

if __name__ == "__main__":
    main()

The output is unexpectedly different:

baed6a40f91ee5c44488ecd9a2c6589e 490052e9b1d3994827f4c7859dc127f0

Now with md5sum:

md5sum ./samples/sample_file_1
9655c36a5fdf546f142ffc8b1b9b0d93  ./samples/sample_file_1

md5sum ./downloads/sample_file_1 
9655c36a5fdf546f142ffc8b1b9b0d93  ./downloads/sample_file_1

Why is this happening and how can I resolve it?


1 Answer


In your code, you are calculating the MD5 of the file path, not of the file's content:

...
encode = hashlib.md5(path).hexdigest()
...
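You can see this by hashing the path strings directly; the results depend only on the strings, never on what the files contain, and should reproduce the two differing hashes shown above (a quick illustration, using bytes literals so it also runs on Python 3):

import hashlib

# Hashing the path strings, not the files: two different strings
# give two different MD5s regardless of the files' contents
print(hashlib.md5(b"./downloads/sample_file_1").hexdigest())
print(hashlib.md5(b"./samples/sample_file_1").hexdigest())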

Instead, calculate the MD5 of the file's contents, opening the file in binary mode so the result matches md5sum:

with open(path, "rb") as f:
    encode = hashlib.md5(f.read()).hexdigest()

and this should give you matching output (i.e., the two copies match each other and also match md5sum).
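For reference, here is a minimal corrected version of the question's hash_file_content (a sketch; it still reads the whole file at once, which is addressed next):

import hashlib
import os

def hash_file_content(path):
    # Fail early on a bad path, as the original code intended
    if not os.path.exists(path):
        raise IOError("File does not exist")
    # Hash the file's bytes, not the path string; "rb" avoids
    # newline translation so the result matches md5sum
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()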


Since the files are big, reading one with a single f.read() call is taxing, and it will simply fail once a file's size exceeds your available memory.

So instead, take advantage of the fact that md5 objects expose an update method that computes the hash incrementally over chunks. Define a helper that feeds the file to md5.update block by block, and call it from your code, as mentioned in this answer:

import hashlib

def md5_for_file(filename, block_size=2**20):
    md5 = hashlib.md5()
    with open(filename, "rb") as f:
        while True:
            # Read in 1 MiB chunks so memory use stays constant
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
    # hexdigest() returns the hex string that md5sum prints
    return md5.hexdigest()

and call it in your code:

encode = md5_for_file(path)
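With that in place, the question's main() only needs to swap in the new helper (a sketch reusing the same sample paths; single-argument print() runs on both Python 2 and 3):

def main():
    # Compare the two copies using the chunked helper
    hash1 = md5_for_file("./downloads/sample_file_1")
    hash2 = md5_for_file("./samples/sample_file_1")
    print(hash1)
    print(hash2)
    print(hash1 == hash2)

if __name__ == "__main__":
    main()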
  • The files are huge, some are in the gigabytes ... is f.read() the correct one to use, or just a chunk of the file, let's say 1024 bytes? – cybertextron Feb 12 '15 at 19:58
  • @philippe f.read() would be too taxing. In that case, use the approach mentioned here -> stackoverflow.com/a/1131255/1860929; I have edited the same into my answer. – Anshul Goyal Feb 12 '15 at 20:15