
I am comparing two qcow2 image files at two different locations to see whether they differ: /opt/images/file.qcow2 and /mnt/images/file.qcow2

When I run

md5sum /opt/images/file.qcow2
md5sum /mnt/images/file.qcow2

the checksums of both files are the same.

But when I try to compute the md5sum using the following piece of code:

import os
import hashlib

def isImageLatest(file1, file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = hashlib.md5(file1).hexdigest()
        md5File2 = hashlib.md5(file2).hexdigest()
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    if md5File1 == md5File2:
        return True
    else:
        return False

It says the checksums are not the same.

UPDATE: The file can be as large as 8 GB.

Vikram Ranabhatt
  • You are hashing the name of the file, not the content ... – Cyrbil Jul 04 '16 at 09:46
  • What Cyrbil says. More exactly, you're passing `file1` to `md5()`, which turns it into a string using `str()`. Even if it was an open file, that would yield a result like `"<_io.TextIOWrapper name='/dev/null' mode='r' encoding='UTF-8'>"`, and you would hash that. – spectras Jul 04 '16 at 09:47
  • I think he only has the path of the file (or else `isfile()` would have cried a bit), not a file descriptor... – Cyrbil Jul 04 '16 at 09:53
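A quick demonstration of what the comments describe (a sketch using the paths from the question; on Python 2, `md5()` happily hashes the path string, while Python 3 rejects `str` input with a TypeError):

import hashlib

# Two different paths to byte-identical files still give two different
# digests, because only the path text is being hashed.
print(hashlib.md5(b'/opt/images/file.qcow2').hexdigest())
print(hashlib.md5(b'/mnt/images/file.qcow2').hexdigest())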

3 Answers


You are hashing the path of the file, not the content ...

hashlib.md5(file1).hexdigest()  # file1 == '/path/to/file.ext'

To hash the content:

import os
import hashlib

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        # Stream the file in 16 KiB chunks so the whole thing never
        # sits in memory; iter() with a b"" sentinel stops at EOF.
        for chunk in iter(lambda: f.read(16384), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = md5(file1)
        md5File2 = md5(file2)
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    if md5File1 == md5File2:
        return True
    else:
        return False

Side note: you probably want to use hashlib.sha1() (paired with Unix's sha1sum) instead of MD5, which is broken and deprecated...
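Swapping the digest in the chunked reader above is a one-line change; a minimal sketch:

def sha1(fname):
    # Identical streaming pattern; only the hash constructor changes.
    hash_sha1 = hashlib.sha1()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(16384), b""):
            hash_sha1.update(chunk)
    return hash_sha1.hexdigest()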

Edit: benchmark with various buffer sizes and MD5 vs. SHA-1, using a 100 MB random file on a crappy server (Atom N2800 @ 1.86 GHz):

┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Algorithm ┃  Buffer ┃    Time (s)   ┃
┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│    md5sum │     --- │ 0.387         │
│       MD5 │     2⁶  │ 21.5670549870 │
│       MD5 │     2⁸  │ 6.64844799042 │
│       MD5 │     2¹⁰ │ 3.12886619568 │
│       MD5 │     2¹² │ 1.82865810394 │
│       MD5 │     2¹⁴ │ 1.27349495888 │
│       MD5 │   128¹  │ 11.5235209465 │
│       MD5 │   128²  │ 1.27280807495 │
│       MD5 │   128³  │ 1.16839885712 │
│   sha1sum │    ---  │ 1.013         │
│      SHA1 │     2⁶  │ 23.4520659447 │
│      SHA1 │     2⁸  │ 7.75686216354 │
│      SHA1 │     2¹⁰ │ 3.82775402069 │
│      SHA1 │     2¹² │ 2.52755594254 │
│      SHA1 │     2¹⁴ │ 1.93437695503 │
│      SHA1 │   128¹  │ 12.9430441856 │
│      SHA1 │   128²  │ 1.93382811546 │
│      SHA1 │   128³  │ 1.81412386894 │
└───────────┴─────────┴───────────────┘

So md5sum is faster than sha1sum, and Python's implementations show the same pattern. A bigger buffer improves performance, but only up to a point; 16384 seems a good trade-off (efficient without being too big).
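For reference, a rough sketch of how such a benchmark could be reproduced (the test file name is a placeholder; timings will vary by machine):

import time
import hashlib

def hash_file(fname, algo, bufsize):
    # Stream the file through the chosen digest with the given buffer size.
    h = hashlib.new(algo)
    with open(fname, 'rb') as f:
        for chunk in iter(lambda: f.read(bufsize), b''):
            h.update(chunk)
    return h.hexdigest()

for algo in ('md5', 'sha1'):
    for bufsize in (2**6, 2**10, 2**14, 128**2, 128**3):
        start = time.time()
        hash_file('random_100mb.bin', algo, bufsize)  # hypothetical test file
        print('{} buffer={}: {:.2f}s'.format(algo, bufsize, time.time() - start))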

Cyrbil
  • Sure, it's probably not wise these days to use MD5 in a security context, but it's still fine to use it to find duplicate files, assuming the files you're checking haven't been manipulated by someone with malicious intent. – PM 2Ring Jul 04 '16 at 10:41
  • For a 3.8 GB file it has taken 10 minutes so far, and it's still going... Is there any faster way to calculate this? – Vikram Ranabhatt Jul 04 '16 at 12:58
  • It is already pretty optimized; I used `8192` as the buffer size because it's a multiple of `128` (the default md5 block size). I will try different solutions and come back to you. – Cyrbil Jul 04 '16 at 13:04

Try this:

import os
from hashlib import md5

def md5File(filename):
    hasher = md5()
    blockSize = 16 * 1024 * 1024  # read in 16 MiB chunks

    with open(filename, 'rb') as f:
        while True:
            fileBuffer = f.read(blockSize)
            if not fileBuffer:
                break

            hasher.update(fileBuffer)

    return hasher.hexdigest()

def isImageLatest(file1,file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = md5File(file1)
        md5File2 = md5File(file2)
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    return md5File1 == md5File2

When you just do `hashlib.md5(file1).hexdigest()`, you're literally MD5'ing the name of the file. You actually want to MD5 the content, which requires opening and reading the file. The method posted above hashes a large file without reading the whole thing into memory.
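Hypothetical usage, with the paths from the question:

if isImageLatest('/opt/images/file.qcow2', '/mnt/images/file.qcow2'):
    print('Checksums match')
else:
    print('Checksums differ')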

Will

How about using the code below:

import os
import hashlib

def isImageLatest(file1, file2):
    print('Checking md5sum of {} {}'.format(file1, file2))

    if os.path.isfile(file1) and os.path.isfile(file2):
        md5File1 = hashlib.md5(open(file1, "rb").read()).hexdigest()  # whole file read into memory
        md5File2 = hashlib.md5(open(file2, "rb").read()).hexdigest()
        print('md5sum of {} is {}'.format(file1, md5File1))
        print('md5sum of {} is {}'.format(file2, md5File2))
    else:
        print('Either {} or {} File not found'.format(file1,file2))
        return False

    if md5File1 == md5File2:
        return True
    else:
        return False

Please note that this is fine for smaller files, since `.read()` loads the entire file into memory. If the file is large, it is better to read it chunk by chunk, as in the examples above. In that case, you could use the following code:

import time
import hashlib

with open("Some_Very_Large_File", "rb") as f:
    hasher = hashlib.md5()
    a = time.time()
    while True:
        data = f.read(3 * 1024)  # 3 KiB per read
        if not data:
            break
        hasher.update(data)
    print(hasher.hexdigest())
    b = time.time()
    print('Done hashing in', b - a, 'seconds')

Following are the benchmarks I observed:

  • 3.26 GB media file: hash calculated in 11.26 sec.
  • 4.8 GB file: hash calculated in 16.47 sec.
  • 10.8 GB file: hash calculated in 102.36 sec.

Please try the code and do let me know.

Mahadeva
  • I will try and let you know – Vikram Ranabhatt Jul 04 '16 at 11:59
  • Thanks @Sarvagya Pant. If the file size ranges from 3 GB to 9 GB, will this work? – Vikram Ranabhatt Jul 04 '16 at 12:59
  • Well, if the file is very large, it is good to consume the data chunk by chunk (3 MB or more per chunk) and call `update` on the hash object for each chunk. Once all of the content is consumed, you can get the hash using `hexdigest`. – Mahadeva Jul 04 '16 at 13:19
  • If the file is small, the code above works fine. The only drawback is that `.read()` loads the entire content into memory, so it is not recommended for large files. – Mahadeva Jul 04 '16 at 13:24
  • Please check the code above. I have added the code and benchmark that I have run on my machine. – Mahadeva Jul 04 '16 at 13:51