464

Is there any simple way of generating (and checking) MD5 checksums of a list of files in Python? (I have a small program I'm working on, and I'd like to confirm the checksums of the files).

Sнаđошƒаӽ
Alexander
  • Why not just use [`md5sum`](http://en.wikipedia.org/wiki/Md5sum)? – kennytm Aug 07 '10 at 19:55
  • Keeping it in Python makes it easier to manage the cross-platform compatibility. – Alexander Aug 07 '10 at 20:00
  • If you want a solution with a progress bar or similar (for very big files), consider this one: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python/40961519#40961519 – Laurent LAPORTE Dec 04 '16 at 17:59
  • @kennytm The link you provided says this in the second paragraph: "The underlying MD5 algorithm is no longer deemed secure" while describing `md5sum`. That is why security-conscious programmers should not use it in my opinion. – Debug255 Feb 12 '18 at 08:54
  • @Debug255 Good and valid point. Both `md5sum` and the technique described in this SO question should be avoided - it's better to use SHA-2 or SHA-3, if possible: https://en.wikipedia.org/wiki/Secure_Hash_Algorithms – Per Lundberg Sep 27 '18 at 08:33
  • @PerLundberg or the newer [`hashlib.blake2b`](https://docs.python.org/3/library/hashlib.html#hashlib.blake2b) which is both faster than md5 and secure. – Boris Verkhovskiy Jan 01 '20 at 22:19
  • @Boris Thanks. Is BLAKE2b/BLAKE2s as widely available cross-platform as the SHA algorithms? (I hadn't heard about them before you mentioned them here) – Per Lundberg Jan 02 '20 at 08:59
  • @PerLundberg modern languages should implement them (I know Python, Go and Rust do). There's a `b2sum` command available on Ubuntu. – Boris Verkhovskiy Jan 02 '20 at 09:14
  • OK, nice. For reference: https://crypto.stackexchange.com/questions/45127/should-i-use-sha256-or-blake2-to-checksum-and-sign-scrypt-headers – Per Lundberg Jan 03 '20 at 11:23
  • Might be worth mentioning there are still valid reasons to use MD5 that are not affected by its brokenness for security purposes (e.g. checking for bit rot in a system that uses baked-in MD5 creation during archival). – Smock Feb 05 '20 at 10:08

9 Answers

649

You can use hashlib.md5()

Note that sometimes you won't be able to fit the whole file in memory. In that case, you'll have to read the file sequentially in chunks of 4096 bytes and feed them to the MD5 object's update() method:

import hashlib
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

Note: hash_md5.hexdigest() will return the hex string representation of the digest. If you just need the packed bytes, use return hash_md5.digest() instead, so you don't have to convert back.
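
To cover the "checking" half of the question, a minimal sketch could compare computed digests against known values -- the expected mapping here is hypothetical, not part of the original answer:

# Hypothetical map of filenames to their known MD5 hex digests
expected = {"file_1.bin": "0cc175b9c0f1b6a831c399e269772661"}

for fname, known in expected.items():
    # md5() is the helper defined above
    status = "OK" if md5(fname) == known else "MISMATCH"
    print(fname, status)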

user2653663
quantumSoup
  • How could I decode the hex string? It differs from the output that `md5sum` returns – alper Nov 17 '21 at 20:09
  • @alper no it doesn't -- sorry to put it so flippantly-sounding, but there is no way that md5 differs for the same input -- if you're reading binary (not line-ending-agnostic) input, then this algorithm is deterministic -- md5's famous problem is that it might _FAIL TO DIFFER_ for two different inputs – rsandwick3 Oct 15 '22 at 04:43
  • @rsandwick3 As I understand it, the MD5 formula may end up generating the same output for two different inputs? – alper Oct 15 '22 at 10:29
  • yes: https://crypto.stackexchange.com/questions/1434/are-there-two-known-strings-which-have-the-same-md5-hash-value – rsandwick3 Oct 18 '22 at 18:59
332

There is a way that's pretty memory inefficient.

single file:

import hashlib
def file_as_bytes(file):
    with file:
        return file.read()

print(hashlib.md5(file_as_bytes(open(full_path, 'rb'))).hexdigest())

list of files:

[(fname, hashlib.md5(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

Recall, though, that MD5 is known to be broken and should not be used for any purpose, since vulnerability analysis can be really tricky and analyzing any possible future use your code might be put to for security issues is impossible. IMHO, it should be flat-out removed from the library so everybody who uses it is forced to update. So, here's what you should do instead:

[(fname, hashlib.sha256(file_as_bytes(open(fname, 'rb'))).digest()) for fname in fnamelst]

If you only want 128 bits worth of digest, you can do .digest()[:16].

This will give you a list of tuples, each tuple containing the name of its file and its hash.

Again I strongly question your use of MD5. You should be at least using SHA1, and given recent flaws discovered in SHA1, probably not even that. Some people think that as long as you're not using MD5 for 'cryptographic' purposes, you're fine. But stuff has a tendency to end up being broader in scope than you initially expect, and your casual vulnerability analysis may prove completely flawed. It's best to just get in the habit of using the right algorithm out of the gate. It's just typing a different bunch of letters is all. It's not that hard.

Here is a way that is more complex, but memory efficient:

import hashlib

def hash_bytestr_iter(bytesiter, hasher, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return hasher.hexdigest() if ashexstr else hasher.digest()

def file_as_blockiter(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)


[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.md5()))
    for fname in fnamelst]

And, again, since MD5 is broken and should not really ever be used anymore:

[(fname, hash_bytestr_iter(file_as_blockiter(open(fname, 'rb')), hashlib.sha256()))
    for fname in fnamelst]

Again, you can put [:16] after the call to hash_bytestr_iter(...) if you only want 128 bits worth of digest.
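
For completeness, a single-file call using the helpers above, returning a hex string, might look like this (a sketch; 'somefile.bin' is a placeholder):

print(hash_bytestr_iter(file_as_blockiter(open('somefile.bin', 'rb')),
                        hashlib.sha256(), ashexstr=True))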

Omnifarious
  • I'm only using MD5 to confirm the file isn't corrupted. I'm not so concerned about it being broken. – Alexander Aug 07 '10 at 20:03
  • @TheLifelessOne: And despite @Omnifarious scary warnings, that is a perfectly good use of MD5. – President James K. Polk Aug 07 '10 at 20:09
  • @GregS, @TheLifelessOne - Yeah, and next thing you know someone finds a way to use this fact about your application to cause a file to be accepted as uncorrupted when it isn't the file you're expecting at all. No, I stand by my scary warnings. I think MD5 should be removed or come with deprecation warnings. – Omnifarious Aug 07 '10 at 20:21
  • While @quantumSoup has a viable answer, I believe this one should be selected as the proper method for retrieving a file's md5 checksum. However, it could be simplified to "hashlib.md5(open(fname, 'r').read()).digest()". You should note that the "file" function was changed to "open" for use with python 2.7+ – Austin S. Aug 11 '12 at 22:42
  • @AustinS.: *nod* Yeah. I fixed it to say `open`. I believe that's worked ever since hashlib was introduced, and possibly has always worked. Old habits die hard. – Omnifarious Aug 13 '12 at 06:24
  • I'd probably use .hexdigest() instead of .digest() - it's easier for humans to read - which is the purpose of OP. – zbstof Sep 25 '12 at 09:33
  • @Zotov: I would remove `hexdigest` from the standard hashlib hash function interface. I feel that it's an unnecessary wart. And I like making even small functions widely applicable. There are many cases in which the hex of the hash is quite unnecessarily verbose and making that the easiest to use version is encouraging people to be verbose when they don't have to be. But yes, in this case, for this specific purpose it is likely the better choice. I would still just use `binascii.hexlify` instead. :-) – Omnifarious Sep 25 '12 at 16:07
  • I used this solution but it incorrectly gave the same hash for two different pdf files. The solution was to open the files by specifying binary mode, that is: [(fname, hashlib.md5(open(fname, **'rb'**).read()).hexdigest()) for fname in fnamelst] This is more related to the open function than md5 but I thought it might be useful to report it given the requirement for cross-platform compatibility stated above (see also: http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files). – BlueCoder Feb 26 '13 at 14:09
  • @BlueCoder: Oh, you're right. I should've done that. I'm so used to Unix where the two are synonymous. I'll fix it now. – Omnifarious Feb 26 '13 at 16:36
  • @Omnifarious Saying "remove MD5 from the Python library" or even just saying "add deprecation warning to Python library" is like saying "Python should not be used, if existing stuff requires MD5, please use something else". Explain security implications in docs, sure, but removal or even just deprecation is an insane suggestion. – hyde Apr 03 '13 at 13:48
  • @hyde: Something has to be done to get people to stop using that stupid algorithm. I've had jobs where they persisted in using it even after I demonstrated that it created security holes (admittedly rather obscure ones) and that SHA had a faster implementation in OpenSSL, which was the library we were using. It's insane. – Omnifarious Apr 03 '13 at 15:58
  • Any way for this to be at most one order of magnitude slower than md5sum on the command line? – Nemo Feb 02 '14 at 16:18
  • For people using the `def hashfile` function above multiple times on the same file handle remember to reset the `afile` pointer when done reading each file. eg. `afile.seek(0)` – Larpon Sep 05 '14 at 13:11
  • Reminder: the known weaknesses for MD5 are collision attacks, and *not* [preimage attacks](http://en.wikipedia.org/wiki/Preimage_attack), so it is suitable for some cryptographic applications but not others. If you don't know the difference you shouldn't be using it, but don't discard it altogether. See http://www.vpnc.org/hash.html. – Jason S Apr 17 '15 at 19:19
  • is it ok to _not close_ opened files in those list comprehensions? – koddo Nov 18 '15 at 22:25
  • Yes, I wanted to ask the same thing. Isn't a close() missing here? – hyperknot Nov 27 '15 at 20:58
  • No, it is not okay. The files will be closed on garbage collection, likely in the end of the enclosing function. If, for example, the number of elements in fnamelist is greater than the limit set by your OS, it will fail. But that is irrelevant to the question asked. We should use SO to get the gist, not copy the snippets blindly. :) – Roman Shapovalov Sep 29 '16 at 14:45
  • @BlueCoder How did it happen that two different pdf files had the same hash, even if opened without `mode=rb`? Shouldn't `rt` simply convert newlines and otherwise be identical to `rb`? (I assume this is python 2, since in python 3 `hashlib.md5` requires `bytes`, and will simply refuse to accept a string,) – max Oct 11 '16 at 04:10
  • @RomanShapovalov - I was relying on the reference counted nature of Python objects. After each element of the list comprehension is evaluated, there are no more references to it. I do agree that's rather tenuous and relying overly much on implementation. :-/ I like the interface for `hashfile` though, it's more flexible because it handles anything that has `read`. – Omnifarious Dec 29 '16 at 07:51
  • @RomanShapovalov - I fixed it so that it no longer has a potential resource leak, even though the current CPython implementation doesn't. I agree that it should avoid leaking even on Jython or future possible implementations of CPython. – Omnifarious Jan 06 '17 at 20:00
  • @JasonS - I can stick my hand in liquid nitrogen briefly and it won't be harmed. That doesn't mean I should do it. There are lots of alternatives to MD5 that are widely available. There is no more reason for anybody to use it than there is for me to stick my hand in liquid nitrogen. – Omnifarious Dec 06 '17 at 08:54
  • Nope. Sorry. Bad analogy. – Jason S Dec 07 '17 at 13:16
  • @JasonS - Can you give a rational reason anybody should use MD5 that's not one of these two: "Well, I think I can get away with it in this circumstance." or "I have to interoperate with something else that uses MD5."? – Omnifarious Dec 07 '17 at 21:27
  • The entirety of life is about "I think I can get away with it in this circumstance" --- or more objectively stated, **risk management**, which applies to *all* cryptographic systems, MD5 and SHA1 included. Read up on the state-of-the-art on [MD5 preimage attacks](https://crypto.stackexchange.com/questions/41860/pre-image-attack-on-md5-hash/41865). I don't put bars on all my windows at home, and I use MD5 when I am doing garden-variety integrity checks where a malicious adversary is not present (e.g. copying files from one PC to another) – Jason S Dec 08 '17 at 17:36
  • https://web.archive.org/web/20150901084550/http://www.vpnc.org/hash.html -- "The difference between a collision attack and either of the two preimage attacks is crucial. At the time of this writing, there are no practical preimage attacks, meaning that if your use of hashes is only susceptible to preimage attacks, even MD5 is just fine because an attacker would have to make 2^128 guesses, which will be infeasible for many decades (if ever)." – Jason S Dec 08 '17 at 17:38
  • @JasonS - And in so doing, you are perpetuating the use and very existence of an algorithm that is broken for a wide variety of other uses. Using a proper algorithm isn't like putting bars on your windows. Using the right algorithm is a matter of typing a few letters differently. There is no good reason to use MD5 at all for anything. It has no quality that recommends it over SHA256 in any reasonable situation. – Omnifarious Dec 08 '17 at 17:40
  • I'm not continuing this discussion, you're just being ideological about your rejection of MD5. – Jason S Dec 08 '17 at 17:42
  • @JasonS - I would argue that you are being ideological in your refusal to reject an algorithm that has perfectly viable replacements that there is no good reason whatsoever to not use. "I learned to type MD5 darn it, and nobody is going to tell me I can't. Those other letters, they're weird and my fingers can't type them!" – Omnifarious Dec 08 '17 at 17:52
  • I just need to correct the same image, thus, using `hashlib.md5(open(full_path, 'rb').read()).hexdigest()` is good enough. Thanks! – Khanh Le Dec 22 '17 at 04:01
  • @LittleZero - Is md5 that much easier to type than sha256? I'm just poking at this, because it's better to just forget the broken algorithm ever existed, no matter how safe it is to use in certain contexts. Retrain yourself to never even think of using the broken algorithm, and then you won't end up using it when it matters. – Omnifarious Dec 22 '17 at 15:41
  • We should release resources: open the file with a *with* statement or write code to close the file. – Rohit Taneja Feb 24 '18 at 19:58
  • @RohitTaneja - Resources are being released. The file object is immediately associated with a `with` statement inside `file_as_blockiter`. – Omnifarious Feb 25 '18 at 00:06
  • @Omnifarious I am talking about the first 3 code snippets. EX `import hashlib [(fname, hashlib.md5(open(fname, 'rb').read()).digest()) for fname in fnamelst]` – Rohit Taneja Feb 25 '18 at 11:37
  • @RohitTaneja - Ahh, the ones I mean as bad examples. :-) Yes, I suppose I should fix that. They aren't supposed to be _that_ kind of bad example. – Omnifarious Feb 25 '18 at 15:42
  • @ChadLowe - That makes no sense. I just tested it, and it works fine on a zero length file. What problem did you have? Or did it just look wrong, and so you had to fix it? There is no reason the iterator has to yield at least once. It will just never call `update`, and that's the exact same result as if you feed `update` a single empty string. – Omnifarious Dec 16 '19 at 07:32
  • You are correct. I'm not sure what I was doing before, but your code works as expected now. Just goes to show, I should always look at my own code for the problem first ;) – Chad Lowe Dec 20 '19 at 17:20
36

I'm clearly not adding anything fundamentally new, but I added this answer before I was up to commenting status, and the code regions make things clearer -- anyway, specifically to answer @Nemo's question from Omnifarious's answer:

I happened to be thinking about checksums a bit (came here looking for suggestions on block sizes, specifically), and have found that this method may be faster than you'd expect. Taking the fastest (but pretty typical) timeit.timeit or /usr/bin/time result from each of several methods of checksumming a file of approx. 11MB:

$ ./sum_methods.py
crc32_mmap(filename) 0.0241742134094
crc32_read(filename) 0.0219960212708
subprocess.check_output(['cksum', filename]) 0.0553209781647
md5sum_mmap(filename) 0.0286180973053
md5sum_read(filename) 0.0311000347137
subprocess.check_output(['md5sum', filename]) 0.0332629680634
$ time md5sum /tmp/test.data.300k
d3fe3d5d4c2460b5daacc30c6efbc77f  /tmp/test.data.300k

real    0m0.043s
user    0m0.032s
sys     0m0.010s
$ stat -c '%s' /tmp/test.data.300k
11890400

So, looks like both Python and /usr/bin/md5sum take about 30ms for an 11MB file. The relevant md5sum function (md5sum_read in the above listing) is pretty similar to Omnifarious's:

import hashlib
def md5sum(filename, blocksize=65536):
    hash = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hash.update(block)
    return hash.hexdigest()

Granted, these are from single runs (the mmap ones are always a smidge faster when at least a few dozen runs are made), and mine's usually got an extra f.read(blocksize) after the buffer is exhausted, but it's reasonably repeatable and shows that md5sum on the command line is not necessarily faster than a Python implementation...
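
Not the exact benchmark script, but a sketch of how numbers like these can be reproduced with timeit, taking the fastest of several single runs as above (the filename matches the test file from the listing):

import timeit

setup = "from __main__ import md5sum"
stmt = "md5sum('/tmp/test.data.300k')"
print(min(timeit.repeat(stmt, setup=setup, repeat=10, number=1)))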

EDIT: Sorry for the long delay, haven't looked at this in some time, but to answer @EdRandall's question, I'll write down an Adler32 implementation. However, I haven't run the benchmarks for it. It's basically the same as the CRC32 would have been: instead of the init, update, and digest calls, everything is a zlib.adler32() call:

import zlib
def adler32sum(filename, blocksize=65536):
    checksum = zlib.adler32(b"")
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            checksum = zlib.adler32(block, checksum)
    return checksum & 0xffffffff

Note that this must start off with the empty byte string, as Adler sums do indeed differ when starting from zero versus their sum for b"", which is 1 -- CRC can start with 0 instead. The AND-ing is needed to make the result a 32-bit unsigned integer, which ensures it returns the same value across Python versions.
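
A quick interactive check of that seed behavior (per zlib's documented semantics: the default starting value is 1):

>>> import zlib
>>> zlib.adler32(b"")                                    # Adler-32 of empty input
1
>>> zlib.adler32(b"data", 1) == zlib.adler32(b"data")    # default seed is 1
True
>>> zlib.adler32(b"data", 0) == zlib.adler32(b"data")    # seeding with 0 differs
False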

gsamaras
rsandwick3
  • Could you possibly add a couple of lines comparing SHA1, and also zlib.adler32 maybe? – Ed Randall Apr 13 '15 at 06:34
  • @EdRandall: adler32 is really not worth bothering with, eg. http://www.leviathansecurity.com/blog/analysis-of-adler32 – MikeW Jan 20 '16 at 17:10
35

In Python 3.8+, you can use an assignment expression (the "walrus operator" :=) along with hashlib like this:

import hashlib
with open("your_filename.txt", "rb") as f:
    file_hash = hashlib.md5()
    while chunk := f.read(8192):
        file_hash.update(chunk)

print(file_hash.digest())
print(file_hash.hexdigest())  # to get a printable str instead of bytes

Consider using hashlib.blake2b instead of md5 (just replace md5 with blake2b in the above snippet). It's cryptographically secure and faster than MD5.
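
The swap really is a single line in the snippet above, e.g.:

    file_hash = hashlib.blake2b()  # instead of hashlib.md5()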

Boris Verkhovskiy
23
import hashlib
import pathlib

hashlib.md5(pathlib.Path('path/to/file').read_bytes()).hexdigest()
johnson
  • Hi! Please add some explanation to your code as to why this is a solution to the problem. Furthermore, this post is pretty old, so you should also add some information as to why your solution adds something that the others have not already addressed. – d_kennetz Apr 24 '19 at 14:17
  • It's another memory inefficient way – Erik Aronesty Aug 21 '19 at 22:44
  • One-line solution. Perfect for a couple of tests! – breakthewall Jun 26 '20 at 08:18
10

In Python 3.11+, there's a new readable and memory-efficient method:

import hashlib
with open(path, "rb") as f:
    digest = hashlib.file_digest(f, "md5")
print(digest.hexdigest())
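
The second argument can be any algorithm name hashlib recognizes, so the same call works for stronger hashes too -- a sketch:

import hashlib
with open(path, "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
print(digest.hexdigest())
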
Daniel T
3

You could use simple-file-checksum¹, which just uses subprocess to call openssl for macOS/Linux and CertUtil for Windows, and extracts only the digest from the output:

Installation:

pip install simple-file-checksum

Usage:

>>> from simple_file_checksum import get_checksum
>>> get_checksum("path/to/file.txt")
'9e107d9d372bb6826bd81d3542a419d6'
>>> get_checksum("path/to/file.txt", algorithm="MD5")
'9e107d9d372bb6826bd81d3542a419d6'

The SHA1, SHA256, SHA384, and SHA512 algorithms are also supported.
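
For the original question's list-of-files case, something along these lines would work (hypothetical filenames, digests elided):

>>> [(f, get_checksum(f)) for f in ["a.txt", "b.txt"]]
[('a.txt', '...'), ('b.txt', '...')]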


¹ Disclosure: I am the author of simple-file-checksum.

Sash Sinha
0

You can make use of the shell here:

from subprocess import check_output

# for Linux (and any other system where the md5sum utility is available)
hash = check_output(args='md5sum imp_file.txt', shell=True).decode().split(' ')[0]

# for macOS (its md5 utility prints "MD5 (file) = <hash>")
hash = check_output(args='md5 imp_file.txt', shell=True).decode().split('=')[1].strip()
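
A slightly safer variant avoids shell=True by passing an argument list (same assumption that md5sum is on the PATH):

from subprocess import check_output

# md5sum prints "<hash>  <filename>"; keep only the first field
hash = check_output(['md5sum', 'imp_file.txt']).decode().split()[0]
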
-1

Change file_path to the path of your file:

import hashlib
def getMd5(file_path):
    m = hashlib.md5()
    with open(file_path, 'rb') as f:
        lines = f.read()  # note: this reads the entire file into memory at once
        m.update(lines)
    md5code = m.hexdigest()
    return md5code
HCHO