182

I want Python to read the file to EOF so I can get an appropriate hash, whether it is SHA-1 or MD5. Please help. Here is what I have so far:

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed
user3358300
  • 8
    and what is the problem? – isedev Feb 27 '14 at 02:54
  • 1
    I want it to be able to hash a file. I need it to read until the EOF, whatever the file size may be. – user3358300 Feb 27 '14 at 03:00
  • 4
    that is exactly what `file.read()` does - read the entire file. – isedev Feb 27 '14 at 03:01
  • The documentation for the `read()` method says? – Ignacio Vazquez-Abrams Feb 27 '14 at 03:01
  • You should go through "what is hashing?". – Sharif Mamun Feb 27 '14 at 03:04
  • With the code I have it reads and hashes the file but I verified it and the hash given by my program is wrong. I have read on here in similar cases that it must go through a loop in order to read the whole file but I can't figure out how to make it work for my code. Take this one for example: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python?rq=1 – user3358300 Feb 27 '14 at 03:09
  • @user3358300 you may want to take a look at the code I've shown in my answer below. I think it may help. – Randall Hunt Feb 27 '14 at 04:18
  • How can I get the SHA256 hash of a large file in Python2 that will match the ones provided in ASC files? – user324747 Apr 09 '20 at 23:45
  • https://www.quickprogrammingtips.com/python/how-to-calculate-sha256-hash-of-a-file-in-python.html ???? – user324747 Apr 10 '20 at 02:35
  • SHA1 should not be used anymore because it has been proven to be possible to [generate multiple files with the same SHA1 hash](https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html). SHA256 and SHA3 are considered far more secure. – user9811991 Dec 05 '20 at 02:18
  • By the way: There is a command line tool called `sha256sum`. Just in case somebody just wants to apply it to a single file – Martin Thoma Aug 23 '21 at 11:43

9 Answers

252

TL;DR: use buffers so you don't use tons of memory.

We get to the crux of your problem, I believe, when we consider the memory implications of working with very large files. We don't want this bad boy to churn through 2 GB of RAM for a 2 GB file, so, as pasztorpisti points out, we've got to deal with those bigger files in chunks!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # let's read stuff in 64 KB chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

What we've done is update our hashes of this bad boy in 64 KB chunks as we go, using hashlib's handy-dandy update() method. This way we use a lot less memory than the 2 GB it would take to hash the whole thing at once!

You can test this with:

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

Also, all of this is outlined in the linked question on the right-hand side: [Get MD5 hash of big files in Python](https://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python)


Addendum!

In general when writing Python it helps to get into the habit of following [PEP 8](https://peps.python.org/pep-0008/). For example, in Python variables are typically underscore_separated, not camelCased. But that's just style, and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.
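
As a pure style illustration, here is the question's code with PEP 8 style names (I've also switched to binary mode and Python 3's input(), which the question's version does not use; it still reads the whole file at once, so it is only reasonable for small files):

import hashlib

# snake_case names instead of camelCase; 'rb' avoids newline translation
# changing the digest
input_file = input("Enter the name of the file: ")  # raw_input() on Python 2
with open(input_file, 'rb') as opened_file:
    file_data = opened_file.read()  # reads everything into memory at once

md5_hashed = hashlib.md5(file_data).hexdigest()
sha1_hashed = hashlib.sha1(file_data).hexdigest()

print("File Name: %s" % input_file)
print("MD5: %s" % md5_hashed)
print("SHA1: %s" % sha1_hashed)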
Randall Hunt
  • @ranman Hello, I couldn't get the {0}".format(sha1.hexdigest()) part. Why do we use it instead of just using sha1.hexdigest() ? – Belial Jul 08 '15 at 14:25
  • @Belial What wasn't working? I was mainly just using that to differentiate between the two hashes... – Randall Hunt Sep 11 '15 at 22:47
  • @ranman Everything is working, I just never used this and haven't seen it in the literature. "{0}".format() ... unknown to me. :) – Belial Sep 12 '15 at 11:26
  • 1
    How should I choose `BUF_SIZE`? – Martin Thoma Aug 08 '17 at 15:09
  • @ranman If you had n files, what would be the run time? I'm curious how the buffer size affects it. – TheRealFakeNews Nov 05 '17 at 19:50
  • AFAIK the asymptotic (like BigO style) runtime is not different for N files when using buffers vs when not using buffers. The real runtime may indeed be different though. It can take longer to allocate larger buffers but allocating a buffer also has a fixed constant cost of asking the operating system to do something for you. You'd have to experiment to find something optimal. It might be worth it to have one thread going through and getting the file sizes and setting up an optimal buffer size map as you're iterating through your files. Beware premature optimizations though! – Randall Hunt Nov 06 '17 at 07:18
  • 1
    This doesn't generate the same results as the `shasum` binaries. The other answer listed below (the one using memoryview) is compatible with other hashing tools. – Robert Hafner Jan 31 '19 at 18:53
  • @tedivm Sure? Tested it with Python2/3 and got the same results compared to sha1sum and md5sum – Murmel Sep 19 '19 at 09:07
  • @RandallHunt What about using the hash's block size as buffer size, like [Mitar](https://stackoverflow.com/a/55542529/1885518) does? – Murmel Sep 19 '19 at 09:58
  • The original version of this answer was written in 2014 so it's very possible there's a better way of doing things now. I'd just add that benchmarking is probably the most effective method - the open buffer size, filesystem buffer size, and algorithm buffer size are likely all different and simply reading the block size of the hashing algo may not be the most efficient method. If someone tries it all out I'm happy to update the answer. – Randall Hunt Sep 25 '19 at 09:00
  • SHA1 should not be used anymore because it has been proven to be possible to [generate multiple files with the same SHA1 hash](https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html). SHA256 and SHA3 are considered far more secure. – user9811991 Dec 05 '20 at 02:18
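
As a rough sketch of the benchmarking approach suggested in the comments above, something like the following compares a few candidate buffer sizes; the 'bigfile' name reuses the test file from the answer, and the candidate sizes are placeholders, not recommendations:

import hashlib
import time

def hash_with_bufsize(path, buf_size):
    sha1 = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            data = f.read(buf_size)
            if not data:
                break
            sha1.update(data)
    return sha1.hexdigest()

for buf_size in (4 * 1024, 64 * 1024, 1024 * 1024):
    start = time.perf_counter()
    hash_with_bufsize('bigfile', buf_size)  # 'bigfile' as created with mkfile above
    print("%8d bytes: %.2fs" % (buf_size, time.perf_counter() - start))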
140

If you don't need to support Python versions before 3.11, you can use hashlib.file_digest() like this:

import hashlib

def sha256sum(filename):
    with open(filename, 'rb', buffering=0) as f:
        return hashlib.file_digest(f, 'sha256').hexdigest()

When using a Python 3 version older than 3.11, follow these points for the correct and efficient computation of the hash value of a file:

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(mv):
            h.update(mv[:n])
    return h.hexdigest()

Note that the while loop uses an assignment expression which isn't available in Python versions older than 3.8.


With older Python 3 versions you can use an equivalent variation:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()
maxschlepzig
  • 6
    How do you know what is an optimal block size? – Mitar Mar 02 '18 at 05:45
  • 5
    @Mitar, a lower bound is the maximum of the physical block (traditionally 512 bytes or 4KiB with newer disks) and the systems page size (4KiB on many system, other common choices: 8KiB and 64 KiB). Then you basically do some benchmarking and/or look at published [benchmark results and related work](http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/ioblksize.h;h=ed2f4a9c4d77462f357353eb73ee4306c28b37f1;hb=HEAD#l23) (e.g. check what current rsync/GNU cp/... use). – maxschlepzig Mar 02 '18 at 20:31
  • Would [`resource.getpagesize`](https://docs.python.org/2/library/resource.html#resource.getpagesize) be of any use here, if we wanted to try to optimize it somewhat dynamically? And what about [`mmap`](https://docs.python.org/3/library/mmap.html)? – jpmc26 May 14 '18 at 17:40
  • 1
    @jpmc26, getpagesize() isn't that useful here - common values are 4 KiB or 8 KiB, something in that range, i.e. something much smaller than the 128 KiB - 128 KiB is generally a good choice. mmap doesn't help much in our use case as we sequentially read the complete file from front to back. mmap has advantages when the access pattern is more random-access like, if pages are accessed more than once and/or if it the mmap simplifies read buffer management. – maxschlepzig May 15 '18 at 08:13
  • 1
    Unlike the "top voted" answer this answer actually provides the same results as the `shasum` function. – Robert Hafner Jan 31 '19 at 18:47
  • 1
    @tedivm, that's probably because this answer is using sha256 while Randall's answer uses sha1 and md5, which are the hashing algorithms specified by the OP. Try comparing their results to `sha1sum` and `md5sum`. – Kyle A Mar 09 '19 at 00:52
  • 6
    I benchmarked both the solution of (1) @Randall Hunt and (2) yours (in this order, is important due to file cache) with a file of around 116GB and sha1sum algorithm. Solution 1 was modified in order to use a buffer of 20 * 4096 (PAGE_SIZE) and set buffering parameter to 0. Solution 2 only algorithm was modified (sha256 -> sha1). Result: (1) 3m37.137s (2) 3m30.003s . The native sha1sum in binary mode: 3m31.395s – bioinfornatics Jul 19 '19 at 09:55
  • This might be a good solution for specific use cases (equally sized files, time to do benchmarking), but I miss a note about `open()` already using buffering on its own which might be the best option for a general purpose implementation. See [Mitar's answer](https://stackoverflow.com/a/55542529/1885518) for more – Murmel Sep 19 '19 at 09:53
  • 3
    @Murmel what do you mean with 'equally sized files'? This answer is a general purpose solution. If you call `open()` with `buffering=0` it doesn't do any buffering. Mitar's answer implements buffer churning. – maxschlepzig Sep 19 '19 at 17:02
  • 2
    To clarify: The only reason you're using `memoryview` is the `[:n]`, right? Btw since Python 3.8, maybe `while n := f.readinto(mv):` would be clearer. – Kelly Bundy Apr 15 '22 at 00:48
  • 1
    @KellyBundy - yes, to make creating an object that gives access to the first n bytes cheap. Creating a slice from bytearry `b` via ` b[:n]` would yield a memory copy of n bytes. Good point regarding assignment expressions. I created the answer before they were proposed/available, but I agree that they simplify the code and thus updated my answer. – maxschlepzig Apr 15 '22 at 09:27
  • Alternatively we could read into `b` directly and then `h.update(b[:n] if n < N else b)`, so only the last chunk should get sliced. I tried that (with a 128 MB file), thinking the extra `memoryview` "layer" might slow it down slightly, but couldn't reliably measure any difference. – Kelly Bundy Apr 15 '22 at 13:00
  • @KellyBundy memoryview is basically just a pointer and a size integer, so it's a pretty low overhead proxy object. Depending on the source, `readinto()` might return less bytes than requested even in non-EOF situations. – maxschlepzig Apr 15 '22 at 13:24
  • Yes, that's what I meant with "should". *Ideally* only the last, and also the few times I checked, it always was only the last. – Kelly Bundy Apr 15 '22 at 13:47
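
For reference, a sketch of the variation Kelly Bundy describes in the comments just above (reading into the bytearray directly and slicing only the short reads); the function name and default buffer size are carried over from the answer, the rest is my paraphrase, not code from the comments:

import hashlib

def sha256sum(filename, bufsize=128 * 1024):
    h = hashlib.sha256()
    b = bytearray(bufsize)
    with open(filename, 'rb', buffering=0) as f:
        while n := f.readinto(b):
            # slice (i.e. copy) only when the read came up short, which is
            # usually just the final chunk; short reads before EOF are still
            # handled correctly
            h.update(b[:n] if n < bufsize else b)
    return h.hexdigest()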
42

I would propose simply:

import hashlib

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All the other answers here seem to complicate things too much. Python is already buffering when reading (in an ideal manner, or you configure that buffering yourself if you have more information about the underlying storage), so it is better to read in chunks of the size the hash function finds ideal, which makes it faster, or at least less CPU intensive, to compute the hash. So instead of disabling buffering and trying to emulate it yourself, use Python's buffering and control what you should be controlling: what the consumer of your data finds ideal, the hash block size.
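
If you do want to configure that buffering yourself (the parenthetical above), here is a sketch of the same function with an explicit buffer size passed to open(); the read_buffer_size parameter and its 1 MiB default are only an example, not part of the original answer:

import hashlib

def get_digest(file_path, read_buffer_size=1024 * 1024):
    h = hashlib.sha256()

    # open() sets up a BufferedReader with the given buffer size; the loop
    # below still reads in hash-block-sized chunks from that buffer.
    with open(file_path, 'rb', buffering=read_buffer_size) as file:
        while True:
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()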

Mitar
  • Perfect answer, but it would be nice, if you would back your statements with the related doc: [Python3 - open()](https://docs.python.org/3/library/functions.html#open) and [Python2 - open()](https://docs.python.org/2/library/functions.html#open). Even mind the diff between both, Python3's approach is more sophisticated. Nevertheless, I really appreciated the consumer-centric perspective! – Murmel Sep 19 '19 at 09:28
  • 2
    `hash.block_size` is documented just as the 'internal block size of the hash algorithm'. Hashlib **doesn't** find it _ideal_. Nothing in the package documentation suggests that `update()` prefers `hash.block_size` sized input. It doesn't use less CPU if you call it like that. Your `file.read()` call leads to many unnecessary object creations and superfluous copies from the file buffer to your new chunk bytes object. – maxschlepzig Sep 19 '19 at 17:15
  • Hashes update their state in `block_size` chunks. If you are not providing them in those chunks, they have to buffer and wait for enough data to appear, or split given data into chunks internally. So, you can just handle that on the outside and then you simplify what happens internally. I find this ideal. See for example: https://stackoverflow.com/a/51335622/252025 – Mitar Sep 19 '19 at 21:04
  • 3
    The `block_size` is much smaller than any useful read size. Also, any useful block and read sizes are powers of two. Thus, the read size is divisible by the block size for all reads except possibly the last one. For example, the sha256 block size is 64 bytes. That means that `update()` is able to directly process the input without any buffering up to any multiple of `block_size`. Thus, only if the last read isn't divisible by the block size it has to buffer up to 63 bytes, once. Hence, your last comment is incorrect and doesn't support the claims you are making in your answer. – maxschlepzig Nov 05 '19 at 20:43
  • The point is that one does not have to optimize buffering because it is already done by Python when reading. So you just have to decide on the amount of looping you want to do when hashing over that existing buffer. – Mitar Nov 06 '19 at 04:44
  • 5
    This solution does not live up to a simple benchmark! On my 1Gb file, it is more than twice as slow (5.38s) as Randall Hunt's answer (2.18s), which is itself very slightly slower than maxschlepzig's answer (2.13s). – Gaëtan de Menten Dec 03 '21 at 10:42
10

Here is a Python 3, POSIX-only solution (not Windows!) that uses mmap to map the file into memory.

import hashlib
import mmap

def sha256sum(filename):
    h  = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as mm:
            h.update(mm)
    return h.hexdigest()
  • 1
    Naive question ... what is the advantage of using `mmap` in this scenario? – Jonathan B. Sep 28 '20 at 17:42
  • 1
    @JonathanB. most methods needlessly create `bytes` objects in memory, and call `read` too many or too little times. This will map the file directly into the virtual memory, and hash it from there - the operating system can map the file contents directly from the buffer cache into the reading process. This means this could be faster by a significant factor over [this one](https://stackoverflow.com/a/22058673/918959) – Antti Haapala -- Слава Україні Sep 28 '20 at 18:15
  • @JonathanB. I did the test and the difference is not that significant in *this* case, we're talking about ~15 % over the naive method. – Antti Haapala -- Слава Україні Sep 28 '20 at 18:26
  • 4
    I benchmarked this vs the read chunk by chunk method. This method took 3GB memory for hashing a 3GB file while maxschlepzig's answer took 12MB. They both roughly took the same amount of time on my Ubuntu box. – Seperman Mar 17 '21 at 18:40
  • @Seperman you're measuring the RAM usage incorrectly. The memory is still available, the pages are mapped from the buffer cache. – Antti Haapala -- Слава Україні Mar 17 '21 at 19:09
  • @AnttiHaapala That makes sense. How do you recommend I measure the RAM usage of the process on Linux to see the mmap usage vs physical memory usage? – Seperman Mar 17 '21 at 22:39
  • For example when I look at Htop, these are some numbers I see: VIRT: 2884M, RES: 2122M. From my understanding RES is the physical RAM that is used. – Seperman Mar 17 '21 at 23:42
  • @Seperman well yes, that is more appropriate number than VIRT - but I added `os.system("free")` in between several points there and the "available" memory doesn't decrease. – Antti Haapala -- Слава Україні Mar 18 '21 at 04:34
  • 1
    FWIW, with Python >= 3.8 one can add `mm.madvise(mmap.MADV_SEQUENTIAL)` in order to reduce buffer cache pressure somewhat. – maxschlepzig Jul 09 '21 at 20:08
  • 1
    FWIW, using "access=mmap.ACCESS_READ" instead of "prot=mmap.PROT_READ" makes this work on Windows (but it is slightly slower than simply reading in chunks) – Gaëtan de Menten Dec 03 '21 at 11:08
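
Putting the last two comments together, a sketch of a more portable variant: access=mmap.ACCESS_READ so it also works on Windows, plus the optional madvise hint where the platform and Python version (3.8+) support it:

import hashlib
import mmap

def sha256sum(filename):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # MADV_SEQUENTIAL (and madvise itself) only exist on some
            # platforms and on Python >= 3.8
            if hasattr(mmap, 'MADV_SEQUENTIAL'):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            h.update(mm)
    return h.hexdigest()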
5

I have programmed a module which is able to hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")
1cedsoda
  • 1
    Is it cross-platform (Linux + Win)? Is it working with Python3? Also is it still maintained? – Basj Nov 07 '20 at 17:28
  • Yes it is cross platform and will still work. Also the other stuff in the package works fine. But I will no longer maintain this package of personal experiments, because it was just a learning for me as a developer. – 1cedsoda Nov 14 '20 at 22:22
  • FWIW, [this fileChecksum() function](https://github.com/1cedsoda/py_essentials/blob/9b8295590fe2a7879097b6b1bcfbe71f250ae8d4/py_essentials/hashing.py#L10-L35) is very unpythonic, it duplicates the checking of supported hash algorithms that is done by hashlib, implements buffer churning (of 64 KiB buffers), contains a conditional print statement, eats exceptions and simply returns `"ERROR"` when the file can't be opened due to a permission error. – maxschlepzig Sep 15 '22 at 11:15
3

You do not need to define a function with 5-20 lines of code to do this! Save your time by using the pathlib and hashlib libraries; also, py_essentials is another solution, but third-parties are *****.

from pathlib import Path
import hashlib

filepath = '/path/to/file'
filebytes = Path(filepath).read_bytes()

filehash_sha1 = hashlib.sha1(filebytes).hexdigest()
filehash_md5 = hashlib.md5(filebytes).hexdigest()

print(f'MD5: {filehash_md5}')
print(f'SHA1: {filehash_sha1}')

I used a few variables here to show the steps; you know how to avoid them.

What do you think about the below function?

from pathlib import Path
import hashlib


def compute_filehash(filepath: str, hashtype: str) -> str:
    """Computes the requested hash for the given file.

    Args:
        filepath: The path to the file to compute the hash for.
        hashtype: The hash type to compute.

          Available hash types:
            md5, sha1, sha224, sha256, sha384, sha512, sha3_224,
            sha3_256, sha3_384, sha3_512, shake_128, shake_256

    Returns:
        A string that represents the hash.
    
    Raises:
        ValueError: If the hash type is not supported.
    """
    if hashtype not in ['md5', 'sha1', 'sha224', 'sha256', 'sha384',
                        'sha512', 'sha3_224', 'sha3_256', 'sha3_384',
                        'sha3_512', 'shake_128', 'shake_256']:
        raise ValueError(f'Hash type {hashtype} is not supported.')
    
    return getattr(hashlib, hashtype)(
        Path(filepath).read_bytes()).hexdigest()
  • 4
    This reads the complete file into memory for computing the hash - which is ok for very small files but quite wasteful for others. If you want to compute the hash of a 1 GB file then you need > 1 GB of RAM for just computing the hash. Of course this doesn't scale. Also, you present writing a 5-20 line helper function as disadvantage but then post an example function that consists of 7 lines of code and occupies 24 lines in total ... – maxschlepzig Apr 15 '22 at 09:35
  • Also, a more idiomatic way to deal with different hash types is to just call `hashlib.new(hashtype)` instead of `getattr(hashlib, hashtype)`. That package function already does proper value checking (e.g. `ValueError: unsupported hash type xyz`) such that you don't have to re-implement it. – maxschlepzig Apr 15 '22 at 09:47
  • @maxschlepzig You mentioned two good things about both the performance and the `hashlib.new`, thanks! But do you suggest a better way to handle this situation? Any tool or function?! – Artin Mohammadi Apr 20 '22 at 03:37
  • Well, I posted an [answer](https://stackoverflow.com/a/44873382/427158) that demonstrates how to hash large (or small) files while only using constant memory. – maxschlepzig Apr 20 '22 at 20:15
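
For reference, a sketch of the hashlib.new() variant suggested in the comments above; it keeps the answer's read-everything-at-once approach, so the same memory caveat applies:

import hashlib
from pathlib import Path

def compute_filehash(filepath: str, hashtype: str) -> str:
    # hashlib.new() raises ValueError for unsupported hash types on its own,
    # so no hand-maintained list of algorithm names is needed
    return hashlib.new(hashtype, Path(filepath).read_bytes()).hexdigest()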
3

Starting with Python 3.11, you can use hashlib.file_digest(), which takes responsibility for reading the file:

import hashlib

with open(inputFile, "rb") as f:
    digest = hashlib.file_digest(f, "sha256")
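
The returned digest object behaves like any other hashlib hash object, so for example:

print(digest.hexdigest())  # the usual hex string, e.g. to compare with sha256sum output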
greatvovan
  • 1
    This uses the exactly same algorithm as given in [the answer by maxschlepzig](https://stackoverflow.com/a/44873382/984421), so presumably the performance will be the same. – ekhumoro Feb 06 '23 at 22:17
  • FTR, direct link to Python's [file_digest implementation](https://github.com/python/cpython/blob/66aa78cbe604a7c5731f074b869f92174a8e3b64/Lib/hashlib.py#L228-L238). – maxschlepzig Mar 05 '23 at 13:27
  • digest = hashlib.file_digest(f, "sha256") AttributeError: module 'hashlib' has no attribute 'file_digest' – Boris Ivanov Apr 04 '23 at 13:54
  • Check that your Python version is not lower than said in the answer. – greatvovan Apr 05 '23 at 04:54
1

FWIW, I prefer this version, which has the same memory and performance characteristics as maxschlepzig's answer but is more readable IMO:

import hashlib

def sha256sum(filename, bufsize=128 * 1024):
    h = hashlib.sha256()
    buffer = bytearray(bufsize)
    # using a memoryview so that we can slice the buffer without copying it
    buffer_view = memoryview(buffer)
    with open(filename, 'rb', buffering=0) as f:
        while True:
            n = f.readinto(buffer_view)
            if not n:
                break
            h.update(buffer_view[:n])
    return h.hexdigest()
-3
import hashlib

# note: this hashes the text the user types in, not the contents of a file
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()

# write the hex digest to a file...
with open("encrypted.txt", "w") as e:
    print(h2, file=e)

# ...then read it back and print it
with open("encrypted.txt", "r") as e:
    p = e.readline().strip()
    print(p)
Ome Mishra