
When we want to compute the hash of a big file in Python with hashlib, we can process the data in chunks of 1024 bytes like this:

import hashlib

m = hashlib.md5()
chunksize = 1024
with open("large.txt", 'rb') as f:
    while True:
        chunk = f.read(chunksize)
        if not chunk:  # an empty read signals end of file
            break
        m.update(chunk)
print(m.hexdigest())

or simply skip the chunking and read the whole file at once, at the cost of holding it all in memory:

import hashlib

sha256 = hashlib.sha256()
with open("large.txt", 'rb') as f:
    sha256.update(f.read())  # reads the entire file into memory at once
print(sha256.hexdigest())

Finding an optimal implementation can be tricky and would require some performance testing and tuning (1024-byte chunks? 4 KB? 64 KB? etc.), as detailed in Hashing file in Python 3? or Getting a hash string for a very large file.

Question: Is there a cross-platform, ready-to-use function to compute an MD5 or SHA256 hash of a big file with Python (so that we don't need to reinvent the wheel or worry about the optimal chunk size, etc.)?

Something like:

import hashlib

# get the result without having to think about chunks, etc.
hashlib.file_sha256('bigfile.txt')
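
Note for future readers: since Python 3.11, the standard library has hashlib.file_digest(), which provides essentially this; it reads the file in chunks internally:

import hashlib

# requires Python 3.11+
with open('bigfile.txt', 'rb') as f:
    digest = hashlib.file_digest(f, 'sha256')
print(digest.hexdigest())
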
Basj
  • @superbrain Copy paste from bad source https://www.kite.com/python/examples/3977/hashlib-get-the-checksum-of-a-large-file indeed! – Basj Nov 07 '20 at 17:00
  • The optimal chunk size would require knowledge of your hardware. If you are really concerned with performance, you could go more low-level and call the operating system tool for calculating md5 sums, for example `md5sum` on Linux systems. – Erik Cederstrand Nov 07 '20 at 17:00
  • @ErikCederstrand I sometimes use Linux and Windows, so I was looking for a general-purpose solution. – Basj Nov 07 '20 at 17:01
  • Is there a limit to where the function is supposed to be? Python's standard library? PyPI? Somewhere else? Because yes, such a function exists, it's on my computer. (Well, it's not optimizing, I just use 1 MB chunks which seemed to work well). – superb rain Nov 07 '20 at 17:10
  • Gets a lot nicer with `while chunk := f.read(chunksize):`, btw. – superb rain Nov 07 '20 at 17:12
  • @superbrain if possible, in the standard library; if not, in PyPI. Feel free to post your implementation as an answer with this nice use of walrus operator, would be interesting for future reference! PS: so sad that one of the first answers on google leads to this wrong answer: https://www.kite.com/python/examples/3977/hashlib-get-the-checksum-of-a-large-file – Basj Nov 07 '20 at 17:18
  • It's just what you're doing, except as a function and using the walrus (see the sketch after these comments). I'm sure that has already been posted. – superb rain Nov 07 '20 at 17:19
  • Some interesting (and possibly better) ways [here](https://stackoverflow.com/q/22058048/13008439). – superb rain Nov 07 '20 at 17:23
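
A sketch of the helper described in the comments, using the walrus operator (Python 3.8+) and 1 MB chunks; the name file_sha256 is made up here:

import hashlib

def file_sha256(filename, chunksize=1024 * 1024):  # 1 MB chunks
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunksize):
            h.update(chunk)
    return h.hexdigest()

print(file_sha256('bigfile.txt'))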

2 Answers


Are you sure you actually need to optimize this? I did some profiling, and on my computer there's not much to gain as long as the chunk size is not ridiculously small:

import os
import timeit

filename = "large.txt"
with open(filename, 'w') as f:
    f.write('x' * 100*1000*1000)  # Create 100 MB file

setup = '''
import hashlib

def md5(filename, chunksize):
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunksize):
            m.update(chunk)
    return m.hexdigest()
'''

for i in range(16):
    chunksize = 32 * 2**i
    print('chunksize:', chunksize)
    print(timeit.Timer(f'md5("{filename}", {chunksize})', setup=setup).repeat(2, 2))

os.remove(filename)

which prints:

chunksize: 32
[1.3256129720248282, 1.2988303459715098]
chunksize: 64
[0.7864588440279476, 0.7887071970035322]
chunksize: 128
[0.5426529520191252, 0.5496777250082232]
chunksize: 256
[0.43311091500800103, 0.43472746800398454]
chunksize: 512
[0.36928231100318953, 0.37598425400210544]
chunksize: 1024
[0.34912850096588954, 0.35173907200805843]
chunksize: 2048
[0.33507052797358483, 0.33372197503922507]
chunksize: 4096
[0.3222631579847075, 0.3201586640207097]
chunksize: 8192
[0.33291386102791876, 0.31049903703387827]
chunksize: 16384
[0.3095061599742621, 0.3061956529854797]
chunksize: 32768
[0.3073280190001242, 0.30928074003895745]
chunksize: 65536
[0.30916607001563534, 0.3033451830269769]
chunksize: 131072
[0.3083479679771699, 0.3039141249610111]
chunksize: 262144
[0.3087183449533768, 0.30319386802148074]
chunksize: 524288
[0.29915712698129937, 0.29429047100711614]
chunksize: 1048576
[0.2932401319849305, 0.28639856696827337]

This suggests that you can just choose a large, but not insane, chunk size, e.g. 1 MB, as in the sketch below.
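
For reference, here is the measured loop with a 1 MB default baked in; a minimal sketch, not a library function:

import hashlib

def md5(filename, chunksize=1024 * 1024):  # 1 MB, per the timings above
    m = hashlib.md5()
    with open(filename, 'rb') as f:
        while chunk := f.read(chunksize):
            m.update(chunk)
    return m.hexdigest()

print(md5("large.txt"))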

Erik Cederstrand
  • Thanks for this performance test! Do you think there is a way, with the standard library or PyPI, to not have to think about this at all and just do `hashlib.sha256_file('test.txt')`? – Basj Nov 07 '20 at 17:26
  • You can just call `m.update()` once, with the entire file content: `m.update(f.read())`. But then you'd have the entire file in-memory, which I assume is why you want to split the read into reasonable chunks. But the helper method is 6 lines of code so I don't really see the problem :-) – Erik Cederstrand Nov 07 '20 at 17:33

I created a package, simple-file-checksum, for this use case. It uses subprocess to call openssl on macOS/Linux and CertUtil on Windows, and extracts only the digest from the output.
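
For reference, the macOS/Linux side of that approach might look roughly like this; this is a sketch of the idea, not the package's actual code (it assumes `openssl` is on the PATH):

import subprocess

def openssl_sha256(path):
    # openssl prints a line like "SHA256(path)= <digest>";
    # keep only the digest after the "= " separator
    out = subprocess.run(
        ["openssl", "dgst", "-sha256", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.strip().rsplit("= ", 1)[-1]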


Simple File Checksum

[source]

Returns the MD5, SHA1, SHA256, SHA384, or SHA512 checksum of a file.

Installation

Run the following to install:

pip3 install simple-file-checksum

Usage

Python:

>>> from simple_file_checksum import get_checksum
>>> get_checksum("tst/file.txt")
'9e107d9d372bb6826bd81d3542a419d6'
>>> get_checksum("tst/file.txt", algorithm="MD5")
'9e107d9d372bb6826bd81d3542a419d6'
>>> get_checksum("tst/file.txt", algorithm="SHA1")
'2fd4e1c67a2d28fced849ee1bb76e7391b93eb12'
>>> get_checksum("tst/file.txt", algorithm="SHA256")
'd7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592'
>>> get_checksum("tst/file.txt", algorithm="SHA384")
'ca737f1014a48f4c0b6dd43cb177b0afd9e5169367544c494011e3317dbf9a509cb1e5dc1e85a941bbee3d7f2afbc9b1'
>>> get_checksum("tst/file.txt", algorithm="SHA512")
'07e547d9586f6a73f73fbac0435ed76951218fb7d0c8d788a309d785436bbb642e93a252a954f23912547d1e8a3b5ed6e1bfd7097821233fa0538f3db854fee6'

Terminal:

$ simple-file-checksum tst/file.txt
9e107d9d372bb6826bd81d3542a419d6
$ simple-file-checksum tst/file.txt -a MD5
9e107d9d372bb6826bd81d3542a419d6
$ simple-file-checksum tst/file.txt -a SHA1
2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
$ simple-file-checksum tst/file.txt -a SHA256
d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592
$ simple-file-checksum tst/file.txt -a SHA384
ca737f1014a48f4c0b6dd43cb177b0afd9e5169367544c494011e3317dbf9a509cb1e5dc1e85a941bbee3d7f2afbc9b1
$ simple-file-checksum tst/file.txt -a SHA512
07e547d9586f6a73f73fbac0435ed76951218fb7d0c8d788a309d785436bbb642e93a252a954f23912547d1e8a3b5ed6e1bfd7097821233fa0538f3db854fee6
Sash Sinha