
I need to get a hash (digest) of a file in Python.

Generally, when processing any file content, it is advised to process it gradually, line by line, due to memory concerns; yet it seems the whole file needs to be loaded in order to obtain its digest.

Currently I'm obtaining the hash this way:

import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
    h.update(data)
    digest = h.hexdigest()
    return digest

Is there any other way to perform this in a more optimized or cleaner manner?

Is there any significant improvement in reading the file gradually, line by line, over reading the whole file at once, when the whole file must still be fed to the hash?

Krzysiek
  • You can optimize this by using an optimal block size and avoiding buffer churning, see also my [answer to a similar question](https://stackoverflow.com/a/44873382/427158). – maxschlepzig Mar 05 '23 at 13:49
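As a sketch of the buffer-reuse idea from the comment above (the helper name, the 64 KiB size, and the use of readinto are illustrative choices, not taken from the linked answer; requires Python 3.8+ for the walrus operator):

import hashlib

def get_hash_buffered(f_path, mode='md5', buf_size=64 * 1024):
    # Hypothetical helper: reuse one preallocated buffer instead of
    # allocating a fresh bytes object for every chunk.
    h = hashlib.new(mode)
    buf = bytearray(buf_size)
    view = memoryview(buf)
    with open(f_path, 'rb') as f:
        while n := f.readinto(buf):   # returns 0 at EOF, ending the loop
            h.update(view[:n])        # hash only the bytes actually read
    return h.hexdigest()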

3 Answers


Of course you can load the data in chunks, so that memory usage drops significantly because you no longer have to load the whole file. You then call hash.update(chunk) for each chunk:

import hashlib
from functools import partial

h = hashlib.new("sha1")
size = 128  # just an example; larger chunks are usually faster

with open("data.txt", "rb") as f:
    for chunk in iter(partial(f.read, size), b''):
        h.update(chunk)

I find this iter trick very neat because it allows you to write much cleaner code. It may look confusing at first, so I'll explain how it works:

  • iter(function, sentinel) calls function repeatedly and yields the values it returns, stopping as soon as one of them is equal to sentinel.
  • partial(f.read, size) returns a callable that acts like f.read with the size argument already bound, so each call reads the next size bytes. This is oversimplified, but still correct in this case. (A packaged version is sketched below.)
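Packaged as a reusable function (a sketch; the name file_digest_chunks and the 64 KiB default size are my own choices):

import hashlib
from functools import partial

def file_digest_chunks(path, algo="sha1", size=64 * 1024):
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        # iter() keeps calling f.read(size) until it returns b'' at EOF
        for chunk in iter(partial(f.read, size), b""):
            h.update(chunk)
    return h.hexdigest()

print(file_digest_chunks("data.txt"))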
ForceBru
  • Keep in mind that this approach, while effective for using the hash within your system, will not be compatible with tools that generate a hash of a file (which you should use anyways if you intend to get a hash for external use) – Jessie Mar 15 '17 at 20:09
  • opening in text mode is bad for hash – Jean-François Fabre Mar 15 '17 at 20:10
  • @Jean-FrançoisFabre, agreed, I just changed it to binary. Thanks for your comment. – ForceBru Mar 15 '17 at 20:11
  • BTW, reading line-by-line (or even 128 bytes at a time) may not be better than reading everything at once. You have gigabytes of RAM. You don't want to spend it all, but you don't want to read too small chunks either. You should measure performance, but I feel that reading chunks of multiple kilobytes should be best. – zvone Mar 15 '17 at 20:12
  • @zvone, probably bigger chunks are better. The number of bytes in my code is just an example. – ForceBru Mar 15 '17 at 20:13
  • @ForceBru Of course... my comment was meant more as addition to your answer than as a complaint to it :) – zvone Mar 15 '17 at 20:26

According to the documentation for hash.update(), you don't need to concern yourself with the block sizes of the different hashing algorithms. However, I'd test that a bit. It seems to check out: MD5's internal block size is 512 bits (64 bytes), and if you change the read size to anything else, the result is the same as reading it all in at once.
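A quick way to check: hashlib exposes the block size on the hash object, reported in bytes (64 bytes == 512 bits for MD5):

import hashlib

print(hashlib.new('md5').block_size)  # 64 bytes, i.e. 512 bits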

import hashlib

def get_hash(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        data = file.read()
    h.update(data)
    digest = h.hexdigest()
    return digest

def get_hash_memory_optimized(f_path, mode='md5'):
    h = hashlib.new(mode)
    with open(f_path, 'rb') as file:
        # read and hash the file in fixed-size blocks instead of all at once
        block = file.read(512)
        while block:
            h.update(block)
            block = file.read(512)

    return h.hexdigest()

digest = get_hash('large_bin_file')
print(digest)

digest = get_hash_memory_optimized('large_bin_file')
print(digest)

> bcf32baa9b05ca3573bf568964f34164
> bcf32baa9b05ca3573bf568964f34164
Eugene K

You get the same result with both snippets:

import hashlib

h = hashlib.new("md5")
with open(filename, "rb") as f:
    for line in f:
        h.update(line)
print(h.hexdigest())

and

h = hashlib.new("md5")
with open(filename, "rb") as f:
    h.update(f.read())

print(h.hexdigest())

A few notes:

  • the first approach works best with big text files, memory-wise. With a binary file there's no such thing as a "line"; iterating still works (the file is split on b'\n' bytes), but a fixed-size "chunk" approach is more regular (not going to paraphrase the other answers)
  • the second approach eats a lot of memory if the file is big
  • in both cases, make sure you open the file in binary mode, or end-of-line conversion could lead to a wrong checksum (external tools would compute a different MD5 than your program)
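As an aside beyond this answer: on Python 3.11+ the standard library can do the chunked reading for you with hashlib.file_digest:

import hashlib

# Python 3.11+: file_digest reads the (binary-mode) file in chunks internally
with open(filename, "rb") as f:
    digest = hashlib.file_digest(f, "md5")
print(digest.hexdigest())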
Jean-François Fabre