3

I am trying to generate a hash for a given file. In this case the hash function reached a binary file (a .tgz) and raised an error. Is there a way I can read a binary file and generate an MD5 hash of it?

The Error I am receiving is:

buffer = buffer.decode('UTF-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbc in position 10: invalid start byte

The source code is:

import hashlib

def HashFile(filename, readBlockSize = 4096):
    hash = hashlib.md5()

    with open(filename, 'rb') as fileHandle:

        while True:
            buffer = fileHandle.read(readBlockSize)

            if not buffer:
                break

            buffer = buffer.decode('UTF-8')                
            hash.update(hashlib.md5(buffer).hexdigest())

    return

I am using Python 3.7 on Linux.

Brad Solomon
Swatcat

2 Answers

3

There are a couple of things you can tweak here.

You don't need to decode the bytes returned by .read(), because md5() is expecting bytes in the first place, not str:

>>> import hashlib
>>> h = hashlib.md5(open('dump.rdb', 'rb').read()).hexdigest()
>>> h
'9a7bf9d3fd725e8b26eee3c31025b18e'

This means you can remove the line buffer = buffer.decode('UTF-8') from your function.
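A quick demonstration that `md5()` takes bytes as-is (including bytes that are not valid UTF-8, like those in a .tgz), while a `str` is rejected:

```python
import hashlib

# md5() accepts bytes directly -- no decoding needed, even for binary data.
data = b"\xbc\x00some binary bytes"  # 0xbc is not valid UTF-8, but md5 doesn't care
print(hashlib.md5(data).hexdigest())

# Passing a str instead raises TypeError ("Strings must be encoded before hashing"):
try:
    hashlib.md5("some text")
except TypeError as exc:
    print(exc)
```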

You'll also need to return hash if you want to use the results of the function.

Lastly, you need to pass the raw block of bytes to .update(), not its hex digest (which is a str); see the docs' example.

Putting it all together:

def hash_file(filename: str, blocksize: int = 4096) -> str:
    hsh = hashlib.md5()
    with open(filename, "rb") as f:
        while True:
            buf = f.read(blocksize)
            if not buf:
                break
            hsh.update(buf)
    return hsh.hexdigest()

(The above is an example using a Redis .rdb dump binary file.)
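As a sanity check (using a throwaway temp file filled with arbitrary bytes, rather than a real .rdb dump), hashing in blocks produces the same digest as hashing the whole file at once:

```python
import hashlib
import os
import tempfile

# Write some arbitrary binary data to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 100)
    path = tmp.name

# Hash it in 4096-byte blocks, exactly as hash_file() above does.
hsh = hashlib.md5()
with open(path, "rb") as f:
    while True:
        buf = f.read(4096)
        if not buf:
            break
        hsh.update(buf)

# The chunked digest matches hashing the file in one shot.
with open(path, "rb") as f:
    assert hsh.hexdigest() == hashlib.md5(f.read()).hexdigest()

os.remove(path)
```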

Brad Solomon
  • I am trying to read the file (as I could potentially be hashing very large files) in 4k blocks. Removing the decode I still get 'TypeError: Unicode-objects must be encoded before hashing' – Swatcat Apr 10 '19 at 14:22
  • 1
    That's because you're also calling `.hexdigest()`, which produces a `str`. I'll update my answer. – Brad Solomon Apr 10 '19 at 14:23
  • Thank you! That has solved the issue. It looks like I was using a 2.7 example. – Swatcat Apr 10 '19 at 14:35
1

This is a Pythonic way to hash a binary file without slurping it all into memory:

import hashlib
import io
import os


def hash_file(
        fpath: os.PathLike,
        digester_factory=hashlib.md5,
        chunk_size=io.DEFAULT_BUFFER_SIZE
) -> bytes: 
    digester = digester_factory()
    with open(fpath, "rb") as file:
        for chunk in iter(lambda: file.read(chunk_size), b''):
            digester.update(chunk)
    return digester.digest()

> hash_file("some/file.bin").hex()
b8e2d24ea2d0c722353e65f930153f85
> hash_file("some/file.bin").hex(' ', 2)
b8e2 d24e a2d0 c722 353e 65f9 3015 3f85

Some notes:

  • use iter(callable, sentinel) to chunk the input file (and avoid an ugly while loop).

  • ask the interpreter for the preferred buffer size via io.DEFAULT_BUFFER_SIZE.

  • support multiple hashing functions through a factory pattern.

  • prefer returning digest() -> bytes, which supports e.g. bytes.hex(sep, bytes_per_sep) (Python 3.8+), over hexdigest() -> str; the grouped hex form can help humans compare hashes (different inputs almost always produce entirely different hashes).

  • TODO: refactor to accept binary streams (type hint: typing.BinaryIO) instead of file paths, so it can also read STDIN, like:

    > with open(fpath, "rb") as file: hash_stream(file)
    > hash_stream(sys.stdin.buffer)
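A sketch of that refactoring (hash_stream is the hypothetical name from the TODO; the chunking idiom is unchanged):

```python
import hashlib
import io
import typing


def hash_stream(
        stream: typing.BinaryIO,
        digester_factory=hashlib.md5,
        chunk_size=io.DEFAULT_BUFFER_SIZE,
) -> bytes:
    # Same iter(callable, sentinel) chunking as hash_file, but over an
    # already-open binary stream, so sys.stdin.buffer also works.
    digester = digester_factory()
    for chunk in iter(lambda: stream.read(chunk_size), b''):
        digester.update(chunk)
    return digester.digest()
```

With this in place, hash_file reduces to opening the path and delegating to hash_stream.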
    
ankostis