
I have the following code:

import hashlib

def GetSHA256(filename, size=2 ** 10):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        # Read the file in chunks of size * block_size bytes so that
        # arbitrarily large files never have to fit in memory at once.
        for byte_block in iter(lambda: f.read(size * h.block_size), b""):
            h.update(byte_block)
    return h.hexdigest()

I want to choose an optimal chunk size. However, from what I could find, people tend to optimise by hand (e.g. here and here). Is there a better way to do this? Or is there a library that has already thought about this question?

Tom de Geus

1 Answer


Have you looked into io.DEFAULT_BUFFER_SIZE?

Per the docs for open():

buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
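If you want to see what those numbers are on your system, both values are easy to inspect; a minimal sketch (the filename is a placeholder, and st_blksize is only reported on POSIX-style platforms):

import io
import os

# Fallback buffer size used by open() when the device block size is unknown.
print(io.DEFAULT_BUFFER_SIZE)  # typically 8192

# Preferred I/O block size of the filesystem holding this file
# (POSIX only; st_blksize is not available everywhere).
print(os.stat("some_file.bin").st_blksize)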

The default is buffering=-1, so open() is more than likely already reading the file in buffered 8192-byte chunks anyway.
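So instead of tuning a multiplier by hand, one option is to let that default drive the read size. A minimal sketch along the lines of the question's function (the name and default are my choice, not from the question):

import hashlib
import io

def sha256_of_file(filename, chunk_size=io.DEFAULT_BUFFER_SIZE):
    # Reading in the same unit the buffered reader uses means each
    # read() call maps cleanly onto one underlying buffer fill.
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

On Python 3.11+ there is also hashlib.file_digest(), which does the chunked reading for you: hashlib.file_digest(f, 'sha256') on a file opened in binary mode.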

Timus
Terry Spotts