
I have the following code:

import hashlib

def GetSHA256(filename, size=2 ** 10):
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        # Read the file in chunks of size * block_size bytes so that
        # arbitrarily large files never have to fit in memory at once.
        for byte_block in iter(lambda: f.read(size * h.block_size), b""):
            h.update(byte_block)
    return h.hexdigest()

I want to choose an optimal chunk size. However, from what I could find, people tend to optimise by hand (e.g. here and here). Is there a better way to do this? Or is there a library that has already thought about this question?

Tom de Geus

1 Answer


Have you looked into io.DEFAULT_BUFFER_SIZE?

Per the docs for open():

buffering is an optional integer used to set the buffering policy. Pass 0 to switch buffering off (only allowed in binary mode), 1 to select line buffering (only usable in text mode), and an integer > 1 to indicate the size in bytes of a fixed-size chunk buffer. When no buffering argument is given, the default buffering policy works as follows:

Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
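If you want to see what those numbers are on your system, both values are easy to inspect; a minimal sketch (the filename is a placeholder, and st_blksize is only reported on POSIX-style platforms):

import io
import os

# Fallback buffer size used by open() when the device block size is unknown.
print(io.DEFAULT_BUFFER_SIZE)  # typically 8192

# Preferred I/O block size of the filesystem holding this file
# (POSIX only; st_blksize is not available everywhere).
print(os.stat("some_file.bin").st_blksize)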

The default is buffering=-1, so open() is more than likely already reading the file in buffered 8192-byte chunks anyway.
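So instead of tuning a multiplier by hand, one option is to let that default drive the read size. A minimal sketch along the lines of the question's function (the name and default are my choice, not from the question):

import hashlib
import io

def sha256_of_file(filename, chunk_size=io.DEFAULT_BUFFER_SIZE):
    # Reading in the same unit the buffered reader uses means each
    # read() call maps cleanly onto one underlying buffer fill.
    h = hashlib.sha256()
    with open(filename, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

On Python 3.11+ there is also hashlib.file_digest(), which does the chunked reading for you: hashlib.file_digest(f, 'sha256') on a file opened in binary mode.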

Timus
Terry Spotts