35

With a text file, I can write this:

with open(path, 'r') as file:
    for line in file:
        pass  # handle the line

This is equivalent to this:

with open(path, 'r') as file:
    for line in iter(file.readline, ''):
        pass  # handle the line

This idiom is documented in PEP 234 but I have failed to locate a similar idiom for binary files.

With a binary file, I can write this:

with open(path, 'rb') as file:
    while True:
        chunk = file.read(1024 * 64)
        if not chunk:
            break
        pass  # handle the chunk

I have tried the same idiom as with a text file:

def make_read(file, size):
    def read():
        return file.read(size)
    return read

with open(path, 'rb') as file:
    for chunk in iter(make_read(file, 1024 * 64), b''):
        pass  # handle the chunk

Is this the idiomatic way to iterate over a binary file in Python?

Géry Ogam
dawg

5 Answers

40

Try:

chunk_size = 4 * 1024 * 1024  # 4 MiB

with open('large_file.dat','rb') as f:
    for chunk in iter(lambda: f.read(chunk_size), b''):
        handle(chunk)

The two-argument form of iter needs a callable that takes zero arguments:

  • a plain f.read would read the whole file, since the size parameter is missing;
  • f.read(1024) means call a function and pass its return value (data loaded from file) to iter, so iter does not get a function at all;
  • (lambda:f.read(1234)) is a function that takes zero arguments (nothing between lambda and :) and calls f.read(1234).
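The three cases above can be checked without touching the disk. This is only a minimal sketch: io.BytesIO stands in for a binary file, and the sample data and the chunk size of 4 are arbitrary choices made for illustration.

```python
import io

data = b"abcdefghij"  # hypothetical sample payload

# Zero-argument callable: iter() calls it until it returns the sentinel b''.
f = io.BytesIO(data)
chunks = list(iter(lambda: f.read(4), b''))

# A plain f.read is callable too, but with no size it reads everything
# on the first call, so there is only one "chunk".
f = io.BytesIO(data)
whole = list(iter(f.read, b''))

# Passing the *result* of f.read(4) fails: a bytes object is not callable.
f = io.BytesIO(data)
try:
    iter(f.read(4), b'')
    error = None
except TypeError as exc:
    error = exc

print(chunks)
print(whole)
print(type(error).__name__)
```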
Gringo Suave
liori
  • Yeah, the sentinel trick with iter() is really neat! (Although I don't like lambdas, so I would have made a function). – Lennart Regebro Dec 30 '10 at 21:57
  • That works! Thanks. It is hard losing old idioms (Perl) and learning new ones while still being reasonably productive. – dawg Dec 30 '10 at 21:57
  • This works... but it's a bit difficult to read in my opinion. – Jason Baker Dec 30 '10 at 21:58
  • @Lennart Regebro, @Jason Baker: I assumed OP wants to learn why his `iter` call didn't work. Writing an iterator is what I probably also would do in this case, unless working in an interactive prompt. – liori Dec 30 '10 at 22:08
  • @liori - Good point. I missed the fact that the OP was iterating over `f.read`. – Jason Baker Dec 30 '10 at 22:29
  • 16
    `functools.partial(f.read, numBytes)` should work too in place of the `lambda` – Jochen Ritzel Dec 30 '10 at 22:44
  • 5
    The sentinel should be an empty bytestring, `b''`. String literals are Unicode objects in Python 3 or with `from __future__ import unicode_literals` in Python 2. – George V. Reilly Dec 01 '16 at 05:32
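To illustrate the point about the sentinel type, here is a small sketch assuming Python 3 and an in-memory io.BytesIO in place of a real file. With a str sentinel the loop never sees a match, so itertools.islice is used purely to cap the demonstration at five items.

```python
import io
from itertools import islice

f = io.BytesIO(b"abcdef")  # hypothetical sample payload

# b'' == '' is False in Python 3, so a str sentinel never matches and the
# iterator keeps yielding empty bytes objects after end of file.
# islice stops the sketch after 5 items; without it the loop would not end.
wrong = list(islice(iter(lambda: f.read(4), ''), 5))

# With the bytes sentinel the iterator stops at end of file.
f.seek(0)
right = list(iter(lambda: f.read(4), b''))

print(wrong)
print(right)
```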
25

I don't know of any built-in way to do this, but a wrapper function is easy enough to write:

def read_in_chunks(infile, chunk_size=1024*64):
    while True:
        chunk = infile.read(chunk_size)
        if chunk:
            yield chunk
        else:
            # The chunk was empty, which means we're at the end
            # of the file
            return

Then at the interactive prompt:

>>> from chunks import read_in_chunks
>>> infile = open('quicklisp.lisp', 'rb')
>>> for chunk in read_in_chunks(infile):
...     print(chunk)
... 
<contents of quicklisp.lisp in chunks>

Of course, you can easily adapt this to use a with block:

with open('quicklisp.lisp', 'rb') as infile:
    for chunk in read_in_chunks(infile):
        print(chunk)

And you can eliminate the if statement like this:

def read_in_chunks(infile, chunk_size=1024*64):
    chunk = infile.read(chunk_size)
    while chunk:
        yield chunk
        chunk = infile.read(chunk_size)
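A generator like this composes with any consumer that accepts chunks. As a rough sketch, the snippet below feeds the chunks into a SHA-256 hash; the use of hashlib, the in-memory io.BytesIO file, and the small chunk size of 16 are assumptions made for the example, not part of the answer.

```python
import hashlib
import io


def read_in_chunks(infile, chunk_size=1024 * 64):
    # Same shape as the generator above: yield until read() returns
    # an empty (falsy) bytes object.
    chunk = infile.read(chunk_size)
    while chunk:
        yield chunk
        chunk = infile.read(chunk_size)


data = b"some binary payload " * 10  # hypothetical sample data
digest = hashlib.sha256()
for chunk in read_in_chunks(io.BytesIO(data), chunk_size=16):
    digest.update(chunk)  # hash incrementally, one chunk at a time

# The incremental digest matches hashing the whole payload at once.
print(digest.hexdigest() == hashlib.sha256(data).hexdigest())
```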
Jason Baker
  • I had assumed there was some built-in way that I was just overlooking. Since there seems to not be a built-in way, this is easy to read and straightforward. Thanks! – dawg Dec 30 '10 at 22:16
12

The Pythonic way to read a binary file iteratively is using the built-in function iter with two arguments and the standard function functools.partial, as described in the Python library documentation:

iter(object[, sentinel])

Return an iterator object. The first argument is interpreted very differently depending on the presence of the second argument. Without a second argument, object must be a collection object which supports the iteration protocol (the __iter__() method), or it must support the sequence protocol (the __getitem__() method with integer arguments starting at 0). If it does not support either of those protocols, TypeError is raised. If the second argument, sentinel, is given, then object must be a callable object. The iterator created in this case will call object with no arguments for each call to its __next__() method; if the value returned is equal to sentinel, StopIteration will be raised, otherwise the value will be returned.

See also Iterator Types.

One useful application of the second form of iter() is to build a block-reader. For example, reading fixed-width blocks from a binary database file until the end of file is reached:

from functools import partial

with open('mydata.db', 'rb') as f:
    for block in iter(partial(f.read, 64), b''):
        process_block(block)
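One detail worth keeping in mind with this block-reader: if the file size is not a multiple of the block size, the last block is shorter. A minimal sketch, assuming a 150-byte in-memory file via io.BytesIO rather than a real mydata.db:

```python
import io
from functools import partial

f = io.BytesIO(bytes(150))  # hypothetical file: 150 zero bytes

# partial(f.read, 64) is a zero-argument callable equivalent to
# lambda: f.read(64).
lengths = [len(block) for block in iter(partial(f.read, 64), b'')]

print(lengths)
```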
Géry Ogam
6

Nearly 10 years after this question was asked, Python 3.8 introduced the := "walrus" operator, described in PEP 572.

To read a file in chunks idiomatically and expressively (with Python 3.8 or later) you can do:

# A loop that cannot be trivially rewritten using 2-arg iter().
while chunk := file.read(1024 * 64):
    process(chunk)
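For completeness, here is a self-contained sketch of the same loop inside a with block. It requires Python 3.8 or later; the temporary file and its 150,000-byte size are assumptions made only so the example can run on its own.

```python
import os
import tempfile

# Create a hypothetical sample file to read back.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 150_000)
    path = tmp.name

sizes = []
try:
    with open(path, 'rb') as file:
        # read() returns b'' at end of file, which is falsy and ends the loop.
        while chunk := file.read(1024 * 64):
            sizes.append(len(chunk))
finally:
    os.remove(path)

print(sizes)
```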
Géry Ogam
dawg
  • i got while chunk := input_file.read(1024 * 64): ^ SyntaxError: invalid syntax – user1 Feb 10 '21 at 01:24
  • Are you using Python 3.8+? – dawg Feb 22 '21 at 14:31
  • Why can't that loop be trivially rewritten with 2-arg iter? Other answers seem to do exactly that – Osman-pasha Mar 22 '22 at 09:01
  • I would agree with @Osman-pasha that it's untrue, or at least a stretch, that this can't be trivially rewritten with 2-arg `iter()`. Depends what you mean by trivially I suppose. But I certainly agree this is a lot simpler to read - rather than composing a new function (with lambda or partial) then passing that to another function to call as a callback - this simply just calls the function! – Arthur Tacca Nov 17 '22 at 10:57
-2

In Python 3.8+, there is a new assignment expression := - known as the "walrus operator" - that assigns values to variables. See PEP 572 for more details. Thus, to read a file in chunks, you could do:

def read_in_chunks(file_path, chunk_size=1024):
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            yield chunk  # or process the chunk as desired
Chris