One of the answers to this question says that the following is a good way to read a large binary file without reading the whole thing into memory first:

with open(image_filename, 'rb') as content:
    for line in content:
        ...  # do anything you want

I thought the whole point of specifying `'rb'` was that line endings are ignored, so how can `for line in content` work?

Is this the most "Pythonic" way to read a large binary file or is there a better way?

Startec
  • I just posted your question as a comment below the answer in that question. That seems better than asking a new question. – Barmar Jul 25 '15 at 23:44
  • @Barmar Ah thanks, what should I do with this question? – Startec Jul 25 '15 at 23:45
  • Well, it's too late to delete it, since someone answered. – Barmar Jul 25 '15 at 23:45
  • Possibly a [duplicate](http://stackoverflow.com/questions/4566498/python-file-iterator-over-a-binary-file-with-newer-idiom) – Pynchia Jul 25 '15 at 23:47
  • Well all the answers are helpful, I can't accept an answer for 4 more minutes though, my apologies if it should have been a comment. – Startec Jul 25 '15 at 23:50
  • What does `line` contain? A string, composed of the bytes read from the file up to a specific character (`\n`), converted using a given encoding. Now, for a binary file, that would not really make any sense. – njzk2 Jul 26 '15 at 00:26

3 Answers

I would write a simple helper function to read in the chunks you want:

def read_in_chunks(infile, chunk_size=1024):
    while True:
        chunk = infile.read(chunk_size)
        if chunk:
            yield chunk
        else:
            # The chunk was empty, which means we're at the end
            # of the file
            return

Then use it as you would `for line in file`, like so:

with open(fn, 'rb') as f:
    for chunk in read_in_chunks(f):
        ...  # do your stuff with that chunk

BTW: I asked THIS question 5 years ago and this is a variant of an answer at that time...
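
As a side note on the default: 1024 bytes is an arbitrary choice. `chunk_size` is just a trade-off between the memory held per chunk and the number of `read` calls, so a larger value such as 64 KiB is fine if memory allows.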


You can also do:

from functools import partial

with open(fn, 'rb') as f:
    for chunk in iter(partial(f.read, numBytes), b''):
        ...  # process each chunk
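
Note that `iter(callable, sentinel)` calls the callable repeatedly until it returns the sentinel; since `read` on a file opened in `'rb'` mode returns `bytes` in Python 3, the sentinel must be `b''` rather than `''`, or the loop would never terminate.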
dawg
  • I am reading that question now. I guess this is kind of a duplicate of that (sorry I didn't see it). As a follow-up, how do you determine the right `chunk_size`? – Startec Jul 25 '15 at 23:53
  • What is the characteristic of each chunk? How will you process it? Is the file too big to read in one go? When you have `for record in file:` there is usually some record-like relationship of each `record` to the whole `file`. You need to say more. – dawg Jul 25 '15 at 23:55
  • 5 years ago you were a "Python newbie"? – Startec Jul 25 '15 at 23:56
  • Indeed I was. Perl was my weapon before that and C before that. – dawg Jul 25 '15 at 23:58

Binary mode means that line endings aren't converted and that `bytes` objects are read (in Python 3); the file will still be read by "line" when using `for line in f`. I'd use `read` to read in consistent chunks instead, though.

with open(image_filename, 'rb') as f:
    # iter(callable, sentinel) – yield f.read(4096) until b'' appears
    for chunk in iter(lambda: f.read(4096), b''):
        ...  # process the chunk
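
To see what the question is asking about, here is a minimal check (the filename is a hypothetical stand-in) showing that iterating a file opened in `'rb'` mode still yields chunks split at `b'\n'`, just as `bytes` rather than `str`:

with open('image.png', 'rb') as f:  # hypothetical file
    for line in f:
        print(type(line))  # <class 'bytes'>
        print(line[-1:])   # b'\n' for every line except possibly the last
        break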
Ry-
  • Why the size of 4096? – Startec Jul 25 '15 at 23:47
  • Because you have to pick a size ... it doesn't matter which (great answer minitech) – Joran Beasley Jul 25 '15 at 23:47
  • Well, which does matter, just not by much. You must have that much memory free to use all at once. Otherwise why chunk? Slurp up the whole file. The problem with minitech or Joran trying to tell you how big it should be is that they don't know your system requirements, environment, or use case. When in doubt, try it out. Powers of 2 are popular because they're easy for the system to manage. – candied_orange Jul 26 '15 at 00:56

`for line in fh` will split at newlines regardless of how you open the file.

Often with binary files you consume them in chunks:

CHUNK_SIZE = 1024
# fh is assumed to be opened in binary mode, so read() returns bytes
for chunk in iter(lambda: fh.read(CHUNK_SIZE), b''):
    do_something(chunk)
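
For example, a sketch of the same idiom applied to hashing a large file with constant memory (this uses only the standard library's `hashlib`; the function name is illustrative, not from the answer above):

import hashlib

def sha256_of_file(path, chunk_size=64 * 1024):
    # Read in fixed-size chunks so memory use stays constant
    # no matter how big the file is.
    digest = hashlib.sha256()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()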
Joran Beasley