I have a large (21 GB) file that I want to read into memory and then pass to a subroutine which processes the data transparently to me. I am on Python 2.6.6 on CentOS 6.5, so upgrading the operating system or Python is not an option. Currently, I am using
f = open(image_filename, "rb")
image_file_contents = f.read()
f.close()
transparent_subroutine(image_file_contents)
which is slow (~15 minutes). Before I start reading the file, I know how big it is, because I call os.stat(image_filename).st_size,
so I could pre-allocate some memory if that made sense.
Thank you