
I need to read a binary file in Python -- the contents are signed 16-bit integers, big-endian.

Other Stack Overflow questions suggest how to pull in several bytes at a time, but is that the way to scale up to reading a whole file?

I thought I would create a function like:

from numpy import *
import os
import struct

def readmyfile(filename, bytes=2, endian='>h'):
    totalBytes = os.path.getsize(filename)
    values = empty(totalBytes/bytes)
    with open(filename, 'rb') as f:
        for i in range(len(values)):
            values[i] = struct.unpack(endian, f.read(bytes))[0]
    return values

filecontents = readmyfile('filename')

But this is quite slow (the file is 165,924,350 bytes). Is there a better way?

hatmatrix

5 Answers

Use numpy.fromfile.
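For the file in the question, a minimal sketch would be (assuming the whole file is big-endian signed 16-bit integers, which is what the '>i2' dtype string spells out):

import numpy as np

# '>i2' = big-endian ('>'), signed integer ('i'), 2 bytes wide.
values = np.fromfile('filename', dtype='>i2')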

Karl Knechtel

I would read directly until EOF (which means checking for an empty string from read()), removing the need to use range() and getsize.
Alternatively, using xrange (instead of range) should improve things, especially for memory usage.
Moreover, as Falmarri suggested, reading more data at a time would improve performance quite a lot.

That said, I would not expect miracles, partly because I am not sure a list is the most efficient way to store that amount of data.
What about using NumPy's array and its facilities for reading/writing binary files? The linked page has a section about reading raw binary files using numpyio.fread. I believe this should be exactly what you need.

Note: personally, I have never used NumPy; however, its main raison d'être is handling big sets of data, which is exactly what you are doing in your question.
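As a rough sketch of the chunked read-until-EOF idea described above (illustrative only -- the helper name and chunk size are arbitrary, and it assumes the file length is a multiple of 2), collecting the decoded shorts and converting to a NumPy array once at the end:

import struct
import numpy

def read_in_chunks(filename, chunk_size=64*1024, endian='>h'):
    values = []
    with open(filename, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # an empty string means EOF
                break
            # Unpack every short in the chunk with a single format string.
            fmt = endian[0] + str(len(chunk) // 2) + endian[1]
            values.extend(struct.unpack(fmt, chunk))
    return numpy.array(values)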

rob

I have had the same kind of problem, although in my particular case I had to convert a very strange binary format: a 500 MB file with interlaced blocks of 166 elements that were 3-byte signed integers, so I also had the problem of converting from 24-bit to 32-bit signed integers, which slows things down a little.

I resolved it using NumPy's memmap (which is just a handy way of using Python's memmap) and struct.unpack on large chunks of the file.

With this solution I'm able to convert (read, do stuff, and write to disk) the entire file in about 90 seconds (timed with time.clock()).

I could upload part of the code.
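For the plain 16-bit case in the question (no 24-bit repacking), the memmap idea alone looks roughly like this -- a sketch of the approach, not the code referred to above:

import numpy

# Map the file as big-endian signed 16-bit integers; nothing is read until you index it.
values = numpy.memmap('filename', dtype='>i2', mode='r')

# Slices are read from disc lazily; copy out if you need a regular in-memory array.
first_block = numpy.array(values[:1000])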

RobiC

You're reading and unpacking 2 bytes at a time:

values[i] = struct.unpack(endian, f.read(bytes))[0]

Why don't you read, say, 1024 bytes at a time?
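For instance (a sketch, assuming the file size is a multiple of 2), read a block and unpack all the shorts in it with one format string:

import struct

with open('filename', 'rb') as f:
    block = f.read(1024)                           # up to 512 big-endian shorts
    count = len(block) // 2
    shorts = struct.unpack('>%dh' % count, block)  # one unpack call per block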

Falmarri

I think the bottleneck you have here is twofold.

Depending on your OS and disc controller, the calls to f.read(2), with f being a biggish file, are usually efficiently buffered -- usually. In other words, the OS will read one or two sectors (disc sectors are usually several KB) off disc into memory, because this is not much more expensive than reading 2 bytes from that file. The extra bytes are cached in memory, ready for the next call to read that file. Don't rely on that behavior -- it might be your bottleneck -- but I think there are other issues here.

I am more concerned about the one-at-a-time conversions of two bytes to a short and the element-by-element assignments into the numpy array. These are not cached at all. You can keep all the shorts in a Python list of ints and convert the whole list to numpy when (and if) needed. You can also make a single call to struct.unpack_from to convert everything in a buffer rather than one short at a time.

Consider:

#!/usr/bin/python

import random
import os
import struct
import numpy


def read_wopper(filename, bytes=2, endian='>h'):
    """Read big-endian shorts in large blocks and return them as a numpy array."""
    buf_size = 1024 * 2
    new_buf = []

    with open(filename, 'rb') as f:
        while True:
            st = f.read(buf_size)
            l = len(st)
            if l == 0:           # empty string means EOF
                break
            # One format string for the whole block, e.g. '>1024h'
            fmt = endian[0] + str(l / bytes) + endian[1]
            new_buf += struct.unpack_from(fmt, st)

    return numpy.array(new_buf)


fn = 'bigintfile'


def createmyfile(filename):
    bytes = 165924350
    endian = '>h'
    f = open(filename, "wb")
    count = 0

    try:
        for _ in xrange(bytes / 2):
            # The first 32,767 values are [0, 1, 2 .. 0x7FFF]
            # so the read values can be checked against their index.
            value = count if count < 0x7FFF else random.randint(-32767, 32767)
            count += 1
            # Note: the mask keeps every stored value in 0..0x7FFF.
            f.write(struct.pack(endian, value & 0x7FFF))

    except IOError:
        print "file error"

    finally:
        f.close()


if not os.path.exists(fn):
    print "creating file, don't count this..."
    createmyfile(fn)
else:
    read_wopper(fn)
    print "Done!"

I created a file of random shorts, 165,924,350 bytes (158.24 MB), which corresponds to 82,962,175 2-byte values. With this file, the read_wopper function above ran in:

real        0m15.846s
user        0m12.416s
sys         0m3.426s

If you don't need the shorts in a numpy array, this function runs in 6 seconds. All this on OS X, Python 2.6.1 64-bit, 2.93 GHz Core i7, 8 GB RAM. If you change buf_size=1024*2 in read_wopper to buf_size=2**16, the run time is:

real        0m10.810s
user        0m10.156s
sys         0m0.651s

So your main bottleneck, I think, is the one-value-at-a-time calls to unpack -- not your 2-byte reads from disc. You might want to make sure that your data files are not fragmented and, if you are using OS X, that your free disc space is not fragmented.

Edit: I posted the full code to create and then read a binary file of ints. On my iMac, I consistently get < 15 s to read the file of random ints. It takes about 1:23 to create, since the creation writes one short at a time.

the wolf
  • Thanks -- will try this out tomorrow, though I'm currently using numpy.fromfile -- but this could be great for machines without numpy installed (which is not trivial for all the different machines I administer)! – hatmatrix Dec 13 '10 at 08:21
  • Hmm... still 2m50s (OS X, Python 2.6 64-bit, 4 GB of RAM)... thanks for the insight on the caching, though – hatmatrix Dec 13 '10 at 08:43
  • @Stephen: Is there something unusual about your disc? Is the disc format NTFS or really full or fragmented? If it is NTFS, the OS X NTFS driver is not fast. I will post my full code, and try it on a relatively empty HFS drive... – the wolf Dec 13 '10 at 17:42
  • Strange, it's OS X Extended (Journaled) but bash-3.2$ time python test.py creating file, don't count this... real 2m28.376s user 2m6.882s sys 0m3.664s bash-3.2$ time python test.py Done! real 0m28.485s user 0m23.273s sys 0m1.509s – hatmatrix Dec 14 '10 at 10:07
  • Oops, that didn't format well -- in any case, writing took longer but reading was quicker. I wonder if there is something different about the binary file. Unix head seems to freeze(?) on bigintfile, whereas it doesn't on my other file. I appreciate all of your input... – hatmatrix Dec 14 '10 at 10:09
  • @Stephen: OK, your times are near mine. Your "create" time is `2m28s` total vs `1m33s` on my system. I just wrote something quick and dirty to create the file and it writes 1 short at a time. More importantly: `read_wopper` takes `28.5s` to read and convert to a numpy array vs `15.0s` on my system. The difference is probably memory and a newer, faster, emptier disc on my sys. You can make yours faster by increasing `buf_size` and making sure the file is not fragmented. With `buf_size=2**16`, the read_wopper function takes `10.8s`. How long does `numpy.fromfile` take? – the wolf Dec 14 '10 at 18:00
  • @Stephen: I really suggest looking at your disc as a potential issue -- see [HERE](http://gigaom.com/apple/disk-fragmentation-os-x-when-does-it-become-a-problem/) and [HERE](http://osxbook.com/book/bonus/chapter12/hfsdebug/fragmentation.html) -- and check whether your disc has contiguous blocks big enough for the memory-intensive thing you are trying to do. OS X only autodefrags small files on the boot volume; large files or files on an external drive can get very fragmented, as can free space on the disc. Either case will cause very slow access to that file. – the wolf Dec 14 '10 at 18:17
  • @carrot-top: Thanks so much -- I do have to clean up my computer a bit. I'm testing these scripts on a pretty worn-down machine with 105/110 GB filled up. But numpy.fromfile takes only 0m6.458s on the same machine -- and seems to read it in correctly, which is pretty amazing... – hatmatrix Dec 15 '10 at 12:09
  • @Stephen: `Thanks so much`: You are welcome! `105/110 GB filled up...`: Yes, an issue in more ways than one! `numpy.fromfile takes only 0m6.458s`: That's your solution then! cheerio ;-]] – the wolf Dec 15 '10 at 16:00