I have a huge binary file (several GB) with the following data format:
Every 4 consecutive bytes form one composite datapoint (32 bits), which consists of:
b0-b3: 4 flag bits
b4-b17: 14-bit signed integer
b18-b31: 14-bit signed integer
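To spell the layout out (b0 is the most significant bit of the first byte, which is how my code below builds the bit string), each datapoint read as a big-endian 32-bit word splits like this; split_word is just an illustrative name:

def split_word(w):
    # w is one datapoint as an unsigned big-endian 32-bit integer.
    flags = (w >> 28) & 0xF      # b0-b3:   4 flag bits
    quad2 = (w >> 14) & 0x3FFF   # b4-b17:  raw 14-bit field (two's complement)
    quad1 = w & 0x3FFF           # b18-b31: raw 14-bit field (two's complement)
    return flags, quad2, quad1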
I need to access both signed integers and the flag bits separately and append them to a list or some smarter data structure (not yet decided). At the moment I'm using the following code to read it in:
from collections import namedtuple

DataPackage = namedtuple('DataPackage', ['ie', 'if1', 'if2', 'if3', 'quad2', 'quad1'])

def _unpack_integer(bits):
    # Interpret a bit string as a two's-complement signed integer.
    value = int(bits, 2)
    if bits[0] == '1':
        value -= (1 << len(bits))
    return value

def unpack(data):
    # Concatenate the 4 bytes into one 32-character bit string.
    bits = ''.join(['{0:08b}'.format(b) for b in bytearray(data)])
    flags = [bits[i] == '1' for i in range(4)]
    quad2 = _unpack_integer(bits[4:18])   # b4-b17
    quad1 = _unpack_integer(bits[18:])    # b18-b31
    return DataPackage(flags[0], flags[1], flags[2], flags[3], quad2, quad1)

def read_file(filename, datapoints=None):
    data = []
    i = 0
    with open(filename, 'rb') as fh:
        value = fh.read(4)
        while value:
            dp = unpack(value)
            data.append(dp)
            value = fh.read(4)
            i += 1
            if i % 10000 == 0:
                print('Read: %d kB' % (float(i) * 4.0 / 1000.0))
            if datapoints and i == datapoints:
                break
    return data

if __name__ == '__main__':
    data = read_file('test.dat')
This code works, but it is too slow for my purposes (2 s for 100k datapoints of 4 bytes each). I need at least a factor of 10 in speed.
The profiler says that the code spends most of its time in the string formatting (building the bit strings) and in _unpack_integer().
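For reference, the direction I have in mind is to do the same unpack with struct and integer bit operations instead of bit strings, which should avoid both hotspots. A minimal sketch reusing the DataPackage namedtuple from above (unpack_fast is a placeholder name; the field offsets assume the big-endian bit order of my current code):

import struct

def unpack_fast(data):
    # One datapoint = one big-endian unsigned 32-bit word.
    word, = struct.unpack('>I', data)
    flags = [bool(word & (1 << (31 - i))) for i in range(4)]
    quad2 = (word >> 14) & 0x3FFF   # b4-b17
    quad1 = word & 0x3FFF           # b18-b31
    # Sign-extend the 14-bit two's-complement fields.
    if quad2 & 0x2000:
        quad2 -= 1 << 14
    if quad1 & 0x2000:
        quad1 -= 1 << 14
    return DataPackage(flags[0], flags[1], flags[2], flags[3], quad2, quad1)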
Unfortunately I am not sure how to proceed here. I am thinking about either using Cython or writing some C code directly for the read-in. I also tried PyPy, and it gave me a huge performance gain, but unfortunately the code needs to stay compatible with a bigger project that does not work under PyPy.
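One pure-Python middle ground I have been sketching before reaching for Cython/C: read the file in large chunks and decode each chunk with struct.iter_unpack (Python 3.4+), so the per-datapoint overhead is just the bit arithmetic. read_file_chunked and the chunk_size value are illustrative:

import struct

def read_file_chunked(filename, chunk_size=4 * 65536):
    # Read many datapoints per system call; chunk_size is an arbitrary
    # multiple of 4, so every chunk holds only whole datapoints.
    data = []
    with open(filename, 'rb') as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            # iter_unpack yields one (word,) tuple per 4-byte record.
            for (word,) in struct.iter_unpack('>I', chunk):
                quad2 = (word >> 14) & 0x3FFF
                quad1 = word & 0x3FFF
                if quad2 & 0x2000:   # sign-extend the 14-bit fields
                    quad2 -= 1 << 14
                if quad1 & 0x2000:
                    quad1 -= 1 << 14
                data.append((word >> 28, quad2, quad1))  # (flag nibble, quad2, quad1)
    return data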