
It was recently asked how to do a file slurp in Python, and the accepted answer suggested something like:

with open('x.txt') as x: f = x.read()

How would I go about doing something like this, but reading the file in and converting the endian representation of the data?

For example, I have a 1 GB binary file that's just a bunch of single-precision floats packed big endian, and I want to convert it to little endian and dump it into a numpy array. Below is the function I wrote to accomplish this and some real code that calls it. I use struct.unpack to do the endian conversion and tried to speed everything up by using mmap.

My question then is, am I using the slurp correctly with mmap and struct.unpack? Is there a cleaner, faster way to do this? Right now what I have works, but I'd really like to learn how to do this better.

Thanks in advance!

#!/usr/bin/python
from struct import unpack
import mmap
import numpy as np

def mmapChannel(arrayName, fileName, channelNo, line_count, sample_count):
    """
    Read the ASF internal-format file and convert it into a numpy array.
    The data is stored as a single binary row; the number of lines (rows),
    samples (columns), and channels all come from the .meta text file.
    Internal-format files are packed big endian, but most systems use little
    endian, so we need to make that conversion as well.
    Memory mapping seemed to improve the ingestion speed a bit.
    """
    # memory-map the file, size 0 means whole file
    # length = line_count * sample_count * arrayName.itemsize
    print "\tMemory Mapping..."
    with open(fileName, "rb") as f:
        map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        map.seek(channelNo*line_count*sample_count*arrayName.itemsize)

        for i in xrange(line_count*sample_count):
            arrayName[0, i] = unpack('>f', map.read(arrayName.itemsize))[0]

        # Same method as above, just more verbose for the maintenance programmer.
        #        for i in xrange(line_count*sample_count): #row
        #            be_float = map.read(arrayName.itemsize) # arrayName.itemsize should be 4 for float32
        #            le_float = unpack('>f', be_float)[0] # > for big endian, < for little endian
        #            arrayName[0, i]= le_float

        map.close()
    return arrayName

print "Initializing the Amp HH HV, and Phase HH HV arrays..."
HHamp = np.ones((1, line_count*sample_count), dtype='float32')
HHphase = np.ones((1, line_count*sample_count), dtype='float32')
HVamp = np.ones((1, line_count*sample_count), dtype='float32')
HVphase = np.ones((1, line_count*sample_count), dtype='float32')



print "Ingesting HH_amp..."
HHamp = mmapChannel(HHamp, 'ALPSRP042301700-P1.1__A.img', 0, line_count, sample_count)
print "Ingesting HH_phase..."
HHphase = mmapChannel(HHphase, 'ALPSRP042301700-P1.1__A.img', 1, line_count, sample_count)
print "Ingesting HV_amp..."
HVamp = mmapChannel(HVamp, 'ALPSRP042301700-P1.1__A.img', 2, line_count, sample_count)
print "Ingesting HV_phase..."
HVphase = mmapChannel(HVphase, 'ALPSRP042301700-P1.1__A.img', 3, line_count, sample_count)

print "Reshaping...."
HHamp_orig = HHamp.reshape(line_count, -1)
HHphase_orig = HHphase.reshape(line_count, -1)
HVamp_orig = HVamp.reshape(line_count, -1)
HVphase_orig = HVphase.reshape(line_count, -1)
Foofy
  • I wanted to add to this, for anyone else who finds this post useful. Running the original code I had takes about 80 seconds or so. Running the solution provided by Alex Martelli and J F Sebastian is less than a second. The program that calls this function does so many times. As such, the running time has dropped considerably. Thank you both for the help and for teaching me something =) – Foofy Nov 07 '09 at 21:03

4 Answers


A slightly modified version of @Alex Martelli's answer:

arr = numpy.fromfile(filename, numpy.dtype('>f4'))
# no byteswap is needed regardless of the endianness of the machine
jfs
  • You probably would like to combine that with .astype to get it into a native format, e.g. `arr = numpy.fromfile (filename, numpy.dtype ('>f4')).astype (np.float32)` – RolKau Aug 21 '15 at 22:18
  • @RolKau no. Try to run some code with and without the call and see what happens. – jfs Aug 22 '15 at 01:51
  • @JFSebastian Maybe I have extrapolated from this too far from what is the actual case in the question, but consider the Python code: `b = bytearray ([0, 0, 0, 1]); a = numpy.frombuffer (b, dtype = numpy.dtype ('>i4')); c = a.astype (numpy.int32); (a.tostring (), c.tostring ())`. On my platform (Linux, Python 2.7, x86_64) I get the results a = `\x00\x00\x00\x01`, c = `\x01\x00\x00\x00`, which I interpret that only c was stored internally as little-endian. – RolKau Aug 25 '15 at 09:26
  • @RolKau: say you have `3.14` value that is `b'@H\xf5\xc3'` in `'>f4'` then `np.frombuffer(b'@H\xf5\xc3', '>f4')[0]` is (close to) `3.14` i.e., the value is read correctly (if you were to read it as little-endian then the result would be `-490.56445`). The buffer (bytes) stays the same whatever format you use (you can interpret the same bytes as an integer). If you want to save it in some other format; you can but it is unrelated to the question. – jfs Aug 25 '15 at 12:11
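To make the dtype point in the thread above concrete, here is a small sketch (the 3.14 value follows the comment discussion; everything else is made up for illustration) showing that an explicit '>f4' dtype reads big-endian bytes correctly on any host, with no .astype call needed:

```python
import struct

import numpy as np

# Pack 3.14 as a big-endian 32-bit float, then read it back with an
# explicit big-endian dtype: the value round-trips on any machine.
be_bytes = struct.pack('>f', 3.14)           # b'@H\xf5\xc3'
arr = np.frombuffer(be_bytes, dtype='>f4')
print(arr[0])                                # ~3.14 on any host

# Misreading the same bytes as little-endian gives garbage:
wrong = np.frombuffer(be_bytes, dtype='<f4')
print(wrong[0])                              # -490.56445
```

(The sketch uses Python 3 print syntax; in the Python 2 era of this question you would drop the parentheses.)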
with open(fileName, "rb") as f:
  arrayName = numpy.fromfile(f, numpy.float32)
arrayName.byteswap(True)

Pretty hard to beat for speed AND conciseness ;-). For byteswap, see the numpy documentation (the True argument means "do it in place"); likewise for fromfile.

This works as is on little-endian machines (since the data are big-endian, the byteswap is needed). To do the byteswap only when necessary, you can test for that case and change the last line from an unconditional call to byteswap into, for example:

if struct.pack('=f', 2.3) == struct.pack('<f', 2.3):
  arrayName.byteswap(True)

i.e., a call to byteswap conditional on a test of little-endianness.
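A minimal runnable sketch of the conditional byteswap, using a couple of made-up in-memory values in place of the real file (sys.byteorder is an equivalent way to detect host endianness):

```python
import sys

import numpy as np

# Stand-in for the big-endian file contents: two float32 values.
data = np.array([1.5, -2.25], dtype='>f4').tobytes()

# frombuffer returns a read-only view, so copy before swapping in place.
arr = np.frombuffer(data, dtype=np.float32).copy()
if sys.byteorder == 'little':
    arr.byteswap(True)   # in-place swap, needed only on little-endian hosts
print(arr.tolist())      # [1.5, -2.25] on any host
```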

Alex Martelli
  • That is remarkably straightforward, thank you. What's weird is I had seen those when trying to figure out how to do this, but it just didn't register for some reason. Comes with experience I suppose =) – Foofy Oct 27 '09 at 20:39
  • numpy.float32 has native byte order, which might not always be big-endian. http://stackoverflow.com/questions/1632673/python-file-slurp-w-endian-conversion/1633525#1633525 – jfs Oct 27 '09 at 20:44
  • Indeed it will mostly be little-endian, but if you're running e.g. on a Power PC machine it will be big endian (if that's an issue just conditionally omit the byteswap call -- let me edit the answer to add that bit). – Alex Martelli Oct 27 '09 at 21:38
  • Testing sys.byteorder is a little more straightforward than using struct.pack. – Jim Hunziker Mar 27 '11 at 02:00

You could cobble together an ASM-based solution using CorePy. I wonder, though, if you might be able to gain enough performance from some other part of your algorithm. I/O and manipulations on 1 GB chunks of data are going to take a while whichever way you slice it.

One other thing you might find helpful would be to switch to C once you have prototyped the algorithm in Python. I did this once for manipulations on a whole-world DEM (height) data set. The whole thing was much more tolerable once I got away from the interpreted script.

Ewan Todd

I'd expect something like this to be faster:

arrayName[0] = unpack('>'+'f'*line_count*sample_count, map.read(arrayName.itemsize*line_count*sample_count))

Please don't use map as a variable name, by the way; it shadows the built-in.
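A self-contained sketch of the bulk-unpack idea (the four sample values and the in-memory payload are made up for illustration):

```python
import struct

import numpy as np

count = 4
# Fake big-endian "file" contents: four packed float32 values.
payload = struct.pack('>%df' % count, 1.0, 2.0, 3.0, 4.0)

# One unpack call for the whole block ('>%df' % count is equivalent
# to '>' + 'f' * count) instead of one call per value.
values = struct.unpack('>%df' % count, payload)
arr = np.array(values, dtype='float32')
print(arr.tolist())  # [1.0, 2.0, 3.0, 4.0]
```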

John La Rooy