
I have a large input file which consists of data frames: each frame holds a complex64 data series plus an identifying header. The file is larger than my available memory. The headers repeat, but are randomly ordered, so for example the input file could look like:

<FRAME header={0}, data={**first** 500 numbers...}>,
<FRAME header={18}, data={first 500 numbers...}>,
<FRAME header={4}, data={first 500 numbers...}>,
<FRAME header={0}, data={**next** 500 numbers...}>
...

I want to order the data into a new file that is a numpy array of shape (len(headers), len(data_series)). I have to build the output file as I read the frames, because I can't fit it all in memory.

I've looked at numpy.savetxt and the python csv package, but for disk size, precision, and speed reasons I would prefer the output file to be binary. numpy.save would be good except that I can't figure out how to make it append to an array of unknown size.

I have to work in Python 2.7 because of some dependencies needed to read these frames. What I have done so far is make a function that writes all of the frames with a matching header to a single binary file:

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)

with open("singleFrameHeader", 'ab') as f:
    current_data = input_data.readFrame()  # This loads the next frame in the file
    if current_data.header == 0:
        float_arr = np.array(current_data.data).view(float)
        float_arr.tofile(f)

This works great, but I need to extend it to two dimensions. I'm starting to look at h5py as an option, but was hoping there is a simpler solution.
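For reference, here is a minimal sketch of what the h5py option might look like, assuming the final shape is known up front (the file and dataset names, shapes, and frame iterator below are illustrative, not from my actual reader):

import h5py
import numpy as np

# Hypothetical dimensions: num_headers rows, series_len float64 values per row.
num_headers, series_len = 32, 500 * 1000

with h5py.File('bigMatrix.h5', 'w') as f:
    dset = f.create_dataset('frames', shape=(num_headers, series_len),
                            dtype=np.float64, chunks=True)
    pos = {}                              # next write offset within each row
    for frame in frames:                  # hypothetical frame iterator
        data = np.array(frame.data).view(float)
        start = pos.get(frame.header, 0)
        dset[frame.header, start:start + data.size] = data
        pos[frame.header] = start + data.size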

What would be great is something like

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)

with open("bigMatrix", 'ab') as f:
    current_data = input_data.readFrame()  # This loads the next frame in the file
    index = current_data.header
    float_arr = np.array(current_data.data).view(float)
    float_arr.tofile(f, index)

Any help is appreciated. I thought reading and writing a 2D binary file in append mode would be a more common use case.

nicholas
    `tofile` writes a flat binary array - just the contents of the data buffer. Array attributes like shape and dtype are not saved. So whether the array is 2d or ravelled, it writes the same thing. – hpaulj Jul 22 '19 at 22:42
  • So all the data series are the same length? – Mad Physicist Jul 23 '19 at 01:49
  • @MadPhysicist yes all are the same length. – nicholas Jul 23 '19 at 17:39
  • @nicholas. I've updated my answer to include that information. It is standard procedure to remove your question off the unanswered queue by clicking on the check mark next to an answer that answers your question. – Mad Physicist Jul 23 '19 at 19:57

1 Answer


You have two problems: one is that a file contains sequential data, and the other is that numpy binary files don't store shape information.
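For illustration, a raw tofile/fromfile round trip drops both shape and dtype (the flat-buffer behavior hpaulj's comment describes):

import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
a.tofile('flat.bin')                            # 48 raw bytes; shape and dtype are lost
b = np.fromfile('flat.bin', dtype=np.float64)   # reader must know the dtype; the file doesn't say
print(b.shape)                                  # (6,) -- comes back one-dimensional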

A simple way to start solving this would be to carry through with your initial idea of converting the data into files by header, then combining all the binary files into one large product (if you still feel the need to do so).

You could maintain a map of the headers you've found so far to their output files, data size, etc. This will allow you to combine the data more intelligently, if for example, there are missing chunks or headers or something.

from contextlib import ExitStack  # Python 3; on 2.7 use the contextlib2 backport
from os import remove
from tempfile import NamedTemporaryFile
from shutil import copyfileobj
import sys

import numpy as np

class Header:
    __slots__ = ('id', 'count', 'file', 'name')
    def __init__(self, id):
        self.id = id
        self.count = 0
        self.file = NamedTemporaryFile(delete=False)
        self.name = self.file.name
    def write_frame(self, frame):
        data = np.array(frame.data).view(float)  # complex64 -> float64 view
        self.count += data.size
        data.tofile(self.file)

input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)
file_map = {}

with ExitStack() as stack:
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break  # recast this loop as necessary
        if frame.header not in file_map:
            header = Header(frame.header)
            stack.enter_context(header.file)
            file_map[frame.header] = header
        else:
            header = file_map[frame.header]
        header.write_frame(frame)

num_headers = max(file_map) + 1                      # header ids start at 0
max_count = max(h.count for h in file_map.values())  # longest series, in elements

with open('singleFrameHeader', 'wb') as output:
    # int.to_bytes is Python 3; on 2.7 write these with struct.pack instead
    output.write(num_headers.to_bytes(8, sys.byteorder))
    output.write(max_count.to_bytes(8, sys.byteorder))
    for i in range(num_headers):
        if i in file_map:
            h = file_map[i]
            with open(h.name, 'rb') as source:
                copyfileobj(source, output)
            remove(h.name)
            if h.count < max_count:    # pad short rows with NaN
                np.full(max_count - h.count, np.nan, dtype=np.float64).tofile(output)
        else:                          # missing header: a full row of NaN
            np.full(max_count, np.nan, dtype=np.float64).tofile(output)

The first 16 bytes are two native-endian int64 values: the number of header rows and the number of float64 elements per row. Keep in mind that the file is in native byte order, whatever that may be, and is therefore not portable.
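For what it's worth, here is a minimal sketch of reading the combined file back without loading it all into memory (np.memmap and struct are standard APIs; '=qq' matches the native-endian int64 preamble written above):

import struct

import numpy as np

# Read the two int64 preamble fields in native byte order.
with open('singleFrameHeader', 'rb') as f:
    n_rows, n_cols = struct.unpack('=qq', f.read(16))

# Memory-map the payload; rows are headers, columns are series elements.
matrix = np.memmap('singleFrameHeader', dtype=np.float64, mode='r',
                   offset=16, shape=(n_rows, n_cols))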

Alternative

If (and only if) you know the exact size of each header's dataset ahead of time, you can do this in one pass, with no temporary files. It also helps if the headers are contiguous; otherwise, missing swaths will be zero-filled. You will still need to maintain a dictionary of your current position within each header, but you will no longer have to keep a separate file pointer around for each one. All in all, this is a much better alternative than the original solution, if your use case allows it:

header_size = 500 * N * 8   # bytes per row: N frames of 500 float64 values; known up front
num_headers = ...           # must also be known up front
input_data = Funky_Data_Reader_that_doesnt_matter(input_filename)

header_map = {}             # header id -> byte offset already written in its row
with open('singleFrameHeader', 'wb') as output:
    # same 16-byte preamble as above (int.to_bytes is Python 3)
    output.write(num_headers.to_bytes(8, sys.byteorder))
    output.write((header_size // 8).to_bytes(8, sys.byteorder))
    while True:
        frame = input_data.next_frame()
        if frame is None:
            break
        data = np.array(frame.data).view(float)
        offset = header_map.get(frame.header, 0)
        output.seek(16 + frame.header * header_size + offset)
        data.tofile(output)
        header_map[frame.header] = offset + data.size * data.dtype.itemsize

I asked a question regarding this sort of out-of-order write pattern as a consequence of this answer: What happens when you seek past the end of a file opened for writing?
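As a quick illustration of that behavior (the gap created by seeking past the end reads back as zeros; whether it actually occupies disk space depends on the filesystem):

with open('demo.bin', 'wb') as f:
    f.seek(32)           # jump past the end of the empty file
    f.write(b'\x01')     # creates a 32-byte zero-filled gap

with open('demo.bin', 'rb') as f:
    assert f.read() == b'\x00' * 32 + b'\x01'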

Mad Physicist