
I have timeseries data in sequential (packed c-struct) format in very large files. Each structure contains K fields of different types in a fixed order. The file is essentially a row-wise array of these structures. I would like to mmap the file and map each field onto a numpy array (or another array form) that honours a stride (the size of the struct), so the fields can be aliased as columns in a dataframe.

An example struct might be:

struct {
   int32_t a;    /* offset 0  */
   double  b;    /* offset 4  */
   int16_t c;    /* offset 12 */
};               /* packed, no padding: 14 bytes per record */

Such a file of records could be generated with python as:

from struct import pack

# each record is '<idh': little-endian int32, float64, int16 -> 14 bytes
db = open("binarydb", "wb")
for i in range(1, 1000):
    packed = pack('<idh', i, i * 3.14, i * 2)
    db.write(packed)
db.close()
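
As a sanity check (not the solution I am after, just confirming the layout), the packed records can be read back with the struct module:

from struct import calcsize, iter_unpack

record_size = calcsize('<idh')      # 14 bytes per packed record
with open("binarydb", "rb") as db:
    rows = list(iter_unpack('<idh', db.read()))  # [(1, 3.14, 2), (2, 6.28, 4), ...]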

The question is then how to view such a file efficiently as a dataframe. If the file is hundreds of millions of rows long, we would need a mem-mapped solution.

Using memmap, how can I map a numpy array (or an alternative array structure) onto the sequence of integers for column a? It seems to me that I would need to be able to indicate a stride (14 bytes, the record size) and an offset (0 in this case) for the int32 series "a", an offset of 4 for the float64 series "b", and an offset of 12 for the int16 series "c".

I have seen that one can easily create a numpy array against an mmap'ed file if the file contains a single dtype. Is there a way to pull out the different series in this file by indicating a type, offset, and stride? With this approach I could present the mmapped columns to pandas or another dataframe implementation.
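
For concreteness, the kind of layout description I have in mind (guessing that numpy's structured dtypes with explicit offsets might be the right tool) would look something like:

import numpy as np

# Hypothetical layout: field name, type, and byte offset within each 14-byte
# record; 'itemsize' plays the role of the stride.
layout = np.dtype({
    'names':    ['a', 'b', 'c'],
    'formats':  [np.int32, np.float64, np.int16],
    'offsets':  [0, 4, 12],
    'itemsize': 14,
})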

Even better, is there a simple way to integrate a custom mem-mapped format into Dask, such that I get the benefits of lazy paging into the file?

Jonathan Shore
  • could you provide an [mre]? providing a general solution without a file type, let alone sample data or code, is a bit tough. It's easier for us to approach a specific implementation ask that you could generalize rather than asking a super general question that we have to cover all bases for (e.g. would definitely appreciate changing "K fields of different types in some order" --> "here is an example file") – Michael Delgado Jun 23 '22 at 16:10
  • if it's something like a numpy array with a memmap interface you could possibly route this through dask array - see https://stackoverflow.com/questions/72663716/how-to-efficiently-convert-npy-to-xarray-zarr/72666271#72666271 – Michael Delgado Jun 23 '22 at 16:13
  • @MichaelDelgado Hard to provide python code as I am looking for said solution. That said, imagine a binary file of 100 million pairs in sequence - i.e. the first 4 bytes are a little-endian integer and the next 8 bytes are an 8-byte floating point number. The next record in the file is exactly the same, 4-byte int, 8-byte float. This is repeated 100 million times. The question is then, using memmap and python facilities, how do I extract the int32 series and then the float64 series? Each series is not contiguous; rather, for the int32 series, there are 8 bytes to skip between values. – Jonathan Shore Jun 23 '22 at 17:19
  • that would be great! at the moment I don't even know how to read a subset of your file in memory, so the dask solution is hard to write up :) if *you* don't know how to read your file into python at all, then I think you have a prerequisite question on your hands :P – Michael Delgado Jun 23 '22 at 17:35
  • ahh i just saw the update. this is super low level - I expected something in numpy. can you demo the full workflow from this to an array or pd.DataFrame? How would you read into a pd.DataFrame given an arbitrary offset? that would be super helpful – Michael Delgado Jun 23 '22 at 17:40
  • Note that if you read only a few fields of the structure at a time, then reading this structure is not efficient due to wasted space. For example, a time filter will only read 8 bytes while the structure takes 40 bytes (due to padding), causing 80% of the data read to be unused (and so a 5x slower execution than the optimal). – Jérôme Richard Jun 23 '22 at 17:50

2 Answers


You can use numpy.memmap to do that. Since your records are not a single native type, you need to use a Numpy structured data type. Note that you need the size of the array ahead of time, since Numpy supports only fixed-size arrays, not unbounded streams.

import numpy as np

size = 999  # number of records

# Structured dtype matching the packed record: itemsize is 4 + 8 + 2 = 14 bytes
datatype = np.dtype([('a', np.int32), ('b', np.float64), ('c', np.int16)])

# The final memory-mapped array
data = np.memmap("binarydb", dtype=datatype, mode='write', shape=size)

# Indices run 0..size-1; the values 1..size are written field by field
for i in range(size):
    data[i]['a'] = i + 1
    data[i]['b'] = (i + 1) * 3.14
    data[i]['c'] = (i + 1) * 2

Note that vectorized operations are generally much faster than direct indexing in Numpy. Numba can also be used to speed up direct indexing if the operation cannot be vectorized.
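
For example, the element-wise loop above could be replaced by column-wise assignments (a sketch assuming the same size and datatype as above):

import numpy as np

idx = np.arange(1, size + 1)

# Assign whole columns at once instead of indexing record by record
data['a'] = idx
data['b'] = idx * 3.14
data['c'] = idx * 2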

Note that a memory-mapped array can be flushed (with flush()), but Numpy does not yet provide a way to explicitly close the underlying mapping.
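
For example, a minimal clean-up sketch (assuming data is the memmap created above):

data.flush()   # write any pending changes back to the file
del data       # drop the reference; the mapping is released once collected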

Jérôme Richard
  • This shows how to write to it, but it pointed me in the right direction for the read (the question). Thx! Will post the read code below, but flagged yours as the answer. – Jonathan Shore Jun 23 '22 at 19:05

Extrapolating from @Jérôme Richard's answer above, here is code to read from a binary sequence of records:

import numpy as np

size = 999
# Note: field 'c' must be int16 to match the '<idh' layout used to write the file
datatype = np.dtype([('a', np.int32), ('b', np.float64), ('c', np.int16)])

# The final memory-mapped array (read-only view onto the file)
data = np.memmap("binarydb", dtype=datatype, mode='readonly', shape=size)

Each series can then be pulled out as:

data['a']
data['b']
data['c']
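
To get from these memmapped columns to a dataframe, one option (a sketch, assuming pandas and dask are available; building a pandas DataFrame will generally copy the columns into memory, whereas dask keeps access lazy and chunked) is:

import pandas as pd
import dask.array as da
import dask.dataframe as dd

# pandas: convenient, but likely materializes (copies) the columns in RAM
df = pd.DataFrame({name: data[name] for name in ('a', 'b', 'c')})

# dask: wrap each memmapped field as a lazy, chunked array, then combine
chunks = 100_000  # rows per chunk; tune to the workload
columns = [
    dd.from_dask_array(da.from_array(data[name], chunks=chunks), columns=name)
    for name in ('a', 'b', 'c')
]
ddf = dd.concat(columns, axis=1)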
Jonathan Shore