I have time-series data in sequential (packed C-struct) format in very large files. Each structure contains K fields of different types in some order, and the file is essentially a row-wise array of these structures. I would like to mmap the file and expose each field as a numpy array (or another array form) that strides through the file at the size of the struct, so the fields can be aliased as columns in a dataframe.
An example struct might be:
struct {
    int32_t a;
    double b;
    int16_t c;
}
Such a file of records could be generated in Python as:
from struct import pack
db = open("binarydb", "wb")
for i in range(1, 1000):
    # '<idh' = little-endian int32, float64, int16 with no padding
    packed = pack('<idh', i, i * 3.14, i * 2)
    db.write(packed)
db.close()
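For reference, each packed record works out to 14 bytes, which is the stride I refer to below:

from struct import calcsize
assert calcsize('<idh') == 14  # 4 (int32) + 8 (float64) + 2 (int16); '<' disables padding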
The question, then, is how to view such a file efficiently as a dataframe. Since the file may be hundreds of millions of rows long, I would need a mem-mapped solution.
Using memmap, how can I map a numpy array (or an alternative array structure) onto the sequence of integers for column a? It seems to me I would need to be able to indicate a stride (14 bytes, the record size) and an offset (0 in this case) for the int32 series "a", an offset of 4 for the float64 series "b", and an offset of 12 for the int16 series "c".
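To make that concrete, this is roughly the kind of view I have in mind: mmap the file myself and build one strided ndarray per column. This is just a sketch of the idea, assuming np.ndarray will accept the mmap object as a buffer with an explicit offset and strides; I don't know whether this is safe or the recommended way:

import mmap
import numpy as np

with open("binarydb", "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

recsize = 14              # 4 (int32) + 8 (float64) + 2 (int16), packed
n = len(buf) // recsize

# One strided view per column: same mapping, different offset and dtype.
a = np.ndarray((n,), dtype='<i4', buffer=buf, offset=0,  strides=(recsize,))
b = np.ndarray((n,), dtype='<f8', buffer=buf, offset=4,  strides=(recsize,))
c = np.ndarray((n,), dtype='<i2', buffer=buf, offset=12, strides=(recsize,))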
I have seen that one can easily create a numpy array against an mmap'ed file if the file contains a single dtype. Is there a way to pull out the different series in this file by indicating a type, offset, and stride? With that approach I could present the mmapped columns to pandas or another dataframe implementation.
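The closest I have gotten is a structured dtype on top of np.memmap, along these lines (again a sketch, assuming a packed structured dtype gives itemsize 14 and that field access returns a strided view rather than a copy; building the pandas DataFrame presumably copies into contiguous columns, which is what I'd like to avoid):

import numpy as np
import pandas as pd

rec = np.dtype([('a', '<i4'), ('b', '<f8'), ('c', '<i2')])  # packed: itemsize 14
recs = np.memmap("binarydb", dtype=rec, mode='r')

# Field access seems to yield strided views into the mapping
# (offsets 0 / 4 / 12, stride 14), one per column.
a = recs['a']
b = recs['b']
c = recs['c']

# Handing these to pandas works, but as far as I can tell it copies.
df = pd.DataFrame({name: recs[name] for name in rec.names})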
Even better, is there a simple way to integrate a custom mem-mapped format into Dask, so that I get the benefits of lazy paging into the file?
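To frame the Dask part of the question, I could imagine handing the memmapped structured array straight to Dask, roughly like this (a sketch, assuming dask.dataframe.from_array accepts a record/structured array and only slices each chunk when a partition is computed, so the OS pages the file in on demand):

import numpy as np
import dask.dataframe as dd

rec = np.dtype([('a', '<i4'), ('b', '<f8'), ('c', '<i2')])
recs = np.memmap("binarydb", dtype=rec, mode='r')

# The hope: each partition slices only its own range of records,
# so pages are faulted in lazily as partitions are computed.
ddf = dd.from_array(recs, chunksize=1_000_000)
print(ddf['a'].mean().compute())

Is that a reasonable pattern, or is there a better-supported way to plug a custom mem-mapped format into Dask?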