I need a way to store data efficiently (both in file size and read speed) using NumPy arrays with mixed (heterogeneous) dtypes. Imagine a dataset with 100M observations and 5 variables per observation (3 of which are int32, and 2 are float32).
I'm currently storing the data in two gzipped .npy files, one for the ints and one for the floats:
import numpy as np
import gzip as gz

# Save each homogeneous array to its own gzip-compressed .npy file
with gz.open('array_ints.npy.gz', 'wb') as fObj:
    np.save(fObj, int_ndarray)
with gz.open('array_floats.npy.gz', 'wb') as fObj:
    np.save(fObj, flt_ndarray)
I've also tried storing the data as a structured array (see the sketch below), but the resulting file is roughly 25% larger than the two separately stored files combined. My data stretches into the TB range, so I'm looking for the most space-efficient way to store it (though I'd like to avoid switching compression algorithms to something like LZMA).
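For reference, the structured-array attempt looked roughly like this (the field names here are made up; the real schema is different):

import numpy as np
import gzip as gz

# Hypothetical schema: 3 int32 fields and 2 float32 fields per observation
rec_dtype = np.dtype([('i1', 'i4'), ('i2', 'i4'), ('i3', 'i4'),
                      ('f1', 'f4'), ('f2', 'f4')])

n_obs = 1_000_000  # placeholder; the real dataset has ~100M rows
records = np.zeros(n_obs, dtype=rec_dtype)  # filled from the real data

with gz.open('array_records.npy.gz', 'wb') as fObj:
    np.save(fObj, records)

My guess is that the size penalty comes from the row-wise (array-of-structs) layout: interleaving int and float bytes within each 20-byte record gives gzip fewer long runs of similar bytes than two contiguous homogeneous blocks do.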
Is there another way to efficiently store different data types together, so that I can read both in at the same time? I'm starting to look into HDF5, but I'm not sure it can help.
EDIT:
Ultimately, I ended up going down the HDF5 route with h5py (see the sketch below). Relative to gzip-compressed .npy arrays, I actually see a 25% decrease in file size, which can largely be attributed to the shuffle filter. And when saving two arrays in the same file, there is virtually no overhead relative to saving them in individual files.
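A minimal sketch of what I ended up doing (the file and dataset names are arbitrary, and the placeholder arrays stand in for my real data):

import h5py
import numpy as np

int_ndarray = np.zeros((1_000_000, 3), dtype=np.int32)   # placeholder data
flt_ndarray = np.zeros((1_000_000, 2), dtype=np.float32)

# Both arrays live in one file, each with gzip compression plus the
# shuffle filter applied per chunk
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('ints', data=int_ndarray,
                     compression='gzip', shuffle=True, chunks=True)
    f.create_dataset('floats', data=flt_ndarray,
                     compression='gzip', shuffle=True, chunks=True)

# Read both back from a single open file
with h5py.File('data.h5', 'r') as f:
    ints = f['ints'][:]
    floats = f['floats'][:]

The shuffle filter rearranges the bytes within each chunk so that, e.g., the first bytes of all values are stored together, which tends to compress much better for numeric data.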
I realize that the original question was too broad and that a good answer depends on the specific format of the data and a representative sample (which I can't really disclose). For this reason, I'm closing the question.