
I need a way to store data efficiently (in both file size and read speed) using numpy arrays with mixed (heterogeneous) dtypes. Imagine a dataset with 100M observations and 5 variables per observation (3 of which are int32 and 2 are float32).

I'm currently storing the data in two gzipped .npy files, one for the ints and one for the floats:

import numpy as np
import gzip as gz

# Save the integer columns as one gzipped .npy file...
with gz.open('array_ints.npy.gz', 'wb') as fObj:
    np.save(fObj, int_ndarray)

# ...and the float columns as another.
with gz.open('array_floats.npy.gz', 'wb') as fObj:
    np.save(fObj, flt_ndarray)
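
For completeness, reading them back mirrors the save step, since np.load accepts any file-like object (the gzip wrapper is transparent here):

with gz.open('array_ints.npy.gz', 'rb') as fObj:
    int_ndarray = np.load(fObj)

with gz.open('array_floats.npy.gz', 'rb') as fObj:
    flt_ndarray = np.load(fObj)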

I've also tried storing the data as a Structured Array, but the final file size is roughly 25% larger than the combined size of storing the ints and floats separately. My data stretches into the terabyte range, so I'm looking for the most space-efficient way to store it (but I'd like to avoid switching compression algorithms to something like LZMA). Roughly, the structured-array attempt looked like the sketch below.
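
(A sketch of the structured-array variant; it assumes int_ndarray has shape (N, 3) and flt_ndarray has shape (N, 2), and the field names are placeholders.)

mixed_dtype = np.dtype([('a', '<i4'), ('b', '<i4'), ('c', '<i4'),
                        ('x', '<f4'), ('y', '<f4')])
struct_ndarray = np.empty(len(int_ndarray), dtype=mixed_dtype)

# Copy each column into the corresponding field.
for name, col in zip(('a', 'b', 'c'), int_ndarray.T):
    struct_ndarray[name] = col
for name, col in zip(('x', 'y'), flt_ndarray.T):
    struct_ndarray[name] = col

with gz.open('array_mixed.npy.gz', 'wb') as fObj:
    np.save(fObj, struct_ndarray)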

Is there another way to store different data types together efficiently, so I can read both in at the same time? I'm starting to look into HDF5, but I'm not sure it can help.

EDIT:
Ultimately, I ended up going down the HDF5 route with h5py. Relative to the gzip-compressed .npy arrays, I actually see a 25% decrease in size using h5py, though this is attributable to the shuffle filter rather than to HDF5 itself. More importantly, saving two arrays in the same file adds virtually no overhead relative to saving them in individual files. Roughly, the h5py version looks like the sketch below.
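
(A sketch of the h5py version; the file and dataset names are placeholders. The shuffle filter reorders bytes within each chunk so that gzip sees longer runs of similar bytes, which is where the size win comes from.)

import h5py

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('ints', data=int_ndarray,
                     compression='gzip', shuffle=True)
    f.create_dataset('floats', data=flt_ndarray,
                     compression='gzip', shuffle=True)

# Both arrays can be read back from the same file.
with h5py.File('data.h5', 'r') as f:
    int_ndarray = f['ints'][:]
    flt_ndarray = f['floats'][:]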

I realize that the original question was too broad and that good answers can't be given without the specific format of the data and a representative sample (which I can't really disclose). For this reason, I'm closing the question.

user1554752
  • You could save them as a tuple (or list) in a `pickle` file. e.g. `(int_ndarray, flt_ndarray)` – berkelem Mar 29 '19 at 13:02
  • https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez_compressed.html numpy does have an internal zip compressor (see the sketch after these comments). – user1462442 Mar 29 '19 at 13:04
  • Something else to consider is the level of precision you need. Do your integers have 64-bit precision (do they NEED 64-bit precision); can they be reduced to 16-bit, 8-bit? Can you apply a scaling factor on your floats to enable conversion to lower-precision int (e.g. if your float data has 1 decimal, multiply by 10 and convert to int)? You MIGHT be able to significantly reduce the memory size of your data by making these types of conversions. – tnknepp Mar 29 '19 at 13:10
  • One more thing, you can use gzip to compress your data while dumping to pickle: http://henrysmac.org/blog/2010/3/15/python-pickle-example-including-gzip-for-compression.html – tnknepp Mar 29 '19 at 13:14
  • `np.savez_compressed` saves several arrays to a compressed zip archive. I don't know how that compression compares to `gzip`. `HDF5` also has compression filters, and may be better if you want to fetch parts of the arrays. But for compressing the whole arrays it probably isn't more compact: there's more 'database'-like overhead. `np.save` writes the data buffer of the array to the file (plus a small header), so the size is the same as in memory. `pickle` uses `np.save`, so file size is similar. – hpaulj Mar 29 '19 at 15:25
  • Gzip is very slow. You can just use one of the Blosc algorithms https://github.com/Blosc/python-blosc (should provide well beyond 1 GB/s compression speed, depending on your hardware). An example of using HDF5 efficiently: https://stackoverflow.com/a/48997927/4045774 For a real answer more information is needed (How is the data read/written? Normally you don't read/write datasets >1TB in one step...) – max9111 Mar 30 '19 at 17:07
  • `gzip` won't work for `float32` since their binary representation will be essentially random. The best way to compress floating points is to reduce precision to `float16` or `float8` as @tnknepp mentioned – M1L0U Apr 06 '19 at 03:43
  • _"I've also tried storing the data as a Structured Array, but the final file size is roughly 25% larger"_ - this is due the data becoming less compressible, presumably - the uncompressed file will be marginally smaller than the two constituent uncompressed files, due to having a single header. – Eric Apr 15 '19 at 02:47

0 Answers