
I calculate very large arrays (about 100 million integers each, and I do this 9,000 times), and the whole array doesn't fit in memory. So I write them to an .hdf5 file in chunks of a size that fits in my memory. I also use "lzf" compression because otherwise the .hdf5 file gets too big for my SSD.
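For reference, a minimal sketch of this write phase, assuming h5py is the library in use; the file name, dataset name, chunk shape, dtype and the `compute_column` placeholder are all illustrative (not taken from the question), and the sizes are scaled down so the snippet runs quickly:

```python
import numpy as np
import h5py

# Scaled-down demo sizes; the real shape in the question is ~100_000_000 x 9_000.
N_ROWS = 1_000_000
N_COLS = 50
CHUNK_ROWS = 100_000

def compute_column(j):
    """Stand-in for the real per-column calculation (hypothetical)."""
    rng = np.random.default_rng(j)
    col = rng.integers(0, 10, size=N_ROWS, dtype=np.int32)
    col[rng.random(N_ROWS) < 0.9] = 0        # ~90% zeros, as described in the question
    return col

with h5py.File("results.h5", "w") as f:
    dset = f.create_dataset(
        "data",
        shape=(N_ROWS, N_COLS),
        dtype="int32",
        chunks=(CHUNK_ROWS, 1),   # one column slice per chunk, sized to fit in RAM
        compression="lzf",
    )
    for j in range(N_COLS):
        dset[:, j] = compute_column(j)   # write each column as soon as it is computed
```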

After that I read chunks from the .hdf5 file again, this time row-wise, and do some other calculations (which are only possible if all columns of the array are available for each row). At that point the dimensions are about 100 million x 9,000.
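The read phase could then look like the following sketch (again assuming h5py and the file/dataset names from the snippet above; `ROW_BLOCK` and `process_rows` are made-up placeholders):

```python
import numpy as np
import h5py

ROW_BLOCK = 10_000   # number of full rows to hold in memory at once (assumption)

def process_rows(block):
    """Stand-in for the real per-row calculation that needs all columns."""
    return block.sum(axis=1)

with h5py.File("results.h5", "r") as f:
    dset = f["data"]
    n_rows = dset.shape[0]
    for start in range(0, n_rows, ROW_BLOCK):
        block = dset[start:start + ROW_BLOCK, :]   # contiguous block of rows, all columns
        result = process_rows(block)
        # ... do something with `result` ...
```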

So to sum up:

calculate one column (100 million entries) -> write to hdf5

read from hdf5 -> calculations on one row
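Because the columns are written one at a time but later read back row by row, the HDF5 chunk layout has to serve both access patterns: a chunk covering a single column makes every row read touch one chunk per column, while a purely row-oriented chunk makes every column write touch every chunk. One hedged compromise, assuming h5py and enough RAM to buffer a small block of columns; all numbers below are illustrative and would need tuning:

```python
import numpy as np
import h5py

# Scaled-down demo sizes; the real shape in the question is ~100_000_000 x 9_000.
N_ROWS, N_COLS = 1_000_000, 50
CHUNK = (100_000, 10)      # at the real row count this is a ~4 MB int32 chunk
COL_BLOCK = CHUNK[1]       # buffer this many columns so each chunk is written only once

def compute_column(j):
    """Stand-in for the real per-column calculation (hypothetical)."""
    return np.zeros(N_ROWS, dtype=np.int32)

with h5py.File("results_balanced.h5", "w") as f:
    dset = f.create_dataset("data", shape=(N_ROWS, N_COLS), dtype="int32",
                            chunks=CHUNK, compression="lzf")
    buf = np.empty((N_ROWS, COL_BLOCK), dtype=np.int32)   # ~4 GB at the real row count
    for j0 in range(0, N_COLS, COL_BLOCK):
        for j in range(COL_BLOCK):
            buf[:, j] = compute_column(j0 + j)
        dset[:, j0:j0 + COL_BLOCK] = buf    # each write lands on whole chunk columns

# In the second pass, read blocks of CHUNK[0] rows at a time so every compressed
# chunk only has to be decompressed once.
```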

The speed is kind of okay, but I don't know if there are better possibilities to speed this up. One additional piece of information I can give about the arrays: they are sparse, so about 90% of all entries are zeros.
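To make the sparsity concrete: with roughly 90% zeros, a scipy.sparse representation keeps only the non-zero values plus their indices, which for int32 data works out to around 20% of the dense size before any HDF5 compression. A small self-contained illustration (random data standing in for the real columns):

```python
import numpy as np
from scipy import sparse

# One demo column with ~90% zeros, standing in for the real data.
rng = np.random.default_rng(0)
col = rng.integers(0, 10, size=1_000_000, dtype=np.int32)
col[rng.random(col.size) < 0.9] = 0

sp_col = sparse.csc_matrix(col.reshape(-1, 1))   # stores only non-zero values + row indices
dense_mb = col.nbytes / 1e6
sparse_mb = (sp_col.data.nbytes + sp_col.indices.nbytes + sp_col.indptr.nbytes) / 1e6
print(f"dense:  {dense_mb:.1f} MB")
print(f"sparse: {sparse_mb:.1f} MB")   # one value + one int32 index per non-zero, ~20% of dense here
```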

Thank you for your help.

  • Please add some information. What is the chunk shape you write to the HDF5 dataset? How much RAM do you have available? Are faster compression algorithms also possible (e.g. BLOSC)? https://stackoverflow.com/a/48997927/4045774. What chunk_size and chunk_cache size are you using https://stackoverflow.com/a/48405220/4045774 ? How long does the whole computation take now? int32/int64? – max9111 Jan 08 '20 at 15:50
  • Another question: Are you using h5py or pytables? Note: SciPy offers methods to work with sparse matrices (scipy.sparse). It adds coding overhead. Might be worth it if you don't want to save 90% zeros. – kcw78 Jan 08 '20 at 16:56
  • The `scipy.sparse` package has functions for saving sparse matrices to `npz` files, but nothing for `h5`. You could model an `h5py` save on the sparse save. Also, sparse matrices aren't very suitable for chunked use. – hpaulj Jan 08 '20 at 17:31
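On the chunk-cache and BLOSC questions in the first comment: a hedged sketch of how both can be set from h5py, assuming h5py ≥ 2.9 (for the `rdcc_*` keyword arguments) and the optional `hdf5plugin` package for the Blosc filter; the cache sizes, compressor settings and demo shape are illustrative only:

```python
import numpy as np
import h5py
import hdf5plugin   # provides the BLOSC filter family for h5py (optional dependency)

with h5py.File(
    "results_blosc.h5",
    "w",
    rdcc_nbytes=512 * 1024**2,  # 512 MiB chunk cache per open dataset (default is only 1 MiB)
    rdcc_nslots=100_003,        # hash slots for the cache; pick a prime well above the cached chunk count
) as f:
    dset = f.create_dataset(
        "data",
        shape=(1_000_000, 50),          # scaled-down demo shape
        dtype="int32",
        chunks=(100_000, 10),
        **hdf5plugin.Blosc(cname="lz4", clevel=5,
                           shuffle=hdf5plugin.Blosc.SHUFFLE),  # often faster than lzf/gzip
    )
    dset[:, 0] = np.zeros(1_000_000, dtype=np.int32)   # example write
```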
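And on the scipy.sparse point in the last two comments: HDF5 has no native sparse type, but the three CSR component arrays can be stored as ordinary datasets, loosely modeled on what `scipy.sparse.save_npz` does. A minimal sketch, assuming h5py; the group layout and helper names are made up for illustration:

```python
import numpy as np
import h5py
from scipy import sparse

def save_csr(h5file, group_name, mat):
    """Store a CSR matrix as its three component arrays plus the shape (sketch)."""
    g = h5file.create_group(group_name)
    g.create_dataset("data", data=mat.data, compression="lzf")
    g.create_dataset("indices", data=mat.indices, compression="lzf")
    g.create_dataset("indptr", data=mat.indptr, compression="lzf")
    g.attrs["shape"] = mat.shape

def load_csr(h5file, group_name):
    """Rebuild the CSR matrix from the stored components."""
    g = h5file[group_name]
    return sparse.csr_matrix(
        (g["data"][:], g["indices"][:], g["indptr"][:]),
        shape=tuple(g.attrs["shape"]),
    )

# Usage with a small random block (~90% zeros) standing in for the real data.
rng = np.random.default_rng(0)
dense = rng.integers(0, 10, size=(1_000, 9_000), dtype=np.int32)
dense[rng.random(dense.shape) < 0.9] = 0
block = sparse.csr_matrix(dense)

with h5py.File("sparse_blocks.h5", "w") as f:
    save_csr(f, "block_0", block)
with h5py.File("sparse_blocks.h5", "r") as f:
    restored = load_csr(f, "block_0")
```

CSR keeps row slicing cheap, which suits the row-wise second pass; whether the sparse detour actually beats lzf-compressed dense chunks would have to be measured.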

0 Answers