
I would like to save a huge number (millions) of small numpy matrices (shape=(224, 224, 1), dtype=np.float32) to disk.

I would like to have efficient access, and some kind of transparent compression.

One solution which works for me is HDF5 files (via the h5py library). There is only one drawback: I also need multithreaded access. I would like to write to the files from a few threads and handle write conflicts myself.
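
Roughly the kind of thing I have in mind (just a sketch, not my real code; the file name, dataset names, matrix count and compression setting are only illustrative):

```python
import numpy as np
import h5py

# Sketch: one small compressed dataset per matrix (illustrative names/settings).
with h5py.File("matrices.h5", "w") as f:
    for i in range(1000):  # millions in the real case
        arr = np.random.rand(224, 224, 1).astype(np.float32)
        f.create_dataset(f"m_{i}", data=arr, compression="gzip")
```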

Is there any suggested technology/database I should use?

Tomek
  • SQL Server has table compression – user1443098 Apr 30 '18 at 14:20
  • Have you checked [Dask](https://dask.pydata.org) for any off-the-shelf solutions? – MPA Apr 30 '18 at 14:28
  • It depends on which device you are writing to. On an HDD, you are usually limited by disk-IO speed even on a single thread (depending on the compression ratio), so there isn't a speedup from using multiple threads for data IO. It is also much more efficient to put all arrays in a single dataset, since the overhead of creating an HDF5 dataset is quite high. For performance comparison I would recommend trying h5py in one thread first, e.g. https://stackoverflow.com/a/48997927/4045774 – max9111 May 03 '18 at 08:03
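
For what it's worth, here is a minimal sketch of what the last comment suggests: all arrays go into one chunked, compressed dataset, and a single writer thread drains a queue so the worker threads never touch the HDF5 file directly. The file and dataset names, chunk shape, matrix count and the lzf setting are assumptions, not tested advice:

```python
import queue
import threading
import numpy as np
import h5py

N = 10_000  # number of matrices to store (millions in the real case)
q = queue.Queue(maxsize=64)

def writer():
    # Single writer thread: the only place that touches the HDF5 file.
    with h5py.File("matrices.h5", "w") as f:
        dset = f.create_dataset(
            "matrices", shape=(N, 224, 224, 1), dtype=np.float32,
            chunks=(1, 224, 224, 1),   # one matrix per chunk
            compression="lzf")         # fast, transparent compression
        while True:
            item = q.get()
            if item is None:           # sentinel: no more work
                break
            idx, arr = item
            dset[idx] = arr

t = threading.Thread(target=writer)
t.start()

# Producer threads would call q.put((index, array)); a single loop stands in here.
for i in range(N):
    q.put((i, np.zeros((224, 224, 1), dtype=np.float32)))
q.put(None)
t.join()
```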

0 Answers