
Is there any way to store an array in an HDF5 file when it is too big to load into memory?

If I do something like this:

import h5py
import numpy as np
f = h5py.File('test.hdf5','w')
f['mydata'] = np.zeros(2**32)

I get a MemoryError.

– Sounak

  • Take a look at [hyperslabs](http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage). It is possible, but you should write in 'chunks' and make the HDF5 file chunked. – Mathias711 Mar 23 '15 at 11:43
  • http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage – Joe Doherty Mar 23 '15 at 11:45

1 Answer

According to the documentation, you can use create_dataset to create a chunked array stored in the HDF5 file. Example:

>>> import h5py
>>> f = h5py.File('test.h5', 'w')
>>> arr = f.create_dataset('mydata', (2**32,), chunks=True)
>>> arr
<HDF5 dataset "mydata": shape (4294967296,), type "<f4">
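
The default dtype is 32-bit float, as the repr above shows. If you need a different dtype or want to pick the chunk size yourself, create_dataset also accepts a dtype and an explicit chunks tuple. A minimal sketch, assuming you want float64; the dataset name and the 2**20 chunk size below are just illustrative, not tuned values:

>>> arr64 = f.create_dataset('mydata_f8', (2**32,), dtype='f8', chunks=(2**20,))
>>> arr64.chunks
(1048576,)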

Slicing the HDF5 dataset returns NumPy arrays.

>>> arr[:10]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)
>>> type(arr[:10])
numpy.ndarray

You can set values just as you would for a NumPy array.

>>> arr[3:5] = 3
>>> arr[:6]
array([ 0.,  0.,  0.,  3.,  3.,  0.], dtype=float32)

I don't know if this is the most efficient way, but you can iterate over the whole array in chunks, for instance to fill it with random values:

>>> import numpy as np
>>> for i in range(0, arr.size, arr.chunks[0]):
        arr[i: i+arr.chunks[0]] = np.random.randn(arr.chunks[0])
>>> arr[:5]
array([ 0.62833798,  0.03631227,  2.00691652, -0.16631022,  0.07727782], dtype=float32)
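
Reading works the same way, so you can also process the data chunk by chunk without ever loading the whole array into memory. As a rough sketch (it just mirrors the write loop above, and the running float64 sum is only one way to do it), you could compute the mean like this:

>>> total = 0.0
>>> for i in range(0, arr.size, arr.chunks[0]):
        total += arr[i: i + arr.chunks[0]].sum(dtype='f8')
>>> total / arr.size

Since the array was just filled with standard-normal values, the result should come out close to zero.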
– RickardSjogren