
Is there any way to store an array in an HDF5 file when it is too big to load into memory?

If I do something like this:

import h5py
import numpy as np
f = h5py.File('test.hdf5','w')
f['mydata'] = np.zeros(2**32)

I get a MemoryError.

– Sounak

  • Take a look at [hyperslabs](http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage). It is possible, but you should write in 'chunks' and make the HDF5 file chunked. – Mathias711 Mar 23 '15 at 11:43
  • http://docs.h5py.org/en/latest/high/dataset.html#chunked-storage – Joe Doherty Mar 23 '15 at 11:45

1 Answer

According to the documentation, you can use create_dataset to create a chunked array stored in the HDF5 file. Example:

>>> import h5py
>>> f = h5py.File('test.h5', 'w')
>>> arr = f.create_dataset('mydata', (2**32,), chunks=True)
>>> arr
<HDF5 dataset "mydata": shape (4294967296,), type "<f4">
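
The default dtype is 32-bit float, as the repr above shows. If you need a different dtype or want to pick the chunk size yourself, create_dataset also accepts a dtype and an explicit chunks tuple. A minimal sketch, assuming you want float64; the dataset name and the 2**20 chunk size below are just illustrative, not tuned values:

>>> arr64 = f.create_dataset('mydata_f8', (2**32,), dtype='f8', chunks=(2**20,))
>>> arr64.chunks
(1048576,)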

Slicing the HDF5 dataset returns NumPy arrays.

>>> arr[:10]
array([ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.], dtype=float32)
>>> type(arr[:10])
numpy.ndarray

You can set values just as you would for a NumPy array.

>>> arr[3:5] = 3
>>> arr[:6]
array([ 0.,  0.,  0.,  3.,  3.,  0.], dtype=float32)

I don't know if this is the most efficient way, but you can iterate over the whole array in chunks, for instance to fill it with random values:

>>> import numpy as np
>>> for i in range(0, arr.size, arr.chunks[0]):
        arr[i: i+arr.chunks[0]] = np.random.randn(arr.chunks[0])
>>> arr[:5]
array([ 0.62833798,  0.03631227,  2.00691652, -0.16631022,  0.07727782], dtype=float32)
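
Reading works the same way, so you can also process the data chunk by chunk without ever loading the whole array into memory. As a rough sketch (it just mirrors the write loop above, and the running float64 sum is only one way to do it), you could compute the mean like this:

>>> total = 0.0
>>> for i in range(0, arr.size, arr.chunks[0]):
        total += arr[i: i + arr.chunks[0]].sum(dtype='f8')
>>> total / arr.size

Since the array was just filled with standard-normal values, the result should come out close to zero.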
– RickardSjogren