I've got a question about how best to write to HDF5 files with Python / h5py.

I have data like:

| timepoint | voltage1 | voltage2 | ...
|-----------|----------|----------|-----
| 178       | 10       | 12       | ...
| 179       | 12       | 11       | ...
| 185       | 9        | 12       | ...
| 187       | 15       | 12       | ...
| ...       | ...      | ...      | ...

with about 10^4 columns and about 10^7 rows. (That's about 10^11 (100 billion) elements, or ~100 GB with 1-byte ints.)

With this data, typical use is pretty much write once, read many times, and the typical read case would be to grab column 1 and another column (say 254), load both columns into memory, and do some fancy statistics.

I think a good HDF5 structure would thus be to have each column of the table above be its own HDF5 dataset, resulting in 10^4 datasets. That way we won't need to read all the data into memory just to get at two columns, yes? The HDF5 structure isn't defined yet, so it can be anything.
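
To make the read pattern concrete, I'd expect it to look roughly like this (a sketch, assuming one 1-D dataset per column named voltage1, voltage2, ..., and a made-up file name):

import h5py
import numpy as np

# Open read-only; only the two requested columns are pulled into memory.
with h5py.File('data.h5', 'r') as f:
    col_a = f['voltage1'][:]      # reads just this dataset from disk
    col_b = f['voltage254'][:]

# ... some fancy statistics on the two columns ...
print(np.corrcoef(col_a, col_b))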

Now the question: I receive the data ~10^4 rows at a time (and not exactly the same number of rows each time), and need to write it incrementally to the HDF5 file. How do I write that file?

I'm considering Python and h5py, but could use another tool if recommended. Is chunking the way to go, with e.g.

dset = f.create_dataset("voltage284", (100000,), maxshape=(None,), dtype='i8', chunks=(10000,))

and then when another block of 10^4 rows arrives, replace the dataset?

Or is it better to just store each block of 10^4 rows as a separate dataset? Or do I really need to know the final number of rows? (That'll be tricky to get, but maybe possible).

I can bail on HDF5 if it's not the right tool for the job, though I think that once the awkward writes are done, it'll be wonderful.

user116293

2 Answers

Per the FAQ, you can expand the dataset using dset.resize. For example,

import os
import h5py
import numpy as np

path = '/tmp/out.h5'
if os.path.exists(path):    # avoid FileNotFoundError on the first run
    os.remove(path)

with h5py.File(path, "a") as f:
    # Resizable 1-D dataset: initial length 10**5, unlimited max length,
    # stored on disk in chunks of 10**4 elements.
    dset = f.create_dataset('voltage284', (10**5,), maxshape=(None,),
                            dtype='i8', chunks=(10**4,))
    # Use integer data to match dtype='i8' (floats would be cast to 0).
    dset[:] = np.random.randint(0, 100, size=dset.shape)
    print(dset.shape)
    # (100000,)

    for i in range(3):
        # Grow the dataset by 10**4 rows, then fill the new tail.
        dset.resize(dset.shape[0] + 10**4, axis=0)
        dset[-10**4:] = np.random.randint(0, 100, size=10**4)
        print(dset.shape)
        # (110000,)
        # (120000,)
        # (130000,)
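
If you want something closer to append semantics, the resize-then-assign pattern wraps naturally into a small helper (a sketch; append_rows is just an illustrative name, not an h5py API):

def append_rows(dset, arr):
    # Grow a resizable 1-D dataset by len(arr) and write arr into the tail.
    n = arr.shape[0]
    dset.resize(dset.shape[0] + n, axis=0)
    dset[-n:] = arr

Each incoming block of rows then becomes a single append_rows(dset, block) call, regardless of the block's exact length.
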
unutbu
  • Is dtype='i8' a thing? I think 'int8' is 8 bits, but i8 seems to be larger. – user116293 Sep 09 '14 at 18:11
  • `i8` is an 8-byte int. You can check the byte size using `np.dtype('i8').itemsize`. If you want 1-byte ints, use `np.int8` (aka `'i1'`). – unutbu Sep 09 '14 at 18:15
  • What does the notation `dset[-10**4:]` mean? Does it mean that `np.random.random(10**4)` is assigned to the last `10**4` positions of the dataset? – nbro Sep 27 '19 at 21:08
  • @nbro: That's correct. See [Understanding slice notation](https://stackoverflow.com/q/509211/190597). – unutbu Sep 28 '19 at 01:24
  • dset is of type Dataset, which is specific to h5py. Why can't I perform this operation on a Dataset without being concerned with the underlying representation (numpy)? It seems the implementation should be hidden from the user, and that something like `Dataset.cat(dset, dset2)` or `dset.append(dset2)` should be a standard function. – demongolem Mar 20 '20 at 11:32

As @unutbu pointed out, dset.resize is an excellent option. It may also be worthwhile to look at pandas and its HDF5 support, which could fit your workflow well. HDF5 sounds like a reasonable choice given your needs, but your problem may be expressed better with an additional layer on top.
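
For instance, here is a minimal sketch of an append-style workflow with pandas' HDFStore (this requires PyTables; the file and column names are illustrative):

import numpy as np
import pandas as pd

# Each incoming block of ~10**4 rows becomes a DataFrame...
block = pd.DataFrame({'timepoint': np.arange(10**4),
                      'voltage1': np.random.randint(0, 100, size=10**4)})

# ...and is appended to a table-format store, which supports growing.
with pd.HDFStore('/tmp/out_pandas.h5') as store:
    store.append('data', block, format='table')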

One big thing to consider is the orientation of the data. If you're primarily interested in reads, and you primarily fetch data by column, then you may want to transpose the data so that those reads happen along rows, since HDF5 stores data in row-major order.
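
Chunking gives you a related knob without physically transposing: if you keep a single 2-D dataset, a chunk shape spanning many rows of one column means a column read only touches that column's chunks. A sketch (names and sizes illustrative, scaled down from the question's dimensions; real chunk sizes would need tuning):

import h5py

with h5py.File('/tmp/out2d.h5', 'w') as f:
    # Each chunk covers 10**4 rows of a single column, so reading one
    # column pulls only that column's chunks from disk.
    dset = f.create_dataset('voltages', shape=(10**6, 100),
                            maxshape=(None, 100), dtype='i1',
                            chunks=(10**4, 1))
    col = dset[:, 42]   # loads only column 42, not the whole array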

daniel