What is a fast way in Python to read HDF5 compound dtype arrays?

Question

I have an HDF5 file with 20 datasets, each with 200 rows of a compound dtype ('<f4', '<f4', '<i4'), where each component of the dtype represents a 1-D variable. It takes about 2 seconds to open each file and assign each component to its own variable, which seems remarkably slow to me. I'm using h5py and numpy to open the file and read it into numpy arrays:

import numpy as np
import h5py
...
f = h5py.File("foo.hdf5", "r")
set1 = f["foo/bar"]
var1 = np.asarray([row[0] for row in set1])
var2 = np.asarray([row[1] for row in set1])
var3 = np.asarray([row[2] for row in set1])

Is there a faster way to extract the variables from these datasets?

Here is a screenshot of one of the datasets in HDFView: [image: hdfview]

Have you tried `var1 = set1[:,0]`? If `set1` is a `dataset` you can index it in almost the same way(s) as a numpy array. http://docs.h5py.org/en/latest/high/dataset.html — hpaulj, Feb 03 '16 at 19:26
Yes, and I receive the error "TypeError: Argument sequence too long". Additionally, if I try `set1 = f["foo/bar"][...]` followed by `var1 = set1[:,0]` I receive "IndexError: too many indices for array" — DavidH, Feb 03 '16 at 19:37
What is `set1.shape`? Was the first `set1` a `group` rather than a `dataset`? I don't know the data structure in your file. You have to explore that yourself. — hpaulj, Feb 03 '16 at 19:50
You could try using PyTables (http://www.pytables.org/). It has a built-in NumPy interface and it's optimized for speed. — Dietrich, Feb 03 '16 at 20:20
Then `set1` is 1d, all rows and no columns. But it might have a complex `dtype`, the `h5py` equivalent of a numpy structured array. I'll have to check the documentation. `set1.dtype`? — hpaulj, Feb 03 '16 at 20:32
http://stackoverflow.com/a/33249726/901925 is an h5py SO question about structured arrays. — hpaulj, Feb 03 '16 at 20:40

score 3 · Accepted Answer · answered Feb 04 '16 at 16:44

A much faster way (~0.05 seconds) is to read the whole dataset into a NumPy array in a single call and then reference the fields by name, rather than iterating over the dataset row by row:

import numpy as np
import h5py
...
f = h5py.File("foo.hdf5", "r")
set1 = np.asarray(f["foo/bar"])
var1 = set1["var1"]
var2 = set1["var2"]
var3 = set1["var3"]
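For anyone who wants to try this without the original file, here is a minimal sketch that builds a stand-in dataset and compares the two approaches. The field names `var1`, `var2`, `var3` come from the answer above; the values, the file name `demo.hdf5`, and the in-memory `core` driver are assumptions made only for this example. It also shows `dtype.names`, which lists the field names if you do not know them, and h5py's field-name indexing on the dataset itself, which reads a single field.

```python
import numpy as np
import h5py

# Stand-in data: the values are made up for this sketch.
dt = np.dtype([("var1", "<f4"), ("var2", "<f4"), ("var3", "<i4")])
rows = np.zeros(200, dtype=dt)
rows["var1"] = np.arange(200, dtype="<f4")
rows["var2"] = np.arange(200, dtype="<f4") * 0.5
rows["var3"] = np.arange(200, dtype="<i4")

# In-memory file (core driver, nothing written to disk) standing in for foo.hdf5.
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("foo/bar", data=rows)

    # Field names of the compound dtype, useful when they are unknown.
    field_names = dset.dtype.names

    # Slow: iterating the dataset reads it one row at a time.
    slow_var1 = np.asarray([row[0] for row in dset])

    # Fast: one read of the whole dataset, then select fields by name.
    data = dset[...]
    var1, var2, var3 = data["var1"], data["var2"], data["var3"]

    # Alternative: ask h5py for a single field directly.
    only_var3 = dset["var3"]
```

Both approaches give the same arrays; the difference is that the list comprehension issues one read per row, while `dset[...]` issues one read for the whole dataset.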
