I have an HDF5 file with 20 datasets, each with 200 rows of compound dtype ('<r4', '<r4', '<i4'), where each component of the dtype represents a 1-D variable. It takes about 2 seconds to open each file and assign each component to its own variable, which seems remarkably slow to me. I'm using h5py and numpy to open the file and read it into numpy arrays:

import numpy as np
import h5py
...
f = h5py.File("foo.hdf5", "r")
set1 = f["foo/bar"]
var1 = np.asarray([row[0] for row in set1])
var2 = np.asarray([row[1] for row in set1])
var3 = np.asarray([row[2] for row in set1])

Is there a faster way to extract the variables from these datasets?

Here is a screenshot of one of the datasets in HDFView: [screenshot: hdfview]

DavidH
  • Have you tried `var1 = set1[:,0]`? If `set1` is a `dataset` you can index it in almost the same way(s) as a numpy array. http://docs.h5py.org/en/latest/high/dataset.html – hpaulj Feb 03 '16 at 19:26
  • Yes, and I receive the error "TypeError: Argument sequence too long". Additionally, if I try `set1 = f["foo/bar"][...]` followed by `var1 = set1[:,0]` I receive "IndexError: too many indices for array" – DavidH Feb 03 '16 at 19:37
  • What is `set1.shape`? Was the first `set1` a `group` rather than a `dataset`? I don't know the data structure in your file. You have to explore that yourself. – hpaulj Feb 03 '16 at 19:50
  • `set1.shape` returns (200,) – DavidH Feb 03 '16 at 20:07
  • You could try using PyTables (http://www.pytables.org/). It has a built-in NumPy interface and it's optimized for speed. – Dietrich Feb 03 '16 at 20:20
  • Then `set1` is 1d, all rows and no columns. But it might have a complex `dtype`, the `h5py` equivalent of a numpy structured array. I'll have to check the documentation. `set1.dtype`? – hpaulj Feb 03 '16 at 20:32
  • http://stackoverflow.com/a/33249726/901925 is an h5py SO question about structured arrays. – hpaulj Feb 03 '16 at 20:40
  • It does indeed have a complex datatype `(' – DavidH Feb 03 '16 at 21:07
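
The two errors reported in the comments are consistent with a one-dimensional compound dataset: with shape (200,), there is no column axis for a second index to select. The sketch below reproduces this with NumPy alone. It assumes the question's '<r4' components are 4-byte floats ('<f4' in NumPy), and the field names var1, var2, var3 are hypothetical stand-ins for whatever the file actually uses.

```python
import numpy as np

# Assumed layout: field names are hypothetical; the dtypes are
# two 4-byte floats and one 4-byte integer per record.
dt = np.dtype([("var1", "<f4"), ("var2", "<f4"), ("var3", "<i4")])
arr = np.zeros(200, dtype=dt)
arr["var1"] = np.arange(200, dtype="<f4")

print(arr.shape)  # one-dimensional: 200 records, no column axis

# A second positional index fails because the array has only one axis.
try:
    arr[:, 0]
except IndexError as exc:
    print("IndexError:", exc)
    two_index_failed = True
else:
    two_index_failed = False

# Fields are selected by name instead of by column position.
var1 = arr["var1"]
print(var1.dtype, var1.shape)
```

Printing `set1.dtype.names` on the real dataset should show which field names to use in place of the hypothetical ones.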

1 Answer

A much faster way (~0.05 seconds) is to read the whole dataset into a NumPy array once and then reference the fields by name:

import numpy as np
import h5py
...
f = h5py.File("foo.hdf5", "r")
set1 = np.asarray(f["foo/bar"])
var1 = set1["var1"]
var2 = set1["var2"]
var3 = set1["var3"]
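
For anyone who wants to try this without the original file, here is a self-contained sketch of the same approach. It builds a small compound dataset in an in-memory HDF5 file and reads it back. The field names, the sample values, and the file name demo.hdf5 are assumptions for illustration, and '<f4' is assumed for the float components.

```python
import numpy as np
import h5py

# Hypothetical compound dtype and sample data (not from the original file).
dt = np.dtype([("var1", "<f4"), ("var2", "<f4"), ("var3", "<i4")])
records = np.zeros(200, dtype=dt)
records["var1"] = np.linspace(0.0, 1.0, 200)
records["var2"] = np.linspace(1.0, 2.0, 200)
records["var3"] = np.arange(200)

# The core driver with backing_store=False keeps the file in memory only.
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("foo/bar", data=records)

    # One read of the whole dataset into a structured NumPy array.
    set1 = f["foo/bar"][...]

    # h5py can also read a single field by indexing the dataset with its name.
    only_var3 = f["foo/bar"]["var3"]

# Field access on the in-memory array involves no further file I/O.
var1, var2, var3 = set1["var1"], set1["var2"], set1["var3"]
print(var1.shape, var2.dtype, var3.dtype)
```

The row-by-row list comprehension in the question iterates over the h5py dataset itself, so each row is fetched separately; reading once and slicing by field name avoids that.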
DavidH