6

I have several groups in my h5 file: 'group1', 'group2', ... and each group has 3 different datasets: 'dataset1', 'dataset2', 'dataset3'. All of them are arrays of numerical values, but the arrays differ in size.

My goal is to load each dataset from each group into a numpy array.

Example:

import h5py
filename = '../Results/someFileName.h5'
data = h5py.File(filename, 'r')

Now I can easily iterate over all groups with

for i in range(len(data.keys())):
    group = list(data.keys())[i]

but I can't figure out how to access the datasets within the group. So I am looking for something like MATLAB:

hinfo = h5info(filename);
for i = 1:length(hinfo.Groups())
     datasetname = [hinfo.Groups(i).Name '/dataset1'];
     dset = h5read(filename, datasetname);
end

Where dset is now an array of numbers.

Is there a way I could do the same with h5py?

kcw78
skrat

2 Answers

15

You have the right idea, but you don't need to loop over range(len(data.keys())). Iterate over data.keys() directly; it is an iterable of object names. Try this:

import h5py
filename = '../Results/someFileName.h5'
data = h5py.File(filename, 'r')
for group in data.keys():
    print(group)
    for dset in data[group].keys():
        print(dset)
        ds_data = data[group][dset]  # returns HDF5 dataset object
        print(ds_data)
        print(ds_data.shape, ds_data.dtype)
        arr = data[group][dset][:]  # adding [:] returns a numpy array
        print(arr.shape, arr.dtype)
        print(arr)

Note: the logic above is valid ONLY when the top level contains nothing but groups (no datasets). It does not check whether each object is a group or a dataset.
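One way to drop that assumption is to test each object with isinstance() before reading it. The following is a minimal sketch, not a definitive implementation: the helper name load_datasets and the demo file (with its group and dataset names) are made up for illustration, and it only looks one level below the top-level groups.

```python
import os
import tempfile

import h5py
import numpy as np


def load_datasets(filename):
    """Return {group_name: {dataset_name: ndarray}} for top-level groups.

    Hypothetical helper: objects that are not groups (or not datasets
    inside a group) are skipped instead of raising an error.
    """
    arrays = {}
    with h5py.File(filename, "r") as h5f:
        for group_name, obj in h5f.items():
            if not isinstance(obj, h5py.Group):
                continue  # skip datasets stored at the top level
            arrays[group_name] = {
                name: item[()]  # [()] reads the whole dataset into memory
                for name, item in obj.items()
                if isinstance(item, h5py.Dataset)
            }
    return arrays


# Build a small throwaway file so the sketch can be run as-is.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as h5f:
    h5f.create_dataset("top_level", data=np.arange(3))  # not inside a group
    for i, n in enumerate((4, 6), start=1):
        grp = h5f.create_group(f"group{i}")
        grp.create_dataset("dataset1", data=np.arange(n))
        grp.create_dataset("dataset2", data=np.ones((n, 2)))

result = load_datasets(demo_path)
print(sorted(result))
print(result["group2"]["dataset1"].shape)
```

Because the file is opened in a with block, it is closed as soon as the arrays have been copied into memory.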

To avoid these assumptions/limitations, you should investigate .visititems() or write a generator to recursively visit objects. The first two linked answers below show .visititems() usage, and the third uses a generator function:

  1. Use visititems(-function-) to loop recursively
    This example uses isinstance() as the test. The object is a Group when it tests true for h5py.Group and is a Dataset when it tests true for h5py.Dataset. I consider this more Pythonic than the second example below (IMHO).
  2. Convert hdf5 to raw organised in folders
    This example checks the number of objects below the visited object. When there are no subgroups, it is a dataset; when there are subgroups, it is a group.
  3. How can I combine multiple .h5 file?
    This question has multiple answers. The linked answer uses a generator to merge data from several files with several groups and datasets into a single file.
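As a rough illustration of the .visititems() approach with the isinstance() test, here is a short sketch. It is not the code from the linked answers: the names collect_datasets and visitor, and the demo file layout, are assumptions made for this example.

```python
import os
import tempfile

import h5py
import numpy as np


def collect_datasets(filename):
    """Return {path_within_file: ndarray} for every dataset at any depth."""
    found = {}

    def visitor(name, obj):
        # visititems() calls this once per object below the starting group;
        # `name` is the path relative to that group.
        if isinstance(obj, h5py.Dataset):
            found[name] = obj[()]  # read the full dataset into memory
        # Returning None tells h5py to keep visiting.

    with h5py.File(filename, "r") as h5f:
        h5f.visititems(visitor)
    return found


# Throwaway demo file with a nested group (names are illustrative only).
demo_path = os.path.join(tempfile.mkdtemp(), "nested.h5")
with h5py.File(demo_path, "w") as h5f:
    h5f.create_dataset("group1/dataset1", data=np.arange(5))
    h5f.create_dataset("group1/sub/deep", data=np.zeros((2, 3)))
    h5f.create_dataset("group2/dataset1", data=np.arange(7))

all_arrays = collect_datasets(demo_path)
for path in sorted(all_arrays):
    print(path, all_arrays[path].shape)
```

The keys of the returned dictionary are full paths such as 'group1/dataset1', so datasets at different depths do not collide.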
kcw78
    @skrat, I modified my original post to fix a small (but significant) error retrieving data. When you use `arr = h5f[group][dset]`, you get an **HDF5 dataset object** (not a numpy array). In many ways the object behaves like an array (you can slice, etc). However, not all numpy array methods will work on a dataset object (`.reshape()` is an example). If you need a numpy array, add a range using numpy index notation (`[:]` for the entire dataset in my code). You can slice to get a subset of data as an array. Note: a limited subset of fancy indexing is supported. Read the h5py docs for details. – kcw78 May 18 '19 at 13:54
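The difference between the dataset object and a numpy array can be checked with a short sketch like the one below. The file and dataset names are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Throwaway demo file; intermediate groups are created automatically.
demo_path = os.path.join(tempfile.mkdtemp(), "dset_vs_array.h5")
with h5py.File(demo_path, "w") as h5f:
    h5f.create_dataset("group1/dataset1", data=np.arange(12))

with h5py.File(demo_path, "r") as h5f:
    dset = h5f["group1/dataset1"]           # HDF5 dataset object
    dset_is_array = isinstance(dset, np.ndarray)
    arr = dset[:]                           # whole dataset as a numpy array
    part = dset[2:5]                        # a slice is also a numpy array
    reshaped = arr.reshape(3, 4)            # ndarray methods work on arr

print(dset_is_array, type(arr).__name__, part.tolist(), reshaped.shape)
```

Note that arr, part, and reshaped remain usable after the file is closed, whereas the dataset object does not.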
2

This method requires that dataset names, 'dataset1', 'dataset2', 'dataset3', etc., be the same in each of the hdf5 groups of one hdf5 file.

import h5py
import numpy as np

# create empty lists
lat = []
lon = []
x = []
y = []

# fill lists creating numpy arrays
h5f = h5py.File('filename.h5', 'r') # read file
for group in h5f.keys(): # iterate through groups
    # read each named dataset once per group (an inner loop over the
    # group's keys would append the same data several times)
    lat = np.append(lat, h5f[group]['lat'][()]) # append data
    lon = np.append(lon, h5f[group]['lon'][()])
    x = np.append(x, h5f[group]['x'][()])
    y = np.append(y, h5f[group]['y'][()])
h5f.close()
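Since np.append copies the whole array on every call, a possible variation is to gather the pieces in Python lists and join them once with np.concatenate. This is a sketch under the same assumptions as above (only groups at the top level, and the same dataset names in every group); the helper name concat_by_name and the demo file are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def concat_by_name(filename, names):
    """Return {dataset_name: 1-D ndarray} joined across all top-level groups."""
    chunks = {name: [] for name in names}
    with h5py.File(filename, "r") as h5f:
        for group in h5f.values():  # assumes every top-level object is a group
            for name in names:
                # np.ravel flattens each piece, matching np.append's default
                chunks[name].append(np.ravel(group[name][()]))
    # One concatenate per name; raises ValueError if the file has no groups.
    return {name: np.concatenate(parts) for name, parts in chunks.items()}


# Throwaway demo file: two groups whose arrays differ in length.
demo_path = os.path.join(tempfile.mkdtemp(), "coords.h5")
with h5py.File(demo_path, "w") as h5f:
    for group_name, n, offset in (("group1", 3, 0), ("group2", 2, 10)):
        grp = h5f.create_group(group_name)
        grp.create_dataset("lat", data=np.arange(n) + offset)
        grp.create_dataset("lon", data=np.arange(n) * 2.0)

merged = concat_by_name(demo_path, ["lat", "lon"])
print(merged["lat"].shape, merged["lon"].shape)
```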