Is there a way to get datasets in all groups at once in h5py?

Question

I have data stored in .h5. I use the following code to display group names and also call one of the groups (Event_[0]) to see what's inside:

with h5py.File(data_path, 'r') as f:
    ls = list(f.keys())
    print('List of datasets: \n', ls)
    data = f.get('group_1')
    dataset1 = np.array(data)
    print('Shape of dataset1: \n', dataset1.shape)
    f.close()

It works fine but I have like 2000 groups with one dataset each. How can I avoid writing the same code for every single group? Is there maybe a way to get('all groups')?

EDIT: one more example: I use

f['Event_[0]'][()]

to see one group. Can this be also applied for multiple groups?

is there a reason you don't use pandas? Can you provide an example of your data and the groups you are creating? — Andreas, Aug 08 '20 at 14:55
@Andreas, this isn't about `pandas` style grouping. Here `group` is a level in the file data hierarchy. — hpaulj, Aug 08 '20 at 15:10
ahh, I mean I am no h5 specialist, but the documentation states that "an HDF5 group is a structure containing zero or more HDF5 objects. A group has two parts: A group header, which contains a group name and a list of group attributes." What happens if you load it via pandas? Is the group name not shown, e.g. as a column or anything? — Andreas, Aug 08 '20 at 15:15
@Andreas, I wish it was that simple; can't use pandas. The file contains: 1 folder that has over 2000 groups (keys). I want to display what's inside all together. But I only found how to do it for just ONE group. — Brainiac, Aug 08 '20 at 15:20
@Andreas, `pandas` uses a different interface to `HDF5` files, `pytables`. — hpaulj, Aug 08 '20 at 15:30

hpaulj · Accepted Answer · 2020-08-08T15:41:53.163

Just iterate on the list of keys:

with h5py.File(data_path, 'r') as f:
    alist = []
    ls = list(f.keys())
    print('List of datasets: \n', ls)
    for key in ls:
        group = f.get(key)
        # datasetname is a placeholder: the name of the dataset inside each group
        dataset = group.get(datasetname)[:]
        print('Shape of dataset: \n', dataset.shape)
        alist.append(dataset)
    # don't need f.close() in a with

There isn't an allgroups method. There are iter and visit methods, but they end up doing the same thing: for each group in the file, fetch the desired dataset. The h5py docs should be complete, without hidden methods. visit is recursive, and similar to the Python os functionality for visiting directories and files.

In h5py the file and groups behave like Python dicts. It's the dataset that behaves like a numpy array.
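As a rough sketch of that dict-like behaviour, the loop above can be written with items(). This assumes each group holds exactly one dataset whose name is not known in advance; the file name and group names here are made up for the demo.

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small throwaway file: three groups with one dataset each.
# (File, group, and dataset names are hypothetical.)
demo_path = os.path.join(tempfile.mkdtemp(), "demo_groups.h5")
with h5py.File(demo_path, "w") as f:
    for i in range(3):
        grp = f.create_group(f"group_{i}")
        grp.create_dataset(f"data_{i}", data=np.full((2, 4), i))

# Files and groups support keys(), values(), and items(), like a dict.
arrays = {}
with h5py.File(demo_path, "r") as f:
    for group_name, group in f.items():
        (dataset,) = group.values()        # assumption: exactly one dataset per group
        arrays[group_name] = dataset[()]   # [()] reads the dataset into a NumPy array

for group_name, arr in arrays.items():
    print(group_name, arr.shape)
```

The arrays are read inside the with block, so they remain usable after the file is closed.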

score 0 · Answer 2 · answered Aug 08 '20 at 19:22

If you know you will always have this data schema, you can work with the keys (as shown in the previous answer). That implies only Groups at the root level, and Datasets are the only objects under each Group. The "visitor" functions are very handy when you don't know the exact contents of the file.

There are 2 visitor functions: visit() and visititems(). Each recursively traverses the object tree, calling the visitor function for each object. The only difference is that the callable for visit receives 1 value, name, while the callable for visititems receives 2 values: name and node (an h5py object). The name is the object's path relative to the group you start from, NOT its full pathname. I prefer visititems for 2 reasons: 1) having the node object allows you to do tests on the object type (as shown below), and 2) determining the full pathname otherwise requires that you already know the path; with the node you can use the object's name attribute to get it.
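To make the difference between the two callbacks concrete, here is a minimal sketch. The file layout (outer/inner/values) is invented for illustration, and list.append is used as the visitor simply because it returns None, which lets the traversal continue.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical nested layout: /outer/inner/values
path = os.path.join(tempfile.mkdtemp(), "visit_demo.h5")
with h5py.File(path, "w") as f:
    inner = f.create_group("outer").create_group("inner")
    inner.create_dataset("values", data=np.arange(5))

names_from_root = []
names_from_outer = []
absolute_paths = []

with h5py.File(path, "r") as f:
    # visit(): the callable receives only the name, relative to the start group.
    f.visit(names_from_root.append)
    f["outer"].visit(names_from_outer.append)
    # visititems(): the callable also receives the object; node.name is the full path.
    f["outer"].visititems(lambda name, node: absolute_paths.append(node.name))

print(names_from_root)
print(names_from_outer)
print(absolute_paths)
```

Starting from a subgroup shortens the names passed to the callback, while node.name stays anchored at the file root.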

The example below creates a simple HDF5 file, creates a few groups and datasets, then closes the file. It then reopens the file in read mode and uses visititems() to traverse the file object tree. (Note: the visitor function can have any name, and visititems() can be called on any group, not just the file. It traverses recursively from that point in the file structure.)

Also, you don't need f.close() when you use the with / as: construct.

import h5py
import numpy as np

def visit_func(name, node):
    print('Full object pathname is:', node.name)
    if isinstance(node, h5py.Group):
        print('Object:', name, 'is a Group\n')
    elif isinstance(node, h5py.Dataset):
        print('Object:', name, 'is a Dataset\n')
    else:
        print('Object:', name, 'is an unknown type\n')

arr = np.arange(100).reshape(10, 10)

with h5py.File('SO_63315196.h5', 'w') as h5w:
    for cnt in range(3):
        grp = h5w.create_group('group_' + str(cnt))
        grp.create_dataset('data_' + str(cnt), data=arr)

with h5py.File('SO_63315196.h5', 'r') as h5r:
    h5r.visititems(visit_func)
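
To tie this back to the original question, the same visitor pattern can collect every dataset instead of printing it. This is only a sketch: collect_datasets is a hypothetical helper name, and it reads each dataset fully into memory, which may not suit very large files.

```python
import os
import tempfile

import h5py
import numpy as np

def collect_datasets(h5_group):
    """Return {full path: NumPy array} for every dataset below h5_group."""
    found = {}

    def visitor(name, node):
        if isinstance(node, h5py.Dataset):
            found[node.name] = node[()]   # read the whole dataset into memory
        # Returning None (implicitly) tells visititems() to keep going.

    h5_group.visititems(visitor)
    return found

# Same layout as the example above, written to a temporary location.
path = os.path.join(tempfile.mkdtemp(), "collect_demo.h5")
with h5py.File(path, "w") as h5w:
    for cnt in range(3):
        grp = h5w.create_group(f"group_{cnt}")
        grp.create_dataset(f"data_{cnt}", data=np.arange(100).reshape(10, 10))

with h5py.File(path, "r") as h5r:
    datasets = collect_datasets(h5r)

for ds_path, arr in datasets.items():
    print(ds_path, arr.shape)
```

Because the helper checks the node type, it should also cope with files where groups are nested or where some groups hold more than one dataset.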
