How can I check whether a dataset exists using something like a regex, without first reading the paths of all datasets?

For example, I want to check if a dataset 'completed' exists in a file that may (or may not) contain

/123/completed

(Suppose that I do not know the complete path a priori; I just want to check for a dataset name. So this answer will not work in my case.)

jpp
Tom de Geus

2 Answers

1

Custom recursion

No need for regex. You can build a set of dataset names by recursively traversing the groups in your HDF5 file:

import h5py

def traverse_datasets(hdf_file):

    """Traverse all datasets across all groups in HDF5 file."""

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = '{}/{}'.format(prefix, key)
            if isinstance(item, h5py.Dataset): # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group): # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    with h5py.File(hdf_file, 'r') as f:
        for (path, dset) in h5py_dataset_iterator(f):
            yield path.split('/')[-1]

all_datasets = set(traverse_datasets('file.h5'))

Then just check for membership: 'completed' in all_datasets.

Group.visit

Alternatively, you can use Group.visit. Note that your search function must return None for the traversal to continue through all groups; any other return value stops it.

res = []

def searcher(name, k='completed'):
    """Collect the names of all objects (groups or datasets) with k anywhere in the path."""
    if k in name:
        res.append(name)
    return None  # returning None lets visit() continue with the next object

with h5py.File('file.h5', 'r') as f:
    group = f['/']
    group.visit(searcher)

print(res)  # print list of object names matching criterion

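Complexity is O(n) in both cases, where n is the number of objects (groups and datasets) in the file, not the amount of data stored in them. You need to test the name of each object, but nothing more. The first option may be preferable if you need a lazy solution.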
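Since the question asks for a regex and only needs a yes/no answer, the two ideas above can be combined. The following is a minimal sketch, not part of the original answer: it uses Group.visititems, which passes both the name and the object to the callback, and it stops at the first match by returning a non-None value. The function name find_dataset and its parameters are hypothetical.

```python
import re
import h5py


def find_dataset(filename, pattern):
    """Return the path of the first dataset whose final name component
    fully matches `pattern`, or None if there is no such dataset.

    `find_dataset` is a hypothetical helper, not an h5py function.
    """
    regex = re.compile(pattern)

    def match(name, obj):
        # `name` is the path relative to the group being visited.
        basename = name.rsplit('/', 1)[-1]
        if isinstance(obj, h5py.Dataset) and regex.fullmatch(basename):
            return name  # any non-None value stops the traversal
        return None      # None means: continue with the next object

    with h5py.File(filename, 'r') as f:
        return f.visititems(match)
```

Usage would be along the lines of `find_dataset('file.h5', r'completed') is not None`. The traversal still has to walk objects until a match is found, so the worst case remains O(n) in the number of objects.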

jpp
  • Thanks! Can you comment about the complexity of this operation? I.e. how fast is it? (I ask because I did something similar before, and the time increased with size of the file, i.e. the number of entries in each datasets, rather than with the number of datasets.) – Tom de Geus Jun 06 '18 at 12:48
  • Also, I found the `visit` function as part of the library ([see documentation](http://docs.h5py.org/en/latest/high/group.html#Group.visit)). Would that be more efficient / less efficient / comparable (besides that it would programmatically be less nice) – Tom de Geus Jun 06 '18 at 12:51
  • @TomdeGeus, That seems to work too. I updated with an example. – jpp Jun 06 '18 at 14:08
0

Recursion to Find All Valid Paths to dataset(s)

The following code uses recursion to find valid paths to all datasets. After getting the valid paths (possible circular references are terminated after 3 repeats), I can then apply a regular expression to the returned list (not shown).

import numpy as np
import h5py
import collections
import warnings


def visit_data_sets(group, max_len_check=20, max_repeats=3):
    # print(group.name)
    # print(list(group.items()))

    if len(group.name) > max_len_check:
        # This section terminates a likely circular reference once the current name
        # has repeated max_repeats times at a regular interval. However, it will
        # incorrectly terminate a tree if identical repetitive sequences of names are
        # actually used in the tree.
        name_list = group.name.split('/')
        current_name = name_list[-1]
        res_list = [i for i in range(len(name_list)) if name_list[i] == current_name]
        res_deq = collections.deque(res_list)
        res_deq.rotate(1)
        res_deq2 = collections.deque(res_list)
        diff = [res_deq2[i] - res_deq[i] for i in range(0, len(res_deq))]

        if len(diff) >= max_repeats:
            if diff[-1] == diff[-2]:
                message = 'Terminating likely circular reference "{}"'.format(group.name)
                warnings.warn(message, UserWarning)
                print()
                return []

    dataset_list = list()
    for key, value in group.items():
        if isinstance(value, h5py.Dataset):
            current_path = group.name + '/{}'.format(key)
            dataset_list.append(current_path)
        elif isinstance(value, h5py.Group):
            # pass the limits along so non-default values apply at every level
            dataset_list += visit_data_sets(value, max_len_check, max_repeats)

        else:
            print('Unhandled class name {}'.format(value.__class__.__name__))

    return dataset_list

def visit_callback(name, obj):
    # `obj` rather than `object`, to avoid shadowing the built-in
    print('Visiting name = "{}", object name = "{}"'.format(name, obj.name))
    return None

hdf_fptr = h5py.File('link_test.hdf5', mode='w')

group1 = hdf_fptr.require_group('/junk/group1')
group1a = hdf_fptr.require_group('/junk/group1/group1a')
# group1a1 = hdf_fptr.require_group('/junk/group1/group1a/group1ai')
group2 = hdf_fptr.require_group('/junk/group2')
group3 = hdf_fptr.require_group('/junk/group3')

# create a circular reference
group1ai = group1a['group1ai'] = group1


avect = np.arange(0,12.3, 1.0)

dset = group1.create_dataset('avect', data=avect)

group2['alias'] = dset
group3['alias3'] = h5py.SoftLink(dset.name)


print('\nThis demonstrates  "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"')
print('Visiting Root - {}'.format(hdf_fptr.name))
hdf_fptr.visititems(visit_callback)

print('\nThis demonstrates  "h5py visititems" visiting "group2" with a Hard Link to "avect"')
print('Visiting Group - {}'.format(group2.name))
group2.visititems(visit_callback)
print('\nThis demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"')
print('Visiting Group - {}'.format(group3.name))
group3.visititems(visit_callback)


print('\n\nNow demonstrate recursive visit of Root looking for datasets')
print('using the function "visit_data_sets" in this code snippet.\n')
data_paths = visit_data_sets(hdf_fptr)

for data_path in data_paths:
    print('Data Path = "{}"'.format(data_path))

hdf_fptr.close()

The following output shows how "visititems" works, or for my purposes fails to work, in identifying all valid paths, while the recursion meets my needs and possibly yours.

This demonstrates  "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"
Visiting Root - /
Visiting name = "junk", object name = "/junk"
Visiting name = "junk/group1", object name = "/junk/group1"
Visiting name = "junk/group1/avect", object name = "/junk/group1/avect"
Visiting name = "junk/group1/group1a", object name = "/junk/group1/group1a"
Visiting name = "junk/group2", object name = "/junk/group2"
Visiting name = "junk/group3", object name = "/junk/group3"

This demonstrates  "h5py visititems" visiting "group2" with a Hard Link to "avect"
Visiting Group - /junk/group2
Visiting name = "alias", object name = "/junk/group2/alias"

This demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"
Visiting Group - /junk/group3


Now demonstrate recursive visit of Root looking for datasets
using the function "visit_data_sets" in this code snippet.

link_ref_test.py:26: UserWarning: Terminating likely circular reference "/junk/group1/group1a/group1ai/group1a/group1ai/group1a"

  warnings.warn(message, UserWarning)
Data Path = "/junk/group1/avect"
Data Path = "/junk/group1/group1a/group1ai/avect"
Data Path = "/junk/group1/group1a/group1ai/group1a/group1ai/avect"
Data Path = "/junk/group2/alias"
Data Path = "/junk/group3/alias3"

The first "Data Path" result is the original dataset. The second and third are references to the original dataset caused by a circular reference. The fourth result is a Hard Link and the fifth is a Soft Link to the original dataset.
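The regular-expression step that this answer leaves out could look like the sketch below. It is only an illustration: the helper name match_dataset_paths is hypothetical, the list is copied from the "Data Path" output above, and the pattern is matched against the last path component only, on the assumption that you are searching by dataset name rather than by full path.

```python
import re


def match_dataset_paths(paths, pattern):
    """Return the paths whose final component fully matches `pattern`.

    Hypothetical helper for filtering the list returned by visit_data_sets.
    """
    regex = re.compile(pattern)
    return [p for p in paths if regex.fullmatch(p.rsplit('/', 1)[-1])]


# Paths copied from the output shown above.
data_paths = [
    '/junk/group1/avect',
    '/junk/group1/group1a/group1ai/avect',
    '/junk/group1/group1a/group1ai/group1a/group1ai/avect',
    '/junk/group2/alias',
    '/junk/group3/alias3',
]

print(match_dataset_paths(data_paths, r'alias\d*'))
# ['/junk/group2/alias', '/junk/group3/alias3']
```

To test the whole path instead of just the name, apply regex.search to p directly.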