
I have created a huge hdf5 dataset in the following form:

group1/raw
group1/preprocessed
group1/postprocessed
group2/raw
group2/preprocessed
group2/postprocessed
....
group10/raw
group10/preprocessed
group10/postprocessed

However, I realized that for portability I would like to have 10 different HDF5 files, one for each group. Is there a function in Python to achieve this without looping through all the data and scanning the entire original HDF5 tree?

something like:


import h5py

file_path = 'path/to/data.hdf5'

hf = h5py.File(file_path, 'r')

print(hf.keys())

for group in hf.keys():

    # create a new file for the group
    hf_tmp = h5py.File(group + '.h5', 'w')
    # get data from hf[group] and dump it into the new file
    # something like:
    # hf_tmp = hf[group]
    # hf_tmp.dump()
    hf_tmp.close()


hf.close()

Chutlhu

1 Answer


You have the right idea. There are several questions and answers on SO that show how to do this.

Start with this one. It shows how to loop over the keys and determine whether each one is a group or a dataset: h5py: how to use keys() loop over HDF5 Groups and Datasets
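To make that first step concrete, here is a minimal, self-contained sketch (the file name and dataset names are made up for illustration). It builds a tiny example file and then uses isinstance() checks against h5py.Group and h5py.Dataset to tell the two apart while iterating over the root-level keys:

```python
import h5py
import numpy as np

# Build a small example file (made-up names/data, just for the demo).
with h5py.File('example.h5', 'w') as hf:
    hf.create_dataset('root_dset', data=np.arange(3))
    hf.create_group('group1').create_dataset('raw', data=np.ones(4))

# Loop over root-level keys and check each object's type.
with h5py.File('example.h5', 'r') as hf:
    for name in hf.keys():
        item = hf[name]
        if isinstance(item, h5py.Group):
            print(name, 'is a Group')
        elif isinstance(item, h5py.Dataset):
            print(name, 'is a Dataset')
```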

Then look at these. Each shows a slightly different approach to the problem.

This shows one way. Extracting datasets from 1 HDF5 file to multiple files

Also, here is an earlier post I wrote: How to copy a dataset object to a different hdf5 file using pytables or h5py?

This does the opposite (copies datasets from different files to 1 file). It's useful because it demonstrates how to use the .copy() method: How can I combine multiple .h5 file?
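Since the .copy() method copies recursively, a single call can copy an entire group (and every dataset in it) into a new file, with no explicit loop over the datasets. A minimal sketch, with made-up file and dataset names:

```python
import h5py
import numpy as np

# Build a small source file matching the question's schema (made-up data).
with h5py.File('data.h5', 'w') as hf:
    g = hf.create_group('group1')
    g.create_dataset('raw', data=np.arange(5))
    g.create_dataset('preprocessed', data=np.arange(5) * 2.0)

# .copy() is recursive: one call copies the group and everything in it.
with h5py.File('data.h5', 'r') as src, h5py.File('group1.h5', 'w') as dst:
    src.copy('group1', dst)
```

Note this puts the datasets under /group1 in the new file. If you want them at the root of the new file instead, copy each dataset individually as in the pseudo-code at the end of this answer.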

Finally, you should review the visititems() method to recursively search all Groups and Datasets. Take a look at this answer for details: is there a way to get datasets in all groups at once in h5py?

That should answer your questions.

Below is some pseudo-code that pulls all of these ideas together. It works for your schema, where all datasets are in root-level groups. It will not work for the more general case with datasets at multiple group levels. Use visititems() for the more general case.

Pseudo-code below:

import h5py

with h5py.File(file_path, 'r') as hf:
    print(hf.keys())
    # loop over group names at the root level
    for group in hf.keys():
        hf_tmp = h5py.File(group + '.h5', 'w')
        # loop over dataset names in the group
        for dset in hf[group].keys():
            # copy the dataset to the new group file
            hf.copy(group + '/' + dset, hf_tmp)
        hf_tmp.close()
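For the more general case mentioned above, where datasets can sit at any depth, visititems() walks the whole tree and calls your function once for every Group and Dataset. A hedged sketch (file and dataset names are made up), collecting the full path of every dataset:

```python
import h5py
import numpy as np

# Build a file with datasets at several group depths (made-up names).
with h5py.File('nested.h5', 'w') as hf:
    hf.create_dataset('group1/raw', data=np.zeros(2))
    hf.create_dataset('group1/sub/deep', data=np.ones(2))

# visititems() visits every object in the tree; keep only the Datasets.
dset_paths = []
def collect(name, obj):
    if isinstance(obj, h5py.Dataset):
        dset_paths.append(name)

with h5py.File('nested.h5', 'r') as hf:
    hf.visititems(collect)

print(dset_paths)
```

From the collected paths you can decide which output file each dataset belongs to (for example, by the first path component) and then use .copy() as shown above.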
kcw78