
I have an HDF5 file with 2 groups, each containing 50 datasets of 4D numpy arrays of the same type. I want to combine all 50 datasets in each group into a single dataset. In other words, instead of 2 x 50 datasets I want 2 x 1 datasets. How can I accomplish this? The file is 18.4 GB in size. I am a novice at working with large datasets. I am working in Python with h5py.

Thanks!

Filibuster

1 Answer


Look at this answer: How can I combine multiple .h5 file? - Method 3b: Merge all data into 1 Resizeable Dataset. It describes a way to copy data from multiple HDF5 files into a single dataset. You want to do something similar. The only difference is all of your datasets are in 1 HDF5 file.

You didn't say how you want to stack the 4D arrays. In my first answer I stacked them along axis=3. As noted in my comment, it's easier (and cleaner) to create the merged dataset as a 5D array and stack the data along the 5th axis (axis=4). I like this for 2 reasons: 1) the code is simpler/easier to follow, and 2) it's more intuitive (to me) that axis=4 represents a unique dataset (instead of slicing on axis=3).

I wrote a self-contained example to demonstrate the procedure. First it creates some data and closes the file. Then it reopens the file (read only) and creates a new file for the copied datasets. It loops over the groups and datasets in the first file and copies the data into a merged dataset in the second file. The 5D example is first, and my original 4D example follows.

Note: this is a simple example that will work for your specific case. If you are writing a general solution, it should check for consistent shapes and dtypes before blindly merging the data (which I don't do).
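For illustration, here is a minimal sketch of such a pre-check (my own addition, not part of the merge code below; it assumes the example file created in the next block):

import h5py

# Hypothetical pre-check: confirm every dataset in a group has the same
# shape and dtype before merging (not done in the merge code below)
with h5py.File('SO_69937402_2x5.h5', 'r') as h5f1:
    for grp in h5f1.keys():
        shapes = {h5f1[grp][ds].shape for ds in h5f1[grp].keys()}
        dtypes = {h5f1[grp][ds].dtype for ds in h5f1[grp].keys()}
        if len(shapes) > 1 or len(dtypes) > 1:
            raise ValueError(f'group {grp}: mixed shapes {shapes} or dtypes {dtypes}')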

Code to create the example data (2 groups, 5 datasets each):

import h5py
import numpy as np

# Create a simple H5 file with 2 groups and 5 datasets (shape=a0,a1,a2,a3)
with h5py.File('SO_69937402_2x5.h5','w') as h5f1:
    
    a0,a1,a2,a3 = 100,20,20,10
    grp1 = h5f1.create_group('group1')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp1.create_dataset(f'dset_{ds:02d}',data=arr)

    grp2 = h5f1.create_group('group2')
    for ds in range(1,6):
        arr = np.random.random(a0*a1*a2*a3).reshape(a0,a1,a2,a3)
        grp2.create_dataset(f'dset_{ds:02d}',data=arr)        
    

Code to merge the data (2 groups, 1 5D dataset each -- my preference):

with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_5d.h5','w') as h5f2:
          
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:',grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        ds_cnt = len(h5f1[grp].keys())
        for i,ds in enumerate(h5f1[grp].keys()):
            print('working on dataset:',ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it
                # Set maxshape so dataset is resizable along the stacking axis
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', dtype=h5f1[grp][ds].dtype,
                                     shape=(ds_shape+(ds_cnt,)), maxshape=(ds_shape+(None,)))

            # Now add data to the merged dataset
            merge_ds[:,:,:,:,i] = h5f1[grp][ds]
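
Optionally, a quick read-back comparison can confirm the copy (my own addition, using the same file and dataset names as above):

import h5py
import numpy as np

# Sanity check: each source dataset should equal its slice of the merged 5D dataset
with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_5d.h5','r') as h5f2:
    for grp in h5f1.keys():
        for i, ds in enumerate(h5f1[grp].keys()):
            assert np.array_equal(h5f1[grp][ds][:], h5f2[grp]['merged_ds'][:,:,:,:,i])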

Code to merge the data (2 groups, 1 4D dataset each):

with h5py.File('SO_69937402_2x5.h5','r') as h5f1, \
     h5py.File('SO_69937402_2x1_4d.h5','w') as h5f2:
          
    # loop on groups in existing file (h5f1)
    for grp in h5f1.keys():
        # Create group in h5f2 if it doesn't exist
        print('working on group:',grp)
        h5f2.require_group(grp)
        # Loop on datasets in group
        for ds in h5f1[grp].keys():
            print('working on dataset:',ds)
            if 'merged_ds' not in h5f2[grp].keys():
                # If dataset doesn't exist in group, create it from the first dataset
                # Set maxshape so dataset is resizable along axis=3
                ds_shape = h5f1[grp][ds].shape
                merge_ds = h5f2[grp].create_dataset('merged_ds', data=h5f1[grp][ds],
                                     maxshape=[ds_shape[0],ds_shape[1],ds_shape[2],None])
            else:
                # Otherwise, resize the merged dataset to hold the new values
                ds1_shape = h5f1[grp][ds].shape
                ds2_shape = merge_ds.shape
                merge_ds.resize(ds1_shape[3]+ds2_shape[3], axis=3)
                merge_ds[:,:,:, ds2_shape[3]:ds2_shape[3]+ds1_shape[3]] = h5f1[grp][ds]
kcw78
  • Two things to consider: 1) Create the merged dataset as a 5D array, and stack the data along the 5th axis (axis=4). Advantages: You could size the dataset when you create it, and it might simplify the code (or at least improve readability). 2) Consider chunking to improve I/O performance. Size the chunk to match the shape you plan to read (set `chunks=`; see the short sketch after these comments). – kcw78 Nov 13 '21 at 15:22
  • Hey thanks for the tips! Actually, the 4D array is a stack of 3D arrays stacked along axis 0. I think... I always get confused with these things. Let's say that is the case and I want to preserve the last 3 axes; do I then exchange 0 for 3 in your example of 2 groups, 1 4D dataset each? – Filibuster Nov 15 '21 at 03:42
  • I'm actually a little confused as to the purpose of appending on the 3rd axis? – Filibuster Nov 15 '21 at 04:36
  • Understanding the data model (aka "schema") of the existing and desired datasets is key to HDF5. Your comments help. In my initial answer, I append on `axis=3`, which is the _4th axis_. (I just picked the last axis). Did you see my updated answer? (where I append each dataset along axis=4 in a 5D array) IMHO, that is the easiest to understand and use. The array at `h5f1['group1']['dataset_0i'][:]` maps to the dataset at `h5f2['group1']['merged'][:,:,:,:,i-1]` (and vice versa). I can modify my answer if you want to create a 4D merged dataset and stack along axis=0. Let me know. – kcw78 Nov 15 '21 at 14:20
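
A minimal sketch of the chunking suggestion from the comment above (my own illustration, not part of the answer's code; the file name and chunk shape are placeholders you would tune to your actual read pattern):

import h5py

# Hypothetical chunked version of the merged 5D dataset
# Chunk shape here is one full source array per chunk; adjust to match how you read the data
a0, a1, a2, a3, ds_cnt = 100, 20, 20, 10, 5   # shapes from the example data above
with h5py.File('SO_69937402_2x1_5d_chunked.h5', 'w') as h5f2:
    grp = h5f2.require_group('group1')
    merge_ds = grp.create_dataset('merged_ds', dtype='f8',
                                  shape=(a0, a1, a2, a3, ds_cnt),
                                  maxshape=(a0, a1, a2, a3, None),
                                  chunks=(a0, a1, a2, a3, 1))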