import h5py
import numpy as np

with h5py.File("myCardiac.hdf5", "w") as f:
    dset = f.create_dataset("mydataset", (100,), dtype='i')
    grp = f.create_group("G:/Brain Data/Brain_calgary/")

I tried this code to create an HDF5 file. There are 50 HDF5 files in a folder. I want to combine all 50 HDF5 files into one dataset in a single HDF5 file.

  • So, you have 50 h5 files, each containing one (or more?) datasets, and you want to copy them into the new file? Or concatenate them into one h5py.Dataset? – Juraj Apr 11 '22 at 17:29
  • These 50 h5 files all contain a dataset named 'kspace'. I want to concatenate them into one HDF5 file. – Gulfam Ahmed Saju Apr 11 '22 at 17:32
  • Can you provide more information about the "kspace" dataset? Do they all have the same shape? And what is the shape? – Juraj Apr 11 '22 at 17:43
  • Yes, all have the same size. Here is a summary of one h5 file: Group '/', Dataset 'kspace', Size: 24x170x218x256, MaxSize: 24x170x218x256, Datatype: H5T_IEEE_F32LE (single), ChunkSize: [], Filters: none, FillValue: 0.000000 – Gulfam Ahmed Saju Apr 11 '22 at 17:47
  • Ok, and you want to concatenate along some axis? Or make a stack, so the result dataset will have shape **50**x24x170x218x256 – Juraj Apr 11 '22 at 17:58
  • Yes, I just want to make a stack of the datasets. – Gulfam Ahmed Saju Apr 11 '22 at 18:05
  • The h5py Group `.copy()` method is handy, but you can't use it to merge multiple datasets into 1 dataset. Start with this answer: [How can I combine multiple .h5 file?](https://stackoverflow.com/a/58223603/10462884) It shows 4 methods to copy data from multiple h5 files to 1 h5 file. You want to use either **Method 3a or 3b** to merge all data into 1 dataset. Do you plan to keep the initial 50 files? If so, you could use links to region references in each file, but that is a more sophisticated approach that goes beyond the previous answer. – kcw78 Apr 11 '22 at 19:09

2 Answers


Another approach is to use HDF5 virtual datasets to get a combined view of the data in the new file without duplicating it (see the h5py virtual dataset reference). The example below is adapted from a previous answer: h5py error reading virtual dataset into NumPy array. Notes on this procedure:

  1. Comments in the code document most of the steps.
  2. I used glob.glob() to get a list of the source filenames using wildcards (assumes the source files are named myCardiac_01.hdf5, myCardiac_02.hdf5, ..., myCardiac_50.hdf5).
  3. The first source file is accessed to get the dtype and shape, which must be the same for all source files.
  4. The last step prints some arbitrary data slices to demonstrate the behavior for source shape=(24,170,218,256). Modify as appropriate for other data sources.

Source code below:

import h5py
import glob

# sorted() gives a deterministic file order (glob's ordering is arbitrary)
h5_files = sorted(glob.glob('myCardiac_*.hdf5'))

# Get parameters from source files to define virtual layout and dataset
a0 = len(h5_files)
with h5py.File(h5_files[0],'r') as h5f:
    h5_dtype = h5f['kspace'].dtype
    h5_shape = h5f['kspace'].shape
    print(h5_dtype,h5_shape)
    
# Assemble virtual dataset
vs_layout = h5py.VirtualLayout(shape=((a0,)+h5_shape), dtype=h5_dtype)
for n, h5file in enumerate(h5_files):
    vs_layout[n] = h5py.VirtualSource(h5file, 'kspace', shape=h5_shape)

# Add virtual dataset to output file
with h5py.File('myCardiac_VDS.h5', 'w') as f:
    f.create_virtual_dataset('kspace_vdata', vs_layout)

# print some data slices from the virtual dataset
with h5py.File('myCardiac_VDS.h5', 'r') as f:
    vds_ds = f['kspace_vdata']
    print(vds_ds.dtype,vds_ds.shape)
    for i in range(vds_ds.shape[0]):
        print(f'Slice from file {i}:\n{vds_ds[i,:,0,0,0]}')
kcw78

To merge 50 .h5 files, each with a dataset named kspace of shape (24, 170, 218, 256), into one large dataset, use this code:

import h5py
import os

with h5py.File("myCardiac.hdf5", "w") as f_dst:
    h5files = sorted(f for f in os.listdir() if f.endswith(".h5"))

    dset = f_dst.create_dataset("mydataset", shape=(len(h5files), 24, 170, 218, 256), dtype='f4')

    for i, filename in enumerate(h5files):
        with h5py.File(filename, "r") as f_src:
            dset[i] = f_src["kspace"]

Detailed description

Firstly, you must create a destination file myCardiac.hdf5. Then get the list of all .h5 files in the directory:

h5files = sorted(f for f in os.listdir() if f.endswith(".h5"))

NOTE: os.listdir() without arguments lists the files/folders in the current working directory. This assumes the Python script is in the same directory as the data files and that the CWD is set to that directory.
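As an alternative to filtering os.listdir(), the same file list can be built with the glob module in one step. A minimal sketch, assuming the script is run from the data directory:

```python
import glob

# Match every .h5 file in the current directory; sorted() gives a
# stable, reproducible order (glob's own ordering is arbitrary)
h5files = sorted(glob.glob("*.h5"))
```

With a sorted list, the index i in the stacked dataset always corresponds to the same source file across runs.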

The next step is to create a dataset in the destination file with the desired size and data type:

dset = f_dst.create_dataset("mydataset", shape=(len(h5files), 24, 170, 218, 256), dtype='f4')

You can then iteratively copy the data from the source files to the target dataset.

for i, filename in enumerate(h5files):
    with h5py.File(filename, "r") as f_src:
        dset[i] = f_src["kspace"]
Juraj
  • Use `glob.iglob()` to get the list of files. Much cleaner than a list comprehension with `os.listdir()`, and no intermediate list object is needed. – kcw78 Apr 11 '22 at 19:11
  • Suppose the h5 files have different sizes; what changes should be made? – Gulfam Ahmed Saju Apr 11 '22 at 19:33
  • @Gulfam Ahmed Saju, that _**could**_ be done. However, it gets tricky handling the general case. For example, will all datasets have the same # of axes (4 in your case) or different shapes? Will there be more than 1 axis with a different size? To do this, you need to get the shape of every dataset from the source files, and find the max dimension of each. Then use that to allocate the merged dataset accordingly. (you should also check dtype compatibility) The real question: why merge the data? This is a lot of work to create duplicate data with no obvious benefit. – kcw78 Apr 12 '22 at 00:25
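To make the padding idea from the last comment concrete, here is a hedged sketch (the function name `merge_padded` is mine, not from the thread): it reads the shape of 'kspace' from every source file, allocates the merged dataset to the per-axis maximum, and lets the HDF5 fill value zero-pad the uncovered regions.

```python
import h5py
import numpy as np

def merge_padded(h5files, out_name="merged.hdf5"):
    """Stack 'kspace' datasets of differing sizes, zero-padding to the max shape."""
    # Collect the shape of 'kspace' from every source file
    shapes = []
    for fn in h5files:
        with h5py.File(fn, "r") as f:
            shapes.append(f["kspace"].shape)
    if len({len(s) for s in shapes}) != 1:
        raise ValueError("all datasets must have the same number of axes")
    # Per-axis maximum across all source files
    max_shape = tuple(int(n) for n in np.max(shapes, axis=0))
    with h5py.File(out_name, "w") as f_dst:
        # fillvalue=0 pads the regions not covered by smaller source datasets
        dset = f_dst.create_dataset(
            "mydataset", shape=(len(h5files),) + max_shape,
            dtype="f4", fillvalue=0)
        for i, fn in enumerate(h5files):
            with h5py.File(fn, "r") as f_src:
                data = f_src["kspace"][...]
                # Write each source array into the "corner" of its slot;
                # the rest of the slot keeps the fill value
                dset[(i,) + tuple(slice(0, n) for n in data.shape)] = data
    return max_shape
```

As kcw78 notes, you should also check dtype compatibility across the source files before merging, and consider whether duplicating the data is worth it at all.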