Have you investigated h5py Group .copy()
method? Although documented as a group action, it works with any h5py object (groups, datasets, links and references). By default it copies object attributes, and supports recursive copying of group members. If you prefer a command line tool, the HDF Group has one to do this. Take a look at h5copy
here: HDF5 Group h5 copy doc
Here is a example that demonstrates a simple h5py .copy()
implementation. It creates a set of 3 files -- each with 1 dataset (named /dataset
, dtype=float
, shape=(10,10)
). It then creates a NEW HDF5 file, and is followed by another loop to open the previous files and copies the dataset from the "read" file (h5r
) to the new "write" file (h5w
).
for i in range (1,4):
with h5py.File('SO_68025342_'+str(i)+'.h5',mode='w') as h5f:
arr = np.random.random(100).reshape(10,10)
h5f.create_dataset('dataset',data=arr)
with h5py.File('SO_68025342_all.h5',mode='w') as h5w:
for i in range (1,4):
with h5py.File('SO_68025342_'+str(i)+'.h5',mode='r') as h5r:
h5r.copy('dataset', h5w, name='dataset_'+str(i) )
Here is a method to copy data from multiple files to a single dataset in the merged file. It comes with caveats: 1) all datasets must have the same shape, and 2) you know the number of datasets in advance to size the new dataset. (If not, you can create a resizeable dataset by addingmaxshape=(None,a0,a1)
, and then use .resize()
as needed. I have another post with 2 examples here: How can I combine multiple .h5 file? Look at Methods 3a and 3b.
with h5py.File('SO_68025342_merge.h5',mode='w') as h5w:
for i in range (1,4):
with h5py.File('SO_68025342_'+str(i)+'.h5',mode='r') as h5r:
if 'dataset' not in h5w.keys():
a0, a1 = h5r['dataset'].shape
h5w.create_dataset('dataset', shape=(3,a0,a1))
h5w['dataset'][i-1,:] = h5r['dataset']
Assuming your files aren't so conveniently named, you can use glob.iglob()
to loop on the file names to read. Then use .keys()
to get the dataset names in each file. Also, if all of your datasets really are named /dataset
, you need to come up with a naming convention for the new datasets.
Here is a link to the h5py docs with more details: h5py Group .copy() method