I don't know if this is simpler, but here is a process to copy data without overwriting existing groups and datasets. It uses h5_object.visititems() to recursively visit all objects in a group and its subgroups. This retrieves groups and datasets one at a time. You write the "visitor function" to operate on the objects as they are found.
The bulk of my example creates 2 files with groups and datasets (to demonstrate). Focus on def visitor_func(name, node). That is where the work is done. I included extra print statements to show what's happening. My visitor function does the following:
- It checks if an object in File 1 is in File 2. If so, it skips.
- If the object (group or dataset) is NOT in File 2, it copies the object to File2.
- By default all objects within that group will be copied recursively (so you get datasets and subgroups).
- For datasets, the name= parameter is used to copy it to the same location/path in File2.
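To see what visititems() hands to a visitor before any copying is involved, here is a minimal sketch. The helper list_objects() and the file name visit_demo.h5 are made up for illustration, and the file is created in memory with the core driver so nothing is written to disk.

```python
import h5py
import numpy as np

def list_objects(h5_file):
    """Return (path, kind) pairs for every object below the root group."""
    found = []

    def visitor(name, node):
        # name is the path relative to the group visititems() was called on
        kind = 'group' if isinstance(node, h5py.Group) else 'dataset'
        found.append((name, kind))
        # Returning None tells visititems() to keep going

    h5_file.visititems(visitor)
    return found

# Hypothetical in-memory file, used only for this demonstration
with h5py.File('visit_demo.h5', 'w', driver='core', backing_store=False) as h5f:
    h5f.create_dataset('group1/df1', data=np.zeros((10, 10)))
    h5f.create_dataset('group1/sub/df2', data=np.zeros((10, 10)))
    objects = list_objects(h5f)
    for name, kind in objects:
        print(name, kind)
```

Note that the names passed to the visitor have no leading slash, and that subgroups are visited as well as datasets.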
Note that this code does NOT append data for common datasets from File 1 to File 2. For example, both files have a dataset '/group2/df1'. I DO NOT copy that data. I need to know more about your data structure to write the code to append. There are several things to consider if you want to append data to an existing dataset in File2. For example:
- Both datasets must have the same dtype (ints, floats, etc or a recarray).
- Both datasets must have compatible shapes.
- You need to define how you want to append (along which array axis?)
- If you want to add data later, special steps are required (a priori) to create resizable datasets. You need to use the maxshape=() parameter. Resizable datasets also need chunked storage enabled. (I think a default chunk size is set when you use maxshape.)
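As a rough sketch of what those considerations look like in code, assuming you want to append rows along axis=0: the helper append_rows() and the file name append_demo.h5 are my own illustrative names, and the file is kept in memory. Adapt the checks to your own schema.

```python
import h5py
import numpy as np

def append_rows(dset, new_rows):
    """Append new_rows to a resizable dataset along axis 0."""
    if dset.dtype != new_rows.dtype:
        raise TypeError('dtype mismatch')
    if dset.shape[1:] != new_rows.shape[1:]:
        raise ValueError('shape mismatch on the non-appended axes')
    old_len = dset.shape[0]
    # Grow the dataset first, then write the new rows into the added space
    dset.resize(old_len + new_rows.shape[0], axis=0)
    dset[old_len:] = new_rows

# Hypothetical in-memory file, used only for this demonstration
with h5py.File('append_demo.h5', 'w', driver='core', backing_store=False) as h5f:
    first = np.random.random((10, 10))
    second = np.random.random((10, 10))
    # maxshape=(None, 10) makes axis 0 unlimited; this must be set at creation
    ds = h5f.create_dataset('group2/df1', data=first, maxshape=(None, 10))
    append_rows(ds, second)
    final_shape = ds.shape
    is_chunked = ds.chunks is not None
    matches = np.array_equal(ds[:], np.vstack([first, second]))
    print(final_shape, ds.chunks)
```

A dataset created without maxshape cannot be resized later, which is why this has to be decided when File2 is first written.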
My example datasets highlight the challenge. All datasets are (10,10) ndarrays of floats. So, how should I append a (10,10) array in File 1 to a (10,10) array in File2? Should the result be:
- A) a (20,10) array (along axis=0), or
- B) a (10,20) array (along axis=1), or
- C) a (10,10,2) array (along a new axis=2)
All are logical and valid. The "correct answer" depends on your data schema.
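The three options can be compared quickly with plain NumPy arrays (no HDF5 involved). This is only an illustration of the resulting shapes; the variable names are mine.

```python
import numpy as np

a = np.random.random((10, 10))  # stands in for the File 1 dataset
b = np.random.random((10, 10))  # stands in for the File 2 dataset

option_a = np.concatenate([a, b], axis=0)  # A) stack rows
option_b = np.concatenate([a, b], axis=1)  # B) stack columns
option_c = np.stack([a, b], axis=2)        # C) add a new trailing axis

print(option_a.shape, option_b.shape, option_c.shape)
# (20, 10) (10, 20) (10, 10, 2)
```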
Look at Methods 3a and 3b in this answer for some ideas: How can I combine multiple .h5 file?
Example code below:
import h5py
import numpy as np
def visitor_func(name, node):
    # name is the object's path relative to the File 1 root; node is the Group or Dataset
    print('working on name:', name, ', path=', node.parent.name)
    if isinstance(node, h5py.Group):
        print('h5f1 object found:', name, 'is a group')
    elif isinstance(node, h5py.Dataset):
        print('h5f1 object found:', name, 'is a dataset')
    # h5f1 and h5f2 are the open file objects from the with blocks at the bottom
    if name in h5f2:
        print('Object:', name, 'also in File2. Skipping...\n')
    else:
        print('Object:', name, 'NOT in File2. Copying...\n')
        h5f1.copy(node, h5f2, name=name)

# Create File1 with 2 Groups with 2 Datasets in each
with h5py.File('SO_65365873_1.h5', mode='w') as h5f1:
    h5f1.create_group('/group1')
    arr = np.random.random((10,10))
    h5f1.create_dataset('/group1/df1', data=arr)
    arr = np.random.random((10,10))
    h5f1.create_dataset('/group1/df2', data=arr)
    h5f1.create_group('/group2')
    arr = np.random.random((10,10))
    h5f1.create_dataset('/group2/df1', data=arr)
    arr = np.random.random((10,10))
    h5f1.create_dataset('/group2/df2', data=arr)

# Create File2 with 1 Group with 2 Datasets
with h5py.File('SO_65365873_2.h5', mode='w') as h5f2:
    h5f2.create_group('/group2')
    arr = np.random.random((10,10))
    h5f2.create_dataset('/group2/df1', data=arr)
    arr = np.random.random((10,10))
    h5f2.create_dataset('/group2/df3', data=arr)

# Copy data from File1 to File2 WITHOUT overwriting
with h5py.File('SO_65365873_1.h5', mode='r') as h5f1:
    with h5py.File('SO_65365873_2.h5', mode='a') as h5f2:
        h5f1.visititems(visitor_func)
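
The visitor above relies on h5f1 and h5f2 being visible as global names. If you would rather pass the files explicitly, one possible variant wraps the visitor in a closure. This is only a sketch: copy_missing(), make_file() and the demo file names are hypothetical, and the files are created in memory so the example leaves nothing on disk.

```python
import h5py
import numpy as np

def copy_missing(src, dst):
    """Copy objects from src that are not already in dst; return the copied paths."""
    copied = []

    def visitor(name, node):
        if name not in dst:
            # Groups are copied recursively, so their children are then skipped
            src.copy(node, dst, name=name)
            copied.append(name)

    src.visititems(visitor)
    return copied

def make_file(fname, paths):
    """Create an in-memory file with a (10,10) float dataset at each path."""
    h5f = h5py.File(fname, 'w', driver='core', backing_store=False)
    for path in paths:
        h5f.create_dataset(path, data=np.random.random((10, 10)))
    return h5f

with make_file('demo_1.h5', ['group1/df1', 'group1/df2',
                             'group2/df1', 'group2/df2']) as f1:
    with make_file('demo_2.h5', ['group2/df1', 'group2/df3']) as f2:
        kept = f2['group2/df1'][:]          # File 2 data before the copy
        copied = copy_missing(f1, f2)
        all_paths = []
        f2.visit(all_paths.append)          # collect every path now in File 2
        unchanged = np.array_equal(f2['group2/df1'][:], kept)
        print('copied:', copied)
```

With these inputs only 'group1' (with both of its datasets) and 'group2/df2' are copied, and the existing 'group2/df1' in File 2 is left untouched.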