How to append a group in an H5 file to another H5 file without overwriting the same group in the second file

Question

I have 2 H5 files, file1.h5 and file2.h5. Some of the contents of the files are as follows:

file1:

group1
- keyname1
- keyname2

file2:

group1
- dataframe1
- dataframe3

Both files may contain other groups. I want to append the contents of group1 in file1 to the contents of group1 in file2 without overwriting the original contents of file2, so that at the end of the process, file2 has the following form:

group1
- dataframe1 (file1 contents appended to file2 original contents)
- dataframe2
- dataframe3

I know the copy method of h5py can copy a group from one H5 file to another, but the code

import h5py
with h5py.File('file1.h5','r') as g:
    with h5py.File('file2.h5','a') as h:
        g.copy('group1',h)

will overwrite the original contents of file2, and I don't want to do that.

I know I could do the following:

import h5py
import pandas as pd
with h5py.File('file1.h5','r') as g:
    # Copy the names into a list now; the keys view cannot be used after the file closes
    keynames = list(g['group1'].keys())
for name in keynames:
    df = pd.read_hdf('file1.h5',key = 'group1/' + name)
    df.to_hdf('file2.h5',key = 'group1/' + name,mode = 'a',append = True)

Is there a simpler, more convenient way to do this, along the lines of the h5py copy method?

I think `dataframe1` for both files has to be loaded and concatenated on the appropriate axis. Then, if there's a way to delete `dataframe1` from `file2` (check the docs), write the new array to that group. In the worst case, write it to a different dataset name. While it is possible to define a dataset that can grow, in general you can't change the size of an existing dataset. — hpaulj, Dec 19 '20 at 01:49
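The load, concatenate, delete, and rewrite sequence described in this comment could look roughly like the sketch below. It assumes plain h5py datasets (not tables written by pandas/PyTables), matching dtypes, and concatenation along axis 0. The helper name `replace_with_concatenation` and the temporary file paths are hypothetical and only there for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def replace_with_concatenation(src_path, dst_path, name, axis=0):
    """Replace dataset `name` in dst with the dst data followed by the src data."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "a") as dst:
        combined = np.concatenate([dst[name][()], src[name][()]], axis=axis)
        del dst[name]  # unlinks the old dataset; the file does not shrink
        dst.create_dataset(name, data=combined)
        return combined.shape


# Demo with two small throwaway files (hypothetical paths).
tmpdir = tempfile.mkdtemp()
file1 = os.path.join(tmpdir, "file1.h5")
file2 = os.path.join(tmpdir, "file2.h5")
with h5py.File(file1, "w") as f:
    f.create_dataset("group1/dataframe1", data=np.ones((3, 4)))
with h5py.File(file2, "w") as f:
    f.create_dataset("group1/dataframe1", data=np.zeros((5, 4)))

new_shape = replace_with_concatenation(file1, file2, "group1/dataframe1")
print(new_shape)
```

Deleting a dataset only removes the link, so the space it occupied stays allocated in the file unless the file is repacked afterwards.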

kcw78 · Answer 1 · 2020-12-21T02:18:41.043

I don't know if this is simpler, but it is a process to copy data without overwriting existing groups and datasets. It uses h5_object.visititems() to recursively visit all objects in a group and its subgroups. This retrieves groups and datasets one at a time. You write the "visitor function" to operate on the objects as they are found.

The bulk of my example creates 2 files with groups and datasets (to demonstrate). Focus on def visitor_func(name, node). That is where the work is done. I included extra print statements to show what's happening. My visitor function does the following:

It checks if an object in File 1 is in File 2. If so, it skips.
If the object (group or dataset) is NOT in File 2, it copies the object to File2.
By default all objects within that group will be copied recursively (so you get datasets and subgroups).
For datasets, the name= parameter is used to copy it to the same location/path in File2.

Note that this code does NOT append data for common datasets from File 1 to File 2. For example, both files have a dataset '/group2/df1'. I DO NOT copy that data. I need to know more about your data structure to write the code to append. There are several things to consider if you want to append data to an existing dataset in File 2. For example:

Both datasets must have the same dtype (ints, floats, etc or a recarray).
Both datasets must have compatible shapes.
You need to define how you want to append (along which array axis?)
If you want to add data later, special steps are required (a priori) to create resizable datasets. You need to use the maxshape=() parameter. Resizable datasets also need chunked storage enabled. (I think a default chunk size is set when you use maxshape.)
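As a rough illustration of the last point, here is a minimal sketch of a dataset created with maxshape and later extended with Dataset.resize(). The file path and dataset name are hypothetical, and the example assumes growth along axis 0 only.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical throwaway file used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "resizable_demo.h5")

with h5py.File(path, "w") as h5f:
    first = np.random.random((10, 10))
    # maxshape=(None, 10): axis 0 may grow without limit, axis 1 stays at 10.
    ds = h5f.create_dataset("group1/df1", data=first, maxshape=(None, 10))
    chunk_shape = ds.chunks  # chunk shape chosen by h5py, since none was given

with h5py.File(path, "a") as h5f:
    ds = h5f["group1/df1"]
    extra = np.random.random((10, 10))
    old_rows = ds.shape[0]
    ds.resize(old_rows + extra.shape[0], axis=0)  # grow along axis 0
    ds[old_rows:] = extra                         # fill the newly added rows
    final_shape = ds.shape

print(chunk_shape, final_shape)
```

A dataset created without maxshape has a fixed shape, which is why an existing fixed-size dataset has to be replaced rather than extended in place.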

My example datasets highlight the challenge. All datasets are (10,10) ndarrays of floats. So, how should I append a (10,10) array in File 1 to a (10,10) array in File2? Should the result be:

A) a (20,10) array (along axis=0), or
B) a (10,20) array (along axis=1), or
C) a (10,10,2) array (along a new axis=2)

All are logical and valid. The "correct answer" depends on your data schema.
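The three options map onto NumPy calls as follows. This is only an in-memory illustration; the arrays `a` and `b` are stand-ins for the two (10,10) datasets.

```python
import numpy as np

a = np.random.random((10, 10))  # stands in for the dataset already in File 2
b = np.random.random((10, 10))  # stands in for the dataset from File 1

option_a = np.concatenate([a, b], axis=0)  # A) shape (20, 10)
option_b = np.concatenate([a, b], axis=1)  # B) shape (10, 20)
option_c = np.stack([a, b], axis=2)        # C) shape (10, 10, 2)

print(option_a.shape, option_b.shape, option_c.shape)
```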

Look at Methods 3a and 3b in this answer for some ideas: How can I combine multiple .h5 file?

Example code below:

import h5py
import numpy as np

def visitor_func(name, node):
    # Relies on the module-level handles h5f1 (source) and h5f2 (destination)
    # opened in the "with" blocks at the bottom of the script.
    print('working on name:', name, ', path=', node.parent.name)
    if isinstance(node, h5py.Group):
        print('h5f1 object found:', name, 'is a group')
    elif isinstance(node, h5py.Dataset):
        print('h5f1 object found:', name, 'is a dataset')

    if name in h5f2:
        print('Object:', name, 'also in File2. Skipping...\n')
    else:
        print('Object:', name, 'NOT in File2. Copying...\n')
        h5f1.copy(node, h5f2, name=name)
        

# Create File1 with 2 Groups with 2 Datasets in each 
with h5py.File('SO_65365873_1.h5', mode='w') as h5f1:
    h5f1.create_group('/group1')
    arr = np.random.random((10,10))
    h5f1.create_dataset('/group1/df1', data=arr)
    arr = np.random.random((10,10))
    h5f1.create_dataset('/group1/df2',data=arr)
    h5f1.create_group('/group2')
    arr = np.random.random((10,10))
    h5f1.create_dataset('/group2/df1', data=arr)
    arr = np.random.random((10,10))
    h5f1.create_dataset('/group2/df2',data=arr)

# Create File2 with 1 Group with 2 Datasets    
with h5py.File('SO_65365873_2.h5', mode='w') as h5f2:
    h5f2.create_group('/group2')
    arr = np.random.random((10,10))
    h5f2.create_dataset('/group2/df1', data=arr)
    arr = np.random.random((10,10))
    h5f2.create_dataset('/group2/df3',data=arr)

# Copy data from File1 to File2 WITHOUT overwriting
with h5py.File('SO_65365873_1.h5', mode='r') as h5f1:
    with h5py.File('SO_65365873_2.h5', mode='a') as h5f2:
        h5f1.visititems(visitor_func)
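
If the common datasets should be appended rather than skipped, one possible extension of the visitor is sketched below. It is not part of the example above and rests on assumptions: plain numeric h5py datasets (not pandas/PyTables tables), matching dtypes, and appending along axis 0 by replacing the destination dataset. The function name `merge_file` and the temporary paths are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def merge_file(src_path, dst_path, axis=0):
    """Copy objects missing from dst; concatenate datasets present in both files."""
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "a") as dst:
        # Snapshot the names that exist in dst BEFORE anything is copied, so
        # datasets that arrive via a recursive group copy are not appended twice.
        existing = set()
        dst.visit(existing.add)

        def visitor(name, node):
            if name in existing:
                if isinstance(node, h5py.Dataset) and isinstance(dst[name], h5py.Dataset):
                    combined = np.concatenate([dst[name][()], node[()]], axis=axis)
                    del dst[name]
                    dst.create_dataset(name, data=combined)
            elif name not in dst:
                src.copy(node, dst, name=name)  # groups are copied recursively
            # else: already copied together with its parent group; nothing to do

        src.visititems(visitor)


# Demo files laid out like the example above (hypothetical paths).
tmpdir = tempfile.mkdtemp()
file1 = os.path.join(tmpdir, "file1.h5")
file2 = os.path.join(tmpdir, "file2.h5")
with h5py.File(file1, "w") as f:
    for ds_name in ("group1/df1", "group1/df2", "group2/df1", "group2/df2"):
        f.create_dataset(ds_name, data=np.random.random((10, 10)))
with h5py.File(file2, "w") as f:
    for ds_name in ("group2/df1", "group2/df3"):
        f.create_dataset(ds_name, data=np.random.random((10, 10)))

merge_file(file1, file2)

# Collect the resulting dataset shapes in File 2.
shapes = {}
with h5py.File(file2, "r") as f:
    def record(name, node):
        if isinstance(node, h5py.Dataset):
            shapes[name] = node.shape
    f.visititems(record)
print(shapes)
```

The snapshot of existing names matters: without it, a dataset copied as part of a whole group would be found in File 2 on the next visit and concatenated with itself.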
