Everything that is available online is too complicated. My database is large, so I exported it in parts. I now have three .h5 files and I would like to combine them into one .h5 file for further work. How can I do it?
4 Answers
These examples show how to use h5py to copy datasets between two HDF5 files. See my other answer for PyTables examples. I created some simple HDF5 files to mimic CSV-type data (all floats, but the process is the same if you have mixed data types). Based on your description, each file has only one dataset. When you have multiple datasets, you can extend this process with `visititems()` in h5py (a minimal sketch follows).
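For reference, here is a minimal sketch of one way `visititems()` could be used to collect every dataset path in a file (the filename and list name are illustrative, not part of the examples below):

import h5py

ds_paths = []
def collect_datasets(name, obj):
    # visititems() calls this for every group/dataset; keep dataset paths only
    if isinstance(obj, h5py.Dataset):
        ds_paths.append(name)

with h5py.File('file1.h5', 'r') as h5f:
    h5f.visititems(collect_datasets)
print(ds_paths)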
Note: code to create the HDF5 files used in the examples is at the end.
All methods use `glob()` to find the HDF5 files used in the operations below.
Method 1: Create External Links
This results in 3 Groups in the new HDF5 file, each with an external link to the original data. It does not copy the data, but provides access to the data in all of the files through the links in a single file.
import glob
import h5py

with h5py.File('table_links.h5', mode='w') as h5fw:
    link_cnt = 0
    for h5name in glob.glob('file*.h5'):
        link_cnt += 1
        h5fw['link' + str(link_cnt)] = h5py.ExternalLink(h5name, '/')
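Once the links exist, data can be read through them as if it were local; a short hedged example (the dataset name `data_1` matches the file-creation code at the end):

with h5py.File('table_links.h5', mode='r') as h5fl:
    # 'link1' dereferences to the root group of the first linked file
    arr = h5fl['link1']['data_1'][:]
    print(arr.shape)

Note the linked source files must remain available alongside table_links.h5 for the links to resolve.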
Method 2a: Copy Data 'as-is'
(26-May-2020 update: This uses the `.copy()` method for all datasets.)
This copies the data from each dataset in the original files to the new file using the original dataset names. It loops to copy ALL root-level datasets. This requires the datasets in each file to have different names. The data is not merged into one dataset.
with h5py.File('table_copy.h5', mode='w') as h5fw:
    for h5name in glob.glob('file*.h5'):
        with h5py.File(h5name, 'r') as h5fr:
            for obj in h5fr.keys():
                h5fr.copy(obj, h5fw)
Method 2b: Copy Data 'as-is'
(This was my original answer, before I knew about the `.copy()` method.)
This copies the data from each dataset in the original file to the new file using the original dataset name. This requires datasets in each file to have different names. The data is not merged into one dataset.
with h5py.File('table_copy.h5', mode='w') as h5fw:
    for h5name in glob.glob('file*.h5'):
        with h5py.File(h5name, 'r') as h5fr:
            dset1 = list(h5fr.keys())[0]
            arr_data = h5fr[dset1][:]
            h5fw.create_dataset(dset1, data=arr_data)
Method 3a: Merge all data into 1 Fixed-size Dataset
This copies and merges the data from each dataset in the original files into a single dataset in the new file. In this example there are no restrictions on the dataset names. Also, I initially create a large dataset and don't resize it. This assumes there are enough rows to hold all of the merged data. Tests should be added in production work (a hedged sketch follows the code).
with h5py.File('table_merge.h5', mode='w') as h5fw:
    row1 = 0
    for h5name in glob.glob('file*.h5'):
        with h5py.File(h5name, 'r') as h5fr:
            dset1 = list(h5fr.keys())[0]
            arr_data = h5fr[dset1][:]
            h5fw.require_dataset('alldata', dtype="f", shape=(50, 5), maxshape=(100, 5))
            h5fw['alldata'][row1:row1 + arr_data.shape[0], :] = arr_data[:]
            row1 += arr_data.shape[0]
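As one example of the production test mentioned above, a hedged sketch of a simple bounds check (reusing the same variable names) that could run before the write:

# sketch: confirm the incoming rows fit in the pre-allocated dataset
if row1 + arr_data.shape[0] > h5fw['alldata'].shape[0]:
    raise ValueError(f'data from {h5name} would overflow alldata; '
                     'increase shape/maxshape or use Method 3b instead')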
Method 3b: Merge all data into 1 Resizeable Dataset
This is similar to the method above. However, I create a resizeable dataset and enlarge it based on the amount of data that is read and added.
with h5py.File('table_merge.h5', mode='w') as h5fw:
    row1 = 0
    for h5name in glob.glob('file*.h5'):
        with h5py.File(h5name, 'r') as h5fr:
            dset1 = list(h5fr.keys())[0]
            arr_data = h5fr[dset1][:]
            dslen = arr_data.shape[0]
            cols = arr_data.shape[1]
            if row1 == 0:
                h5fw.create_dataset('alldata', dtype="f", shape=(dslen, cols), maxshape=(None, cols))
            if row1 + dslen <= len(h5fw['alldata']):
                h5fw['alldata'][row1:row1 + dslen, :] = arr_data[:]
            else:
                h5fw['alldata'].resize((row1 + dslen, cols))
                h5fw['alldata'][row1:row1 + dslen, :] = arr_data[:]
            row1 += dslen
To create the source files read above:
import h5py
import numpy as np

for fcnt in range(1, 4, 1):
    fname = 'file' + str(fcnt) + '.h5'
    arr = np.random.random(50).reshape(10, 5)
    with h5py.File(fname, 'w') as h5fw:
        h5fw.create_dataset('data_' + str(fcnt), data=arr)

- When using method 2 (Copy Data 'as-is'), I guess HDF5 dataset properties like chunks and compression level are not copied as well. Would you happen to know a way to copy them without having to specify each property? – F.Wessels Mar 24 '20 at 08:08
- You are correct. Method 2 creates a new dataset, then copies the data from the first dataset. So, you would have to get the properties, then use them when you create the new dataset (a sketch of this appears after these comments). At the time I wrote that response, I was not aware of the h5py `.copy()` method to copy groups and datasets. I suspect a new dataset created with `.copy()` will inherit the properties -- but you should test to confirm. (It's similar to the PyTables `copy_children()` method below.) I need to update my answer to add that method. – kcw78 Mar 24 '20 at 13:27
- Note: I recently posted an answer that describes how to do this. Take a look at this answer for details: [quickly-extract-tables-to-a-different-hdf5-file](https://stackoverflow.com/a/60792094/10462884) – kcw78 Mar 24 '20 at 13:27
- @kcw78 Let's say that I have some HDF5 files that all have the same structure, i.e. they have the same number of keys (all of which have the same names). Per key, there are several datasets, but again, they have the same names across the HDF5 files. Ideally, I would now like to obtain a single HDF5 file that has the same number of keys (with the same names), and per key the same number of datasets. The only thing that should change is the shape of the datasets, since the final HDF5 file is a combination of the single ones. Could you please show how your code would have to be updated for Method 3a? – Hermi Mar 11 '22 at 14:47
- You already wrote that `visititems()` would have to be used for several datasets, but I'm additionally considering several keys of the HDF5 files. – Hermi Mar 11 '22 at 14:48
- @Hermi, I adapted the existing solution to do something like you described. See the new answer I added today. Note, it doesn't use `.visititems()`. I found a better solution using a generator. – kcw78 Mar 11 '22 at 22:13
- Is there a small typo in `h5r.copy(obj, h5fw)`? Should it be `h5fr.copy(obj, h5fw)`? The variable name should have an `f` character, right? – Hakan Baba Nov 01 '22 at 23:40
- @Hakan Baba, if you are referring to Method 2a - yes, that is a typo. Good eyes. I will fix it now. – kcw78 Nov 02 '22 at 00:25
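As a hedged sketch of what the comment thread above describes -- reading the source dataset's storage properties and reusing them when creating the copy (the filenames and dataset name are illustrative):

with h5py.File('file1.h5', 'r') as h5fr, h5py.File('table_copy.h5', 'w') as h5fw:
    src = h5fr['data_1']
    # reuse the source dataset's chunking and compression settings
    h5fw.create_dataset('data_1', data=src[:],
                        chunks=src.chunks,
                        compression=src.compression,
                        compression_opts=src.compression_opts)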
For those that prefer using PyTables, I redid my h5py examples to show different ways to copy data between two HDF5 files. These examples use the same example HDF5 files as before. Each file only has one dataset. When you have multiple datasets, you can extend this process with `walk_nodes()` in PyTables (a minimal sketch follows).
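For reference, a minimal sketch of using `walk_nodes()` to list every Leaf (dataset) in a file (the filename is illustrative):

import tables as tb

with tb.File('file1.h5', mode='r') as h5f:
    # walk_nodes() descends the object tree; classname='Leaf' filters to datasets
    for node in h5f.walk_nodes('/', classname='Leaf'):
        print(node._v_pathname)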
All methods use glob() to find the HDF5 files used in the operations below.
Method 1: Create External Links
Similar to h5py, it creates 3 Groups in the new HDF5 file, each with an external link to the original data. The data is NOT copied.
import glob
import tables as tb

with tb.File('table_links_2.h5', mode='w') as h5fw:
    link_cnt = 0
    for h5name in glob.glob('file*.h5'):
        link_cnt += 1
        h5fw.create_external_link('/', 'link' + str(link_cnt), h5name + ':/')
Method 2: Copy Data 'as-is'
This copies the data from each dataset in the original file to the new file using the original dataset name. The dataset object is the same type as in the source HDF5 file. In this case, they are PyTables Arrays (because all columns are the same type). The datasets are copied using the names from the source HDF5 files, so each must have a different name. The data is not merged into a single dataset.
with tb.File('table_copy_2.h5', mode='w') as h5fw:
    for h5name in glob.glob('file*.h5'):
        with tb.File(h5name, mode='r') as h5fr:
            print(h5fr.root._v_children)
            h5fr.root._f_copy_children(h5fw.root)
Method 3a: Merge all data into 1 Array
This copies and merges the data from each dataset in the original files into a single dataset in the new file. Again, the data is saved as a PyTables Array. There are no restrictions on the dataset names. First I read the data and append it to a NumPy array. Once all files have been processed, the NumPy array is copied to the PyTables Array. This process holds the NumPy array in memory, so it may not work for large datasets. You can avoid this limitation by using a PyTables EArray (Enlargeable Array). See Method 3b.
import numpy as np

with tb.File('table_merge_2a.h5', mode='w') as h5fw:
    row1 = 0
    for h5name in glob.glob('file*.h5'):
        with tb.File(h5name, mode='r') as h5fr:
            dset1 = h5fr.root._f_list_nodes()[0]
            arr_data = dset1[:]
            if row1 == 0:
                all_data = arr_data.copy()
            else:
                all_data = np.append(all_data, arr_data, axis=0)
            row1 += arr_data.shape[0]
    tb.Array(h5fw.root, 'alldata', obj=all_data)
Method 3b: Merge all data into 1 Enlargeable EArray
This is similar to the method above, but saves the data incrementally in a PyTables EArray. The `EArray.append()` method is used to add the data. This process reduces the memory issues in Method 3a.
with tb.File('table_merge_2b.h5', mode='w') as h5fw:
    row1 = 0
    for h5name in glob.glob('file*.h5'):
        with tb.File(h5name, mode='r') as h5fr:
            dset1 = h5fr.root._f_list_nodes()[0]
            arr_data = dset1[:]
            if row1 == 0:
                earr = h5fw.create_earray(h5fw.root, 'alldata',
                                          shape=(0, arr_data.shape[1]), obj=arr_data)
            else:
                earr.append(arr_data)
            row1 += arr_data.shape[0]
Method 4: Merge all data into 1 Table
This example highlights the differences between h5py and PyTables. In h5py, datasets can reference `np.ndarray` or `np.recarray` data -- h5py deals with the different dtypes. In PyTables, Arrays (and CArrays and EArrays) reference `np.ndarray` data, and Tables reference `np.recarray` data. This example shows how to convert the `np.ndarray` data from the source files into `np.recarray` data suitable for Table objects. It also shows how to use `Table.append()`, similar to `EArray.append()` in Method 3b.
with tb.File('table_append_2.h5', mode='w') as h5fw:
    row1 = 0
    for h5name in glob.glob('file*.h5'):
        with tb.File(h5name, mode='r') as h5fr:
            dset1 = h5fr.root._f_list_nodes()[0]
            arr_data = dset1[:]
            ds_dt = [('f1', float), ('f2', float), ('f3', float),
                     ('f4', float), ('f5', float)]
            recarr_data = np.rec.array(arr_data, dtype=ds_dt)
            if row1 == 0:
                data_table = h5fw.create_table('/', 'alldata', obj=recarr_data)
            else:
                data_table.append(recarr_data)
            h5fw.flush()
            row1 += arr_data.shape[0]
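For completeness, a hedged sketch of reading the merged Table back into memory (assuming the file created above; `Table.read()` returns a NumPy structured array):

with tb.File('table_append_2.h5', mode='r') as h5f:
    rec = h5f.root.alldata.read()
    print(rec.dtype.names, rec.shape)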

There are at least 3 ways to combine data from individual HDF5 files into a single file:
- Use external links to create a new file that points to the data in your other files (supported by both h5py and pytables, as shown in the other answers)
- Copy the data with the HDF Group utility: h5copy.exe
- Copy the data with Python (using h5py or pytables)
An example of external links is available here:
https://stackoverflow.com/a/55399562/10462884
It shows how to create the links and then how to dereference them.
Documentation for h5copy is here:
https://support.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Copy
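As a hedged illustration (the file and object names are assumptions based on the example files in the other answers), h5copy can be run from the shell or, to keep everything in Python, via subprocess:

import subprocess

# copy dataset /data_1 from file1.h5 into merged.h5 (h5copy must be on PATH)
subprocess.run(['h5copy', '-i', 'file1.h5', '-o', 'merged.h5',
                '-s', '/data_1', '-d', '/data_1'], check=True)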
Copying with h5py or pytables is more involved.

- I figured out a method, please let me know if this is the correct way to do it: First I read my .h5 file using Pandas and then use pandas's `to_csv()` function to save it in CSV format. Combining several CSV files is much easier than combining .h5 files, and the file size stays almost the same. Is this one of the ways to do it? – ktt_11 Oct 02 '19 at 07:02
- If it works for you, then that's another option (especially if you only want to do this one time and can use the CSV file in your process). However, if I had to do this frequently or needed HDF5 downstream, I would use one of the methods above to avoid creating and combining the CSV files. – kcw78 Oct 02 '19 at 12:56
This answer is written to address @Hermi's request to merge data from several files with several groups and datasets (dated March 11, 2022). The general case is a complicated problem -- lots of error checking is required to ensure consistent group names, dataset names, and dataset properties (dtype and shape). Also, the dataset can be "extended" in multiple directions to hold the merged data (and there is no single right answer).
The code below does the following:
- If the source group/dataset doesn't exist in the merged file, it copies the data to the new file with the same group/dataset path. The dataset shape is extended to add another dimension (axis), which is resizable to add data in the future.
- If the source group/dataset exists in the merged file, the existing dataset properties are tested for compatibility. If compatible, the dataset shape is incremented by 1 on the N-dimension (axis), and the data is added to that slice.
- Note: It adds dataset attributes with the source file names (for future reference). However, it does not copy dataset or group attributes from the source files.
- It uses a modified version of the generator `h5py_dataset_iterator()` from [How to differentiate between HDF5 datasets and groups](https://stackoverflow.com/a/34401029/10462884). The generator is a better solution than `.visititems()` because of how `.visititems()` behaves with `return` and `yield`.
- Note: This procedure copies datasets, but will not copy empty Groups (those that don't have Datasets). The generator will need modification if this is required.
Code below:
import h5py
import glob

# Ref: https://stackoverflow.com/a/34401029/10462884
# with slight modifications
def h5py_dataset_iterator(g, prefix=''):
    for name, h5obj in g.items():
        path = '{}/{}'.format(prefix, name)
        if isinstance(h5obj, h5py.Dataset):  # test for dataset
            yield (h5obj, path)
        elif isinstance(h5obj, h5py.Group):  # test for group (go down)
            yield from h5py_dataset_iterator(h5obj, prefix=path)

with h5py.File('merged_h5_data.h5', mode='w') as h5w:
    for h5source in glob.iglob('file*.h5'):
        print(f'\nWorking on file: {h5source}')
        with h5py.File(h5source, mode='r') as h5r:
            for (dset, path) in h5py_dataset_iterator(h5r):
                print(f'Copying dataset from: {path}')
                ds_obj = h5r[path]
                arr_dtype = ds_obj.dtype
                arr_shape = ds_obj.shape
                # If the dataset doesn't exist, create a new dataset and copy the data.
                # Note: we can't use the .copy() method b/c we are changing shape and maxshape
                if path not in h5w:
                    h5w.create_dataset(path, data=ds_obj,
                                       shape=arr_shape + (1,), maxshape=arr_shape + (None,))
                else:
                    # Check for compatible dtype and shape
                    ds_dtype = h5w[path].dtype
                    ds_shape = h5w[path].shape
                    # If the dataset exists and is compatible, resize it and copy the data
                    if ds_dtype == arr_dtype and ds_shape[:-1] == arr_shape:
                        new_shape = ds_shape[0:-1] + (ds_shape[-1] + 1,)
                        h5w[path].resize(new_shape)
                        h5w[path][..., -1] = ds_obj
                # Add an attribute to the dataset with the source file name:
                h5w[path].attrs[f'File for index {h5w[path].shape[-1]-1}'] = h5source
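A hedged sketch of reading the merged result back (the path follows Example 1 below; the last axis indexes the source files):

with h5py.File('merged_h5_data.h5', mode='r') as h5f:
    dset = h5f['/group_1/dataset_1']
    print(dset.shape)          # last axis length == number of merged files
    arr_file1 = dset[..., 0]   # slice contributed by the first source file
    print(dict(dset.attrs))    # source-file names added during the merge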
Code to create source files used above:
Example 1: Simple schema (2 groups w/ 3 datasets)
import h5py
import numpy as np

for fcnt in range(1, 4, 1):
    fname = 'file' + str(fcnt) + '.h5'
    with h5py.File(fname, 'w') as h5fw:
        for gcnt in range(1, 3, 1):
            grp = h5fw.create_group(f'group_{gcnt}')
            for dcnt in range(1, 4, 1):
                arr = np.random.randint(0, high=255, size=100, dtype=np.uintc).reshape(10, 10)
                grp.create_dataset(f'dataset_{dcnt}', data=arr)
Example 2: Advanced schema (3 levels with Groups and Datasets)
import h5py
import numpy as np

ds_list = ['/dataset_1', '/dataset_2',
           '/group_1/group_11/dataset_1', '/group_1/group_11/dataset_2',
           '/group_1/group_12/dataset_1', '/group_1/group_12/dataset_2',
           '/group_2/dataset_1', '/group_2/dataset_2',
           '/group_3/group_31/group_311/dataset_1',
           '/group_3/group_31/group_312/dataset_1']

for fcnt in range(1, 4, 1):
    # note: these files are named 'cfile*.h5', so use a matching pattern
    # (e.g. glob.iglob('cfile*.h5')) when running the merge code on them
    fname = 'cfile' + str(fcnt) + '.h5'
    with h5py.File(fname, 'w') as h5fw:
        for name in ds_list:
            arr = np.random.randint(0, high=255, size=100, dtype=np.uintc).reshape(10, 10)
            h5fw.create_dataset(name, data=arr)

- To be honest, I'm inexperienced with h5py, so this question might be deemed trivial/stupid, but I was just wondering: couldn't one simply enumerate over all keys of the original h5py file, and thus obtain the groups, and then iterate over the keys of the groups to obtain the datasets? Would that be a valid alternative to your function `h5py_dataset_iterator`? – Hermi Mar 12 '22 at 21:30
- Your question is not trivial or dumb. Your approach works for this simple schema - only Groups at the root level, and Groups only have Datasets. Your approach **would not work** for a file with a deeper/more complicated schema. Example: 3 Groups and 2 Datasets at the root level. Group 1 has 2 more groups, each with datasets. Group 2 has 2 groups and 2 datasets. And Group 3 has 1 group with 2 more subgroups which have datasets. You need recursion to descend that object tree. A generator is perfect for that. :-) PyTables has a built-in method to do this (`Group.walk_nodes()`). – kcw78 Mar 13 '22 at 01:59
- I added a 2nd example that mimics the more complicated schema described in my previous comments. The code to merge the files works perfectly - no changes required. Note the caveat that empty groups are not copied. – kcw78 Mar 13 '22 at 15:18