I have an h5 file that contains a dataset like this:

col1       col2       col3
 1           3          5
 5           4          9
 6           8          0
 7           2          5
 2           1          2

I have another h5 file with the same columns:

col1       col2       col3
 6           1          9
 8           2          7

and I would like to concatenate these two to have the following h5 file:

col1       col2       col3
 1           3          5
 5           4          9
 6           8          0
 7           2          5
 2           1          2
 6           1          9
 8           2          7

What is the most efficient way to do this if the files are huge or there are many such merges to perform?

A.M.
  • `h5_1.append(h5_2)`? – wwnde Feb 03 '21 at 21:40
  • Are they pandas dataframes? If so, `h5_concat = pandas.concat([h5_1, h5_2])`. By the way, this is **not** merging; it is concatenation. – Paulo Marques Feb 03 '21 at 21:41
  • they are not pandas dataframes. They are two h5 files. – A.M. Feb 03 '21 at 21:41
  • `pd.concat([h5_1,h5_2], axis=0)` – wwnde Feb 03 '21 at 21:43
  • @wwnde are you suggesting to turn h5 files into pandas dataframe first? – A.M. Feb 03 '21 at 21:44
  • I thought that's why you tagged your question `pandas`. – wwnde Feb 03 '21 at 21:46
  • I tagged it because I thought it might be the case that the best way is to convert to pandas dataframe first but I am not sure about this. – A.M. Feb 03 '21 at 21:47
  • If you have `numpy` arrays, you can read from and write to the file with `h5py`, and work directly with `group` and `dataset` objects. Some datasets can 'grow'; otherwise you need to read the datasets, combine them in `numpy`, and write new ones. `pandas` uses `tables` to read/write `HDF5` files. For an `h5py/numpy` question, state clearly the datasets' `shape` and `dtype`, and if possible include sample code that writes or reads them. – hpaulj Feb 03 '21 at 21:57

1 Answer

I'm not familiar with pandas, so I can't help there. This can be done with h5py or pytables. As @hpaulj mentioned, the process reads the dataset into a numpy array and then writes it to an HDF5 dataset with h5py. The exact process depends on the dataset's maxshape attribute, which controls whether the dataset can be resized.
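
As a quick way to see which case applies, here is a minimal sketch that inspects `maxshape` and `chunks` before any data is moved. The helper name `can_grow` and the dataset names in the demo are made up for illustration; only `Dataset.maxshape`, `Dataset.chunks`, and `create_dataset` come from h5py.

```python
import os
import tempfile

import h5py
import numpy as np


def can_grow(dset, new_len, axis=0):
    """Return True if `dset` could be resized to `new_len` along `axis`.

    Hypothetical helper, not part of h5py.
    """
    # Datasets created without maxshape= are not chunked and cannot be resized.
    if dset.chunks is None:
        return False
    limit = dset.maxshape[axis]
    # None means unlimited; otherwise the limit must cover the new length.
    return limit is None or limit >= new_len


def demo_can_grow():
    """Create three small datasets in a temporary file and test each one."""
    arr = np.arange(15).reshape(5, 3)
    with tempfile.TemporaryDirectory() as tmp:
        with h5py.File(os.path.join(tmp, "demo.h5"), "w") as f:
            fixed = f.create_dataset("fixed", data=arr)
            growable = f.create_dataset("growable", data=arr, maxshape=(None, 3))
            capped = f.create_dataset("capped", data=arr, maxshape=(6, 3))
            return (can_grow(fixed, 7), can_grow(growable, 7),
                    can_grow(capped, 7), can_grow(capped, 6))
```

If the check returns False for the target dataset, the data has to be combined into a new dataset; if it returns True, the dataset can be resized and appended to in place.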

I created examples to show both methods (fixed-size or resizable dataset). The first method creates a new file3 that combines the values from file1 and file2. The second method adds the values from file2 to file1e (which is resizable). Note: code to create the files used in the examples is at the end.

I have a longer answer on SO that shows all the ways to copy data.
See this Answer: How can I combine multiple .h5 file?

Method 1: Combine datasets into a new file
Required when the datasets were not created with the maxshape= parameter

import h5py

with h5py.File('file1.h5','r') as h5f1,  \
     h5py.File('file2.h5','r') as h5f2,  \
     h5py.File('file3.h5','w') as h5f3 :

    print (h5f1['ds_1'].shape, h5f1['ds_1'].maxshape)
    print (h5f2['ds_2'].shape, h5f2['ds_2'].maxshape)

    # Row counts of the two sources and of the combined dataset
    arr1_a0 = h5f1['ds_1'].shape[0]
    arr2_a0 = h5f2['ds_2'].shape[0]
    arr3_a0 = arr1_a0 + arr2_a0
    # Allocate the full output once; maxshape=(None,3) keeps it resizable
    h5f3.create_dataset('ds_3', dtype=h5f1['ds_1'].dtype,
                        shape=(arr3_a0,3), maxshape=(None,3))

    # Read the file1 data into a numpy array and write it to the first rows
    xfer_arr1 = h5f1['ds_1'][:]
    h5f3['ds_3'][0:arr1_a0, :] = xfer_arr1

    # Read the file2 data and write it to the remaining rows
    xfer_arr2 = h5f2['ds_2'][:]
    h5f3['ds_3'][arr1_a0:arr3_a0, :] = xfer_arr2

    print (h5f3['ds_3'].shape, h5f3['ds_3'].maxshape)
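
Since the question mentions huge files, here is a hedged variation on Method 1 that copies a limited number of rows at a time instead of reading each dataset in one piece. It is only a sketch: the function name `concat_in_blocks`, the `block_rows` parameter, and the temporary file names are assumptions for illustration, and it assumes 2-D datasets with the same number of columns and the same dtype.

```python
import os
import tempfile

import h5py
import numpy as np


def concat_in_blocks(sources, out_path, out_name, block_rows=100_000):
    """Concatenate 2-D datasets along axis 0 into a new file, block by block.

    `sources` is a list of (file_path, dataset_name) pairs.
    """
    # First pass: read only shapes and dtype so the output is allocated once.
    row_counts = []
    for path, name in sources:
        with h5py.File(path, "r") as f:
            row_counts.append(f[name].shape[0])
            n_cols, dtype = f[name].shape[1], f[name].dtype

    with h5py.File(out_path, "w") as out:
        dst = out.create_dataset(out_name, shape=(sum(row_counts), n_cols),
                                 dtype=dtype, maxshape=(None, n_cols))
        offset = 0
        # Second pass: copy at most `block_rows` rows at a time.
        for path, name in sources:
            with h5py.File(path, "r") as f:
                src = f[name]
                for start in range(0, src.shape[0], block_rows):
                    stop = min(start + block_rows, src.shape[0])
                    dst[offset:offset + (stop - start), :] = src[start:stop, :]
                    offset += stop - start


def demo_concat_in_blocks():
    """Rebuild the two sample arrays and concatenate them in blocks of 2 rows."""
    arr1 = np.array([[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]])
    arr2 = np.array([[6, 1, 9], [8, 2, 7]])
    with tempfile.TemporaryDirectory() as tmp:
        p1, p2, p3 = (os.path.join(tmp, n) for n in ("f1.h5", "f2.h5", "f3.h5"))
        with h5py.File(p1, "w") as f:
            f.create_dataset("ds_1", data=arr1)
        with h5py.File(p2, "w") as f:
            f.create_dataset("ds_2", data=arr2)
        concat_in_blocks([(p1, "ds_1"), (p2, "ds_2")], p3, "ds_3", block_rows=2)
        with h5py.File(p3, "r") as f:
            return f["ds_3"][:]
```

Peak memory then depends on `block_rows` rather than on the size of the largest source dataset. A good block size depends on the data and the storage, so the default here is only a placeholder.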

Method 2: Append the file2 dataset to the file1 dataset
The datasets in file1e must be created with the maxshape= parameter

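import h5py

with h5py.File('file1e.h5','r+') as h5f1, \
     h5py.File('file2.h5','r') as h5f2 :

    print (h5f1['ds_1e'].shape, h5f1['ds_1e'].maxshape)
    print (h5f2['ds_2'].shape, h5f2['ds_2'].maxshape)

    # Current row count, rows to add, and the row count after appending
    arr1_a0 = h5f1['ds_1e'].shape[0]
    arr2_a0 = h5f2['ds_2'].shape[0]
    arr3_a0 = arr1_a0 + arr2_a0

    # Grow the existing dataset along axis 0 to make room for the new rows
    h5f1['ds_1e'].resize(arr3_a0,axis=0)

    # Read the file2 data into a numpy array and write it into the new rows
    xfer_arr2 = h5f2['ds_2'][:]
    h5f1['ds_1e'][arr1_a0:arr3_a0, :] = xfer_arr2

    print (h5f1['ds_1e'].shape, h5f1['ds_1e'].maxshape)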
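
For the "many merges" part of the question, one option is to apply Method 2 to several source files in a single pass and resize the target only once. The sketch below assumes the target dataset was created with maxshape=(None, 3) as in file1e; the function name `append_many` and the temporary file names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def append_many(target_path, target_name, sources):
    """Append several 2-D datasets to one resizable dataset, resizing once.

    `sources` is a list of (file_path, dataset_name) pairs.
    """
    # Count the incoming rows first so a single resize is enough.
    extra_rows = 0
    for path, name in sources:
        with h5py.File(path, "r") as f:
            extra_rows += f[name].shape[0]

    with h5py.File(target_path, "r+") as tgt:
        dst = tgt[target_name]
        offset = dst.shape[0]
        dst.resize(offset + extra_rows, axis=0)
        for path, name in sources:
            with h5py.File(path, "r") as f:
                data = f[name][:]  # reads one whole source into memory
                dst[offset:offset + data.shape[0], :] = data
                offset += data.shape[0]
        return dst.shape


def demo_append_many():
    """Append two small files to a resizable copy of the first sample array."""
    arr1 = np.array([[1, 3, 5], [5, 4, 9], [6, 8, 0], [7, 2, 5], [2, 1, 2]])
    arr2 = np.array([[6, 1, 9], [8, 2, 7]])
    arr3 = np.array([[0, 0, 1]])
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.h5")
        src_a = os.path.join(tmp, "a.h5")
        src_b = os.path.join(tmp, "b.h5")
        with h5py.File(target, "w") as f:
            f.create_dataset("ds_1e", data=arr1, maxshape=(None, 3))
        with h5py.File(src_a, "w") as f:
            f.create_dataset("ds", data=arr2)
        with h5py.File(src_b, "w") as f:
            f.create_dataset("ds", data=arr3)
        shape = append_many(target, "ds_1e", [(src_a, "ds"), (src_b, "ds")])
        with h5py.File(target, "r") as f:
            return shape, f["ds_1e"][:].tolist()
```

Each source is still read in one piece here, so very large sources would need the same kind of block-wise slicing on the read side.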

Code to create the example files used above:

import h5py
import numpy as np

arr1 = np.array([[ 1, 3, 5 ],
                 [ 5, 4, 9 ],
                 [ 6, 8, 0 ],
                 [ 7, 2, 5 ],
                 [ 2, 1, 2 ]] )

with h5py.File('file1.h5','w') as h5f:
    h5f.create_dataset('ds_1',data=arr1)
    print (h5f['ds_1'].maxshape)   
    
with h5py.File('file1e.h5','w') as h5f:
    h5f.create_dataset('ds_1e',data=arr1, shape=(5,3), maxshape=(None,3))
    print (h5f['ds_1e'].maxshape)             
                 
arr2 = np.array([[ 6, 1, 9 ],
                 [ 8, 2, 7 ]] )
                 
with h5py.File('file2.h5','w') as h5f:
    h5f.create_dataset('ds_2',data=arr2)
kcw78
  • h5 files store data in datasets. `h5f1.keys()` yields the object names at the root level. In your case they are datasets named 'col1', 'col2', 'col3'. Does `h5f2.keys()` yield the same names? If so, do you want to append the data from `h5f2['col1']` to `h5f1['col1']`, and do the same for 'col2' and 'col3'? If so, it's the same process for 3 datasets. Do I need to modify my example to show how to iterate through the keys/datasets? It will be "slightly more complicated". – kcw78 Feb 04 '21 at 17:01
  • Thanks for your answer. Would you please let me know if there is any way to append `h5f2['col1']` to `h5f1['col1']` directly, instead of creating a new dataset `h5f3['col1']` and adding these two to it sequentially? – A.M. Feb 04 '21 at 17:13
  • The 2nd part of the example does that. It opens `'file1e.h5'` in read/write mode (`r+`), resizes the dataset, then appends data from `'file2.h5'`. Appending to a dataset requires that it was defined as "resizable" when it was initially created (using the `maxshape=` parameter as shown in the example). The value for the 0-axis has to be either: a) `None`, which allows unlimited size, or b) a value at least as large as the combined length of `h5f1['col1']` and `h5f2['col1']`. You need to check this attribute for all 3 datasets in your file. – kcw78 Feb 04 '21 at 17:38
  • The second part is what I was looking for. Thank you so much for the help. – A.M. Feb 04 '21 at 19:47
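
The per-dataset case discussed in the comments above (datasets named 'col1', 'col2', 'col3' in both files) could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's code: the function name `append_matching_datasets` is made up, and it assumes every target dataset was created with `maxshape=` so that it is resizable along axis 0.

```python
import os
import tempfile

import h5py
import numpy as np


def append_matching_datasets(target_path, source_path):
    """Append each root-level dataset in the source to its namesake in the target."""
    with h5py.File(target_path, "r+") as tgt, h5py.File(source_path, "r") as src:
        for name, obj in src.items():
            # Skip groups and any dataset that has no counterpart in the target.
            if not isinstance(obj, h5py.Dataset) or name not in tgt:
                continue
            dst = tgt[name]
            old_len = dst.shape[0]
            new_len = old_len + obj.shape[0]
            limit = dst.maxshape[0]
            # The target must be chunked and its axis-0 limit must cover new_len.
            if dst.chunks is None or (limit is not None and limit < new_len):
                raise ValueError(f"dataset {name!r} cannot be resized to {new_len}")
            dst.resize(new_len, axis=0)
            dst[old_len:new_len] = obj[:]


def demo_append_matching_datasets():
    """Store the sample columns as separate 1-D datasets and append them."""
    cols1 = {"col1": [1, 5, 6, 7, 2], "col2": [3, 4, 8, 2, 1], "col3": [5, 9, 0, 5, 2]}
    cols2 = {"col1": [6, 8], "col2": [1, 2], "col3": [9, 7]}
    with tempfile.TemporaryDirectory() as tmp:
        p1 = os.path.join(tmp, "file1.h5")
        p2 = os.path.join(tmp, "file2.h5")
        with h5py.File(p1, "w") as f:
            for name, values in cols1.items():
                f.create_dataset(name, data=np.array(values), maxshape=(None,))
        with h5py.File(p2, "w") as f:
            for name, values in cols2.items():
                f.create_dataset(name, data=np.array(values))
        append_matching_datasets(p1, p2)
        with h5py.File(p1, "r") as f:
            return {name: f[name][:].tolist() for name in f.keys()}
```

Raising an error for a non-resizable dataset is just one possible choice; falling back to Method 1 for that dataset would be another.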