
I have several HDF5 files with the same shape, each containing x and y columns. I need to append them into a single HDF5 file that contains all of the data.

My code so far:

import h5py

def append_to_h5(new_file, file_list):
    f = h5py.File(new_file, 'a')
    for file in file_list:
        with h5py.File(file, 'r') as d:
            f.create_dataset("./", data=d)
    f.close()

#new_file  <- path to the new HDF5 file
#file_list <- list of paths of the HDF5 files I want to append

The error

   in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1717, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U1')

Any ideas are appreciated. Thanks!

Jürgen K.
  • Your code doesn't make sense. `d` is the opened file. `create_dataset` is used to create (and write) ONE array. It can't be used to copy a whole file or even a group to the new file. I think you need to spend some more time reading the `h5py` docs. – hpaulj May 26 '20 at 19:46
  • The solution depends on how you want to handle the data in the datasets from each HDF5 file. For example, do you want to copy the datasets to the same dataset names in the common HDF5 file (and they have unique names)? Or do you want to extract the data from each dataset/file and append it to a single dataset in the common file? Have you considered external links (see the sketch after these comments)? Check out this answer for a review of 4 different methods: [**SO 10462884**](https://stackoverflow.com/a/58223603/10462884). There is also an answer using pytables. – kcw78 May 26 '20 at 20:16
  • Also, as hpaulj noted, your inner loop (on `d`) loops on the list of filenames. You need at least 1 more nested loop for each file to loop on the datasets at the root level (using `d.keys()`) – kcw78 May 26 '20 at 20:32
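
For illustration, here is a minimal sketch of the external-link approach kcw78 mentions above. The function name `link_files` and the group names `file_0`, `file_1`, ... are hypothetical, and it assumes the source files are reachable from wherever the new file is opened:

import h5py

def link_files(new_file, file_list):
    # Create one external link per source file; the data stays in the
    # original files and is only referenced from the new one.
    with h5py.File(new_file, 'w') as f1:
        for i, fname in enumerate(file_list):
            f1[f'file_{i}'] = h5py.ExternalLink(fname, '/')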

1 Answer


This is covered more extensively in other SO answers. I created a short example to get you started. The primary change is a loop to find and copy the top-level datasets (only). It assumes there are no dataset name conflicts and would need additional checks for a general-purpose case. Also, I changed your file object variable names.

import h5py

def append_to_h5(new_file, file_list):
    f1 = h5py.File(new_file, 'a')
    for file in file_list:
        with h5py.File(file, 'r') as f2:
            # copy every top-level dataset from the source file into the new file
            for ds in f2.keys():
                f2.copy(ds, f1)
    f1.close()

#new_file  <- path to the new HDF5 file
#file_list <- list of paths of the HDF5 files I want to append
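
For example, using the function above with two (hypothetical) source files data1.h5 and data2.h5:

# using the function defined above (filenames are hypothetical)
append_to_h5('combined.h5', ['data1.h5', 'data2.h5'])

with h5py.File('combined.h5', 'r') as f:
    print(list(f.keys()))   # dataset names copied from data1.h5 and data2.h5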
kcw78