
I have many HDF5 files, each containing a single dataset. I want to combine them into one dataset in a single file, with all of the data in the same volume (each file is an image, and I want one large timelapse image).

I wrote a Python script to extract the data from each file as a NumPy array, store the arrays, and then write them to a new .h5 file. However, this approach does not work because the combined data uses more than the 32 GB of RAM that I have.
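A sketch of that approach (file_list and the /dataset path are assumptions based on the h5copy commands below):

import h5py
import numpy as np

# load every image into memory at once; the whole stack must fit in RAM
arrays = []
for fname in file_list:
    with h5py.File(fname, 'r') as f:
        arrays.append(f['/dataset'][...])
stacked = np.stack(arrays)  # this is where the 32 GB runs out

with h5py.File('combined.h5', 'w') as out:
    out.create_dataset('new_data', data=stacked)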

I also tried using h5copy, the command line tool.

h5copy -i file1.h5 -o combined.h5 -s '/dataset' -d '/new_data/t1'
h5copy -i file2.h5 -o combined.h5 -s '/dataset' -d '/new_data/t2'

This works, but it results in many separate datasets within the new file rather than a single dataset with all of the data in series.

1 Answer


Although you can't explicitly append rows to an HDF5 dataset, you can use the maxshape keyword to your advantage when creating your dataset, in a way that will allow you to 'resize' the dataset to accommodate new data. (See http://docs.h5py.org/en/latest/faq.html#appending-data-to-a-dataset)

Your code will end up looking something like this, assuming the number of columns in your dataset is always the same:

import h5py

output_file = h5py.File('your_output_file.h5', 'w')

# keep track of the total number of rows written so far
total_rows = 0

for n, f in enumerate(file_list):  # file_list: your list of input files
    your_data = <get your data from f>  # placeholder; see the comments below
    total_rows = total_rows + your_data.shape[0]
    total_columns = your_data.shape[1]

    if n == 0:
        # first file: create the dataset with an unlimited maxshape
        # so that it can be resized later
        dset = output_file.create_dataset("Name", (total_rows, total_columns), maxshape=(None, None))
        # fill the first section of the dataset
        dset[:, :] = your_data
        where_to_start_appending = total_rows

    else:
        # resize the dataset to accommodate the new data
        dset.resize(total_rows, axis=0)
        dset[where_to_start_appending:total_rows, :] = your_data
        where_to_start_appending = total_rows

output_file.close()
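Applied to the timelapse case in the question, where each file holds a single 2D image, the same resize pattern can stack the images along a new leading time axis. A minimal sketch, assuming every image lives at /dataset (as in the question's h5copy commands) and that all images have the same shape:

import h5py

file_list = ['file1.h5', 'file2.h5']  # your input files, in time order

with h5py.File('combined.h5', 'w') as out:
    dset = None
    for t, fname in enumerate(file_list):
        with h5py.File(fname, 'r') as src:
            image = src['/dataset'][...]  # one 2D image per file
        if dset is None:
            # create a (time, height, width) dataset that can grow along time
            dset = out.create_dataset('timelapse',
                                      shape=(1,) + image.shape,
                                      maxshape=(None,) + image.shape,
                                      dtype=image.dtype,
                                      chunks=(1,) + image.shape)  # one image per chunk
        else:
            dset.resize(t + 1, axis=0)
        dset[t] = image

Only one image is in memory at a time, so this stays well under the 32 GB limit, and chunking by whole images keeps each resize and write cheap.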
Heather QC
  • what is <get your data from f>? – Jul 20 '17 at 08:34
  • It would be whatever command or steps you need to do to get your data from each file and will be dependent on what kind of file it is. For example, if you are working with a list of HDF5 files, it requires using h5py.File to create a file object and then reading data from the file with something like file_object["dataset_name"][slice] – Heather QC Jul 21 '17 at 12:25
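As a concrete version of the comment above, a minimal sketch of reading one file's data with h5py (the file name and the /dataset path are taken from the question):

import h5py

# open one source file read-only and load its dataset into a NumPy array;
# slicing with [:] reads the whole dataset, a smaller slice reads part of it
with h5py.File('file1.h5', 'r') as file_object:
    your_data = file_object['/dataset'][:]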