
I have about 500 HDF5 files, each of about 1.5 GB.

Each of the files has exactly the same structure: 7 compound (int, double, double) datasets with a variable number of samples.

Now I want to concatenate all these files by concatenating each of the datasets, so that at the end I have a single 750 GB file with my 7 datasets.

Currently I am running an h5py script which:

  • creates an HDF5 file with the right datasets, with unlimited maximum size
  • opens all the files in sequence
  • checks the number of samples (as it is variable)
  • resizes the global file
  • appends the data
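
Roughly, the current script does something like the sketch below (the dataset names, the compound dtype and the input file pattern here are assumptions, not the actual code):

import glob
import h5py
import numpy as np

# Assumed layout: 7 one-dimensional compound (int, double, double) datasets per file
dt = np.dtype([("idx", np.int32), ("a", np.float64), ("b", np.float64)])
names = ["dset%d" % i for i in range(7)]

with h5py.File("merged.h5", "w") as out:
    # create the datasets with unlimited maximum size
    for name in names:
        out.create_dataset(name, shape=(0,), maxshape=(None,), dtype=dt)
    # open all the input files in sequence
    for path in sorted(glob.glob("input_*.h5")):
        with h5py.File(path, "r") as src:
            n = src[names[0]].shape[0]  # variable number of samples
            for name in names:
                dset = out[name]
                old = dset.shape[0]
                dset.resize((old + n,))           # resize the global file
                dset[old:old + n] = src[name][:]  # append the data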

This obviously takes many hours; would you have a suggestion for improving this?

I am working on a cluster, so I could use HDF5 in parallel, but I am not good enough at C programming to implement something myself; I would need a tool that is already written.

Brian Tompsett - 汤莱恩
Andrea Zonca
  • One possibility is merging together pairs of files on your cluster: reduce the problem to 250 3 GB files, then 125 6 GB files, and so on. This only helps if the partially merged files provide any amount of time saving when merging the results later on. – sarnold Mar 18 '11 at 00:02
  • @sarnold I am working on Hopper at NERSC; the theoretical I/O speed is 25 GB/s, and the filesystem is fully parallel and supports MPI I/O. – Andrea Zonca Mar 18 '11 at 03:03
  • I was thinking of reading maybe 3 or 4 files at a time and writing them back out all together, but the best would be a C utility that somehow exploits MPI I/O. – Andrea Zonca Mar 18 '11 at 03:05
  • Andrea, I am speechless. I figured an array of excellent drives still wouldn't go past a gigabyte per second... – sarnold Mar 18 '11 at 03:06
  • One feature HDF5 has is that you can "mount" several subfiles in a "folder" of the master file. That way it might not be needed to merge them all together into one file. See here: http://davis.lbl.gov/Manuals/HDF5-1.4.3/Tutor/mount.html – schoetbi Mar 19 '11 at 20:08
  • Thanks @schoetbi, but I want to concatenate the datasets in order to have a single huge array. – Andrea Zonca Mar 19 '11 at 20:12
  • @AndreaZonca Could you please post a copy of your script for this? I am currently trying to do something similar and this sounds like it would be very helpful. – okarin Jul 02 '14 at 22:25
  • See this snippet: https://gist.github.com/zonca/8e0dda9d246297616de9 – Andrea Zonca Jul 03 '14 at 16:54

3 Answers


I found that most of the time was spent resizing the file, since I was resizing at every step, so now I first go through all my files and get their length (which is variable).

Then I create the global h5 file, setting the total length to the sum of all the files.

Only after this phase do I fill the h5 file with the data from all the small files.

Now it takes about 10 seconds per file, so the whole run should take less than 2 hours, while before it was taking much longer.
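
A minimal sketch of this two-pass approach in h5py (the dataset names, compound dtype and file pattern are assumptions; the script actually used is in the gist linked in the comments above):

import glob
import h5py
import numpy as np

dt = np.dtype([("idx", np.int32), ("a", np.float64), ("b", np.float64)])  # assumed layout
names = ["dset%d" % i for i in range(7)]
files = sorted(glob.glob("input_*.h5"))

# First pass: read only the (variable) length of each input file.
lengths = []
for path in files:
    with h5py.File(path, "r") as src:
        lengths.append(src[names[0]].shape[0])
total = sum(lengths)

with h5py.File("merged.h5", "w") as out:
    # Create the output datasets once, already at their final size.
    for name in names:
        out.create_dataset(name, shape=(total,), dtype=dt)
    # Second pass: copy each input file into its slot; no resizing needed.
    offset = 0
    for path, n in zip(files, lengths):
        with h5py.File(path, "r") as src:
            for name in names:
                out[name][offset:offset + n] = src[name][:]
        offset += n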

Andrea Zonca

I get that answering this earns me a necro badge - but things have improved for me in this area recently.

In Julia this takes a few seconds.

  1. Create a text file that lists all the HDF5 file paths (you can use bash to do this in one go if there are lots of them)
  2. In a loop, read each line of the text file and use label$i = h5read(original_filepath$i, "/label")
  3. Concatenate all the labels: label = [label label$i]
  4. Then just write: h5write(data_file_path, "/label", label)

The same can be done if you have groups or more complicated HDF5 files.

ashley

Ashley's answer worked well for me. Here is an implementation of her suggestion in Julia:

Make a text file listing the files to concatenate, in bash:

ls -rt $somedirectory/$somerootfilename-*.hdf5 >> listofHDF5files.txt

Write a Julia script to concatenate the multiple files into one file:

# concatenate_HDF5.jl
# Usage: julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
using HDF5

inputfilepath = ARGS[1]
outputfilepath = ARGS[2]

# Read the list of input files, one path per line (readlines strips the newlines).
filepaths = readlines(inputfilepath)

data = nothing
for r in filepaths
    println(r)
    datai = h5read(r, "/data")
    if data === nothing
        global data = datai
    else
        # In this case, concatenating on the 4th dimension
        global data = cat(data, datai; dims=4)
    end
end

h5write(outputfilepath, "/data", data)

Then execute the script file above using:

julia concatenate_HDF5.jl listofHDF5files.txt final_concatenated_HDF5.hdf5
user185160