
I have several HDF5 files, each of which has a /dataset that contains vectors. I would like to combine all these vectors into a single dataset in one file (that is, repeatedly append from one file to another). The combined dataset would have chunked storage and be resizable.

Every option I've seen for doing this seems to require reading all the data into a buffer and then writing it back out. Is there a simpler way to pass a dataset/dataspace from one file to another in order to append the data?

evolvedmicrobe
  • I'm afraid you will have to read each `dataset` (from each file) before adding it to the combined dataset. That said, would you mind telling us whether each `dataset` has the same data type and dimensions? If so, I can post a solution in either `C#` or `R` (which seem to be the languages you post about most) that solves your issue. – SOG Jun 17 '21 at 22:11
  • Each dataset has the same datatype and is a one-dimensional vector, but the size of that vector differs across datasets. – evolvedmicrobe Jun 17 '21 at 22:41

2 Answers


Have you investigated the h5py Group .copy() method? Although documented as a group action, it works with any h5py object (groups, datasets, links and references). By default it copies object attributes, and it supports recursive copying of group members. If you prefer a command-line tool, the HDF Group provides one. Take a look at h5copy here: HDF5 Group h5copy doc
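For reference, a typical h5copy invocation looks something like this (the file and object names below are placeholders, not taken from the question):

h5copy -i SO_68025342_1.h5 -o SO_68025342_all.h5 -s /dataset -d /dataset_1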

Here is an example that demonstrates a simple h5py .copy() implementation. It creates a set of 3 files, each with 1 dataset (named /dataset, dtype=float, shape=(10,10)). It then creates a NEW HDF5 file and loops over the previous files, copying the dataset from each "read" file (h5r) to the new "write" file (h5w).

import h5py
import numpy as np

# create 3 example files, each with one 10x10 float dataset named 'dataset'
for i in range(1, 4):
    with h5py.File('SO_68025342_' + str(i) + '.h5', mode='w') as h5f:
        arr = np.random.random(100).reshape(10, 10)
        h5f.create_dataset('dataset', data=arr)

# copy /dataset from each source file into the new file under a unique name
with h5py.File('SO_68025342_all.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            h5r.copy('dataset', h5w, name='dataset_' + str(i))

Here is a method to copy data from multiple files to a single dataset in the merged file. It comes with caveats: 1) all datasets must have the same shape, and 2) you must know the number of datasets in advance in order to size the new dataset. (If not, you can create a resizable dataset by adding maxshape=(None,a0,a1) and then use .resize() as needed; a sketch of that variant follows the code below.) I have another post with 2 examples here: How can I combine multiple .h5 file? Look at Methods 3a and 3b.

with h5py.File('SO_68025342_merge.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            if 'dataset' not in h5w.keys():
                # size the merged dataset from the first file's shape
                a0, a1 = h5r['dataset'].shape
                h5w.create_dataset('dataset', shape=(3, a0, a1))
            # write each source dataset into its own slot along axis 0
            h5w['dataset'][i - 1, :] = h5r['dataset']
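If you don't know the number of files in advance, here is a minimal sketch of the resizable variant mentioned above (it reuses the example file names; maxshape and .resize() are standard h5py, but adapt the shapes to your data):

import h5py

with h5py.File('SO_68025342_merge.h5', mode='w') as h5w:
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            if 'dataset' not in h5w.keys():
                a0, a1 = h5r['dataset'].shape
                # first axis is unlimited so the dataset can grow
                h5w.create_dataset('dataset', shape=(1, a0, a1),
                                   maxshape=(None, a0, a1))
            else:
                # grow the first axis by one slot before writing
                h5w['dataset'].resize(h5w['dataset'].shape[0] + 1, axis=0)
            h5w['dataset'][-1, :] = h5r['dataset']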

Assuming your files aren't so conveniently named, you can use glob.iglob() to loop over the file names to read. Then use .keys() to get the dataset names in each file. Also, if all of your datasets really are named /dataset, you need to come up with a naming convention for the new datasets, as the sketch below shows.
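For example, here is a minimal sketch using glob.iglob() and .keys() (the *.h5 pattern and the renaming scheme are assumptions; adjust them to your file layout):

import glob
import h5py

with h5py.File('SO_68025342_all.h5', mode='w') as h5w:
    for fname in glob.iglob('*.h5'):
        if fname == 'SO_68025342_all.h5':
            continue  # don't copy from the output file itself
        with h5py.File(fname, mode='r') as h5r:
            for ds_name in h5r.keys():
                # prefix with the source file name to avoid name collisions
                new_name = fname.rsplit('.', 1)[0] + '_' + ds_name
                h5r.copy(ds_name, h5w, name=new_name)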

Here is a link to the h5py docs with more details: h5py Group .copy() method

kcw78
  • This is a really good solution for adding them all into one file, but I should have clarified (and have now edited the question to do so) that I want to concatenate the separate datasets into one dataset within the output file. So the end result would be one dataset in one file, instead of multiple datasets in one file. – evolvedmicrobe Jun 17 '21 at 22:08
  • OK, I'll see you and raise you. :-). I modified my answer to demonstrate merging data. This takes more care to keep everything aligned, and not overwrite previous data. Also, do you need to track the source? If so, look into attributes to tag the source file for each index. I have an example of that 'somewhere' on SO. – kcw78 Jun 17 '21 at 22:50
  • Thank you! Your solution is actually similar to what I was originally trying. I believe, though, that `h5r['dataset'][:]` will always allocate a numpy array and fill it with all of the data before copying it over (or at least it does when I try), and I'm trying to avoid loading that much data at once, or manually chunking it for reads/writes. – evolvedmicrobe Jun 17 '21 at 23:20
  • I was wondering if _"combine without an intermediate buffer"_ really meant don't load data into memory. I don't think that's possible. I'm not aware of any technique to transfer data between files without an in-memory copy of the data. I suspect your real problem is copying very large datasets (that don't fit in memory) and you need a way to copy them incrementally. I _**know**_ I have posted an answer in SO that shows how to do this -- but can't find it tonight. It's basically the same technique, but using NumPy slice notation to read and write slices of the dataset. – kcw78 Jun 18 '21 at 01:11
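
For completeness, here is a minimal sketch of the slice-based incremental copy described in the last comment, for the 1-D case from the question (the block size, dtype and file names are illustrative; each source vector is appended to a resizable dataset without ever being loaded whole):

import h5py

BLOCK = 1_000_000  # max elements held in memory per copy step

with h5py.File('SO_68025342_concat.h5', mode='w') as h5w:
    dset_out = h5w.create_dataset('dataset', shape=(0,), maxshape=(None,),
                                  dtype='f8', chunks=True)
    for i in range(1, 4):
        with h5py.File('SO_68025342_' + str(i) + '.h5', mode='r') as h5r:
            dset_in = h5r['dataset']
            n = dset_in.shape[0]
            start = dset_out.shape[0]
            dset_out.resize((start + n,))
            # copy in blocks so only BLOCK elements are in memory at once
            for lo in range(0, n, BLOCK):
                hi = min(lo + BLOCK, n)
                dset_out[start + lo : start + hi] = dset_in[lo:hi]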

If you are not bound to a particular library or programming language, one way to solve your issue is with HDFql (in C, C++, Java, Python, C#, Fortran or R).

Given that your posts seem to mention C# quite often, find below a solution in C#. It assumes that 1) the dataset name is dset, 2) each dataset is of data type float, and 3) each dataset is a one-dimensional vector (of size 100) - feel free to adapt the code to your concrete use-case:

// declare variable
float[] data = new float[100];

// retrieve all file names (from current directory) that end with '.h5'
HDFql.Execute("SHOW FILE LIKE \\.h5$");

// create an HDF5 file named 'output.h5' and use (i.e. open) it
HDFql.Execute("CREATE AND USE FILE output.h5");

// create a chunked and extendible HDF5 dataset named 'dset' in file 'output.h5'
HDFql.Execute("CREATE CHUNKED(100) DATASET dset AS FLOAT(0 TO UNLIMITED)");

// register variable 'data' for subsequent usage (by HDFql)
HDFql.VariableRegister(data);

// loop cursor and process each file found
while(HDFql.CursorNext() == HDFql.Success)
{
   // alter (i.e. extend) dataset 'dset' (from file 'output.h5') with 100 more floats
   HDFql.Execute("ALTER DIMENSION dset TO +100");

   // select (i.e. read) dataset 'dset' (from file found) and populate variable 'data'
   HDFql.Execute("SELECT FROM \"" + HDFql.CursorGetChar() + "\" dset INTO MEMORY " + HDFql.VariableGetNumber(data));

   // insert (i.e. write) values stored in variable 'data' into dataset 'dset' (from file 'output.h5') at the end of it (using a hyperslab)
   HDFql.Execute("INSERT INTO dset(-1:::) VALUES FROM MEMORY " + HDFql.VariableGetNumber(data));
}
SOG