
The simple mapping that h5py provides (through HDF5) for persisting datasets on disk is exceptional. I run some analysis on a set of files and store the result of each into a dataset, one per file. At the end of this step, I have a set of h5py.Dataset objects containing 2D arrays. The arrays all have the same number of columns but different numbers of rows, i.e. shapes (A, N), (B, N), (C, N), etc.

I would now like to access these multiple 2D arrays as a single 2D array. That is, I would like to read them on demand as an array of shape (A+B+C, N).

For this purpose, the h5py.Link classes do not help, as they work at the level of HDF5 nodes.
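For instance, a soft link merely aliases an existing node: f['alias'] below is the whole dataset again, not a view of selected rows (a quick illustration with made-up names):

import numpy as np
import h5py

with h5py.File('links_demo.h5', 'w') as f:
    f.create_dataset('b', data=np.random.random((300, 50)))
    f['alias'] = h5py.SoftLink('/b')  # a link aliases the entire node...
    print(f['alias'].shape)           # (300, 50): the whole dataset,
                                      # not a stacked or partial view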

Here is some pseudocode:

import numpy as np
import h5py

# (pseudocode: in reality these datasets live inside an h5py.File)
a = h5py.Dataset('a', data=np.random.random((100, 50)))
b = h5py.Dataset('b', data=np.random.random((300, 50)))
c = h5py.Dataset('c', data=np.random.random((253, 50)))

# I want to view these arrays as a single array, stacked along axis 0
combined = magic_array_linker([a, b, c], axis=0)
assert combined.shape == (100+300+253, 50)

For my purposes, suggestions to copy the arrays into a new file do not work. I'm also open to solving this at the numpy level, but I can't find any suitable option with numpy.view or numpy.concatenate that would work without copying out the data.
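For instance, np.concatenate does accept the h5py datasets directly, but only because numpy reads each dataset fully into memory first (a quick check with throwaway data):

import numpy as np
import h5py

with h5py.File('concat_demo.h5', 'w') as f:
    a = f.create_dataset('a', data=np.random.random((100, 50)))
    b = f.create_dataset('b', data=np.random.random((300, 50)))
    c = f.create_dataset('c', data=np.random.random((253, 50)))

    # numpy converts each dataset via np.asarray, reading it entirely
    # from disk; the result is an ordinary in-memory ndarray
    combined = np.concatenate([a, b, c], axis=0)
    print(type(combined), combined.shape)  # <class 'numpy.ndarray'> (653, 50)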

Does anybody know of a way to view multiple arrays as a stacked set of arrays, without copying and from h5py.Dataset?

achennu
    Explain more of what you are trying to do with this data ('read on demand') and why a numpy concatenate (the most obvious solution) is wrong. – hpaulj Feb 21 '16 at 17:36
  • Is there a compelling reason not to store all of your data in a single 2D `Dataset`, then access the individual sub-arrays using slice indexing? – ali_m Feb 21 '16 at 18:04
  • @ali_m, the individual datasets are generated through incremental analysis, and sometimes new datasets are created too. If I had to concatenate and store all the datasets, then I would have to do this each time a dataset is created or changed, and a single one may change often. Also, reading from separate files allows an easy way of filtering out or selecting datasets. I'd appreciate any insights into how to chain the datasets. – achennu Feb 22 '16 at 12:29
  • @hpaulj, a concatenation of numpy arrays would certainly be my fallback solution if I cannot find a way. For instance, if I run `np.concatenate([a, b, c])` on the h5py datasets, does this read them into memory? If so, then this solution would not work for large amounts of data. – achennu Feb 22 '16 at 12:31
  • Does anybody have suggestions for this? – achennu Mar 03 '16 at 11:48
  • You could create an array of object or region references in the HDF5 file. However, indexing this array would return the references, and not actually index the referenced arrays themselves, so you would probably want to create a wrapper class in python (either in addition to or instead of the array of references). Also, I'm not sure there's a way of slicing the combined array without also copying the data when you concatenate the subarrays, but the worst case is that you copy the data you want to view and not all the subarrays. – Yossarian Mar 21 '16 at 11:03
  • @Yossarian, that sounds promising. Do you have any code/docs to share that specify how that is done? I'm not sure I quite understand exactly how HDF5 references can be used for my application. – achennu Mar 22 '16 at 21:54
  • @achennu if you can create raw data blocks, and you do not need data compression or chunks, you can use H5P.set_external to achieve this effect (see the sketch after these comments). – Wang Sep 16 '16 at 15:32
  • @Wang, thanks for that pointer. I have no MATLAB in my workflow as of now. Not sure I would switch over from python for this issue. – achennu Sep 17 '16 at 13:25
  • I think this is standard HDF5 API; h5py should support this also. You can try dask too, but it has not fully implemented all numpy operations. – Wang Sep 17 '16 at 13:41
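
A minimal sketch of the external-storage route Wang mentions, assuming h5py ≥ 2.9 (which exposes HDF5's H5Pset_external through the external= argument of create_dataset); the raw file names and sizes here are made up, and this only works for uncompressed, unchunked data:

import numpy as np
import h5py

N = 50
rows = [100, 300, 253]
itemsize = np.dtype('float64').itemsize

# Write each subarray as a raw, contiguous binary file (no compression,
# no chunking -- both are required for external storage)
for i, r in enumerate(rows):
    np.random.random((r, N)).tofile('sub%d.raw' % i)

with h5py.File('combined.h5', 'w') as f:
    # Each tuple is (file name, byte offset, size in bytes); HDF5 serves
    # the single dataset directly from these files, stitched in order
    ext = [('sub%d.raw' % i, 0, r * N * itemsize) for i, r in enumerate(rows)]
    dset = f.create_dataset('combined', shape=(sum(rows), N),
                            dtype='float64', external=ext)
    print(dset[104, 4])  # row 104 is served from sub1.raw; no bulk copy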

1 Answer


First up, I don't think there is a way to do this without copying the data in order to return a single array. As far as I can tell, it's not possible to concatenate numpy views into one array - unless, of course, you create your own wrapper.

Here I demonstrate a proof of concept using Object/Region references. The basic premise is that we make a new dataset in the file which is an array of references to the constituent subarrays. By storing references like this, the subarrays can change size dynamically and indexing the wrapper will always index the correct subarrays.

As this is just a proof of concept, I haven't implemented proper slicing, just very simple indexing. There's also no attempt at error checking - this will almost definitely break in production.

class MagicArray(object):
    """Magically index an array of references
    """
    def __init__(self, file, references, axis=0):
        self.file = file
        self.references = references
        self.axis = axis

    def __getitem__(self, items):
        # We need to modify the indices, so normalise to a mutable list
        # (a lone integer index arrives as a scalar, not a tuple)
        if not isinstance(items, tuple):
            items = (items,)
        items = list(items)

        for item in items:
            if hasattr(item, 'start'):
                # item is a slice object
                raise ValueError('Slices not implemented')

        for ref in self.references:
            size = self.file[ref].shape[self.axis]

            # Check if the requested index is in this subarray
            # If not, subtract the subarray size and move on
            if items[self.axis] < size:
                item_ref = ref
                break
            else:
                items[self.axis] = items[self.axis] - size

        return self.file[item_ref][tuple(items)]

Here's how you use it:

with h5py.File("/tmp/so_hdf5/test.h5", 'w') as f:
    a = f.create_dataset('a',data=np.random.random((100, 50)))
    b = f.create_dataset('b',data=np.random.random((300, 50)))
    c = f.create_dataset('c',data=np.random.random((253, 50)))

    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    ref_dataset = f.create_dataset("refs", (3,), dtype=ref_dtype)

    for i, dset in enumerate([a, b, c]):
        ref_dataset[i] = dset.ref

with h5py.File("/tmp/so_hdf5/test.h5", 'r') as f:
    foo = MagicArray(f, f['refs'], axis=0)
    print(foo[104, 4])
    print(f['b'][4,4])

This should be fairly trivial to extend to fancier indexing (i.e. being able to handle slices), but I can't see how to do so without copying data.
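
As a rough illustration of the copying variant (assuming axis=0 and a unit step; read_rows is a hypothetical helper, not part of the class above), a slice along the stacking axis can be served by reading only the overlapping rows of each subarray and concatenating them, so you copy the rows you asked for but nothing more:

import numpy as np

def read_rows(magic, start, stop):
    # Hypothetical helper: emulate magic[start:stop] along axis 0 by
    # reading only the rows of each subarray that overlap the range,
    # then concatenating them (a copy, but limited to the request)
    pieces, offset = [], 0
    for ref in magic.references:
        n = magic.file[ref].shape[0]
        lo, hi = max(start - offset, 0), min(stop - offset, n)
        if lo < hi:
            pieces.append(magic.file[ref][lo:hi])
        offset += n
    return np.concatenate(pieces, axis=0)

For example, read_rows(foo, 95, 105) would stitch the last five rows of 'a' to the first five rows of 'b'.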

You might be able to subclass from numpy.ndarray and get all the usual methods as well.

Yossarian