
I think some of my question is answered here:

But the difference that I have is that I'm wondering if it is possible to do the slicing step without having to re-write the datasets to another file first.

Here is the code that reads in a single HDF5 file that is given as an argument to the script:

import h5py

with h5py.File(args.H5file, 'r') as df:
  print('Here are the keys of the input file\n', df.keys())
  #interesting point here: you need the [:] behind each of these; we didn't need it when
  #creating datasets without the 'with' formalism above. Adding it even handled the cases
  #in 'hits' and 'truth_hadrons' where there are additional dimensions...go figure.
  jetdset = df['jets'][:]
  haddset = df['truth_hadrons'][:]
  hitdset = df['hits'][:]

Then later I do some slicing operations on these datasets. Ideally I'd be able to pass a wild-card into args.H5file and then the whole set of files, all with the same data formats, would end up in the three datasets above.

I do not want to store or make persistent these three datasets at the end of the script as the output are plots that use the information in the slices.

Any help would be appreciated!

  • So, basically you want variables `jetdset`, `haddset` and `hitdset` to contain (in a contiguous fashion) all the data from the corresponding datasets stored in multiple HDF5 files? – SOG Jun 23 '22 at 10:32
  • Yes, if possible. I understand that there might be limitations to my system that might make this impossible, but leave that for me to discover later. – physicscitizen Jun 23 '22 at 10:48
  • The linked answer you provided shows how to merge multiple h5 files into a single _**file**_. If I understand, you want to combine data from multiple h5 files into a single _**array**_ (which is similar, but different). I just answered another question on that process. Take a look at my answer here: [Merging HDF5 files for faster data reading](https://stackoverflow.com/a/72734761/10462884) (don't let the Julia stuff distract you!) My answer reads the entire dataset into arrays, but you can use slice notation if you only want some of the data. – kcw78 Jun 23 '22 at 18:20
  • Thanks for this response! I will take a look and let you/people know if I have questions! I really appreciate the efforts you guys take to answer questions. – physicscitizen Jun 24 '22 at 14:10

1 Answer


There are at least 2 ways to access multiple files:

  1. If all files follow a naming pattern, you can use the glob module. It uses wildcards to find files. (Note: I prefer glob.iglob; it is an iterator that yields values without creating a list. glob.glob creates a list which you frequently don't need.)
  2. Alternatively, you could input a list of filenames and loop on the list.

Example of iglob:

import glob
import h5py

for fname in glob.iglob('img_data_0?.h5'):
    with h5py.File(fname, 'r') as h5f:
        print('Here are the keys of the input file\n', h5f.keys())

Example with a list of names:

filenames = [ 'img_data_01.h5', 'img_data_02.h5', 'img_data_03.h5' ]
for fname in filenames: 
    with h5py.File(fname, 'r') as h5f:
        print('Here are the keys of the input file\n', h5f.keys())
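To tie this back to the question's `args.H5file`: one way (a sketch; the argument name and the `nargs='+'` choice are assumptions, not from the original script) is to let argparse accept one or more filenames, so that the shell's wildcard expansion supplies the list:

```python
import argparse

# Sketch: 'H5files' with nargs='+' is an assumption, not the original script's
# argument. Run as `python script.py img_data_0?.h5` and the shell expands
# the wildcard into a list of filenames before argparse sees it.
parser = argparse.ArgumentParser()
parser.add_argument('H5files', nargs='+', help='one or more HDF5 files')

# Simulated command line for illustration; in the real script use parser.parse_args().
args = parser.parse_args(['img_data_01.h5', 'img_data_02.h5'])

for fname in args.H5files:
    print(fname)  # in the real script: with h5py.File(fname, 'r') as h5f: ...
```

If you prefer to pass the pattern itself (quoted, so the shell doesn't expand it), you can hand it to `glob.iglob()` inside the script instead.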

Next, your code mentions using [:] when you access a dataset. Whether or not you need to add indices depends on the object you want returned.

  • If you include [()], it returns the entire dataset as a numpy array. Note [()] is now preferred over [:]. You can use any valid slice notation, e.g., [0,0,:] for a slice of a 3-axis array.
  • If you don't include an index, it returns an h5py dataset object, which behaves like a numpy array. (For example, you can get dtype and shape, and slice the data.) The advantage? It has a smaller memory footprint. I use h5py dataset objects unless I specifically need an array (for example, when passing image data to another package).

Examples of each method:

        jets_dset = h5f['jets']     # w/out [()] returns an h5py dataset object
        jets_arr  = h5f['jets'][()] # with [()] returns a numpy array object

Finally, if you want to create a single array that merges the values of one dataset from all 3 files, you have to create an array big enough to hold all the data, then load it with slice notation. Alternatively, you can use np.concatenate(). (Be careful, however: concatenating a lot of data can be slow.)

Simple examples of both methods are shown below. They assume you know the shape of the dataset and that it is the same in all 3 files (a0 and a1 are the axis lengths for one dataset). If you don't know the shape, you can get it from the dataset's .shape attribute.
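For instance (a sketch that writes a small throwaway file so it is self-contained; in practice you would open one of your real files), reading .shape touches only metadata, so no dataset values are loaded:

```python
import h5py
import numpy as np

# Create a small demo file; the filename and 100x100 shape are illustrative.
with h5py.File('shape_demo.h5', 'w') as h5f:
    h5f.create_dataset('jets', data=np.zeros((100, 100)))

with h5py.File('shape_demo.h5', 'r') as h5f:
    a0, a1 = h5f['jets'].shape  # reads metadata only; no values are loaded
    print(a0, a1)               # → 100 100
```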

Example for method 1 (pre-allocating array jets3x_arr):

import glob
import h5py
import numpy as np

a0, a1 = 100, 100
jets3x_arr = np.empty(shape=(a0, a1, 3))  # add dtype= if not float

for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
    with h5py.File(fname, 'r') as h5f:
        jets3x_arr[:, :, cnt] = h5f['jets']

Example for method 2 (using np.concatenate()):

import glob
import h5py
import numpy as np

a0, a1 = 100, 100

for cnt, fname in enumerate(glob.iglob('img_data_0?.h5')):
    with h5py.File(fname, 'r') as h5f:
        if cnt == 0:
            jets3x_arr = h5f['jets'][()].reshape(a0, a1, 1)
        else:
            jets3x_arr = np.concatenate(
                (jets3x_arr, h5f['jets'][()].reshape(a0, a1, 1)), axis=2)
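As a quick sanity check (a sketch using dummy in-memory arrays in place of the 'jets' datasets from 3 files), both methods produce the same merged (a0, a1, 3) array:

```python
import numpy as np

a0, a1, nfiles = 4, 5, 3
# Dummy stand-ins for the 'jets' dataset from each of 3 files.
chunks = [np.full((a0, a1), i, dtype=float) for i in range(nfiles)]

# Method 1: pre-allocate, then fill one slice per file.
merged1 = np.empty((a0, a1, nfiles))
for cnt, chunk in enumerate(chunks):
    merged1[:, :, cnt] = chunk

# Method 2: grow by concatenation (copies the accumulated array each time).
merged2 = chunks[0].reshape(a0, a1, 1)
for chunk in chunks[1:]:
    merged2 = np.concatenate((merged2, chunk.reshape(a0, a1, 1)), axis=2)

print(np.array_equal(merged1, merged2))  # → True
```

The repeated copying in method 2 is why pre-allocation is usually preferred when you know the final shape up front.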
kcw78
  • Just a brief comment about the h5py dataset object. I actually tried exactly what you suggested at first, basically: `jets_dset = h5f['jets'] # w/out [()] returns an h5py dataset object`. But this caused an error when I ran it using the "with ... as df" method of opening the original dataset. Still, thanks for this really good explanation of what is going on! – physicscitizen Jun 24 '22 at 14:14
  • You can absolutely create h5py dataset objects with Python's file context manager (eg `with ... as df:`). I do it all the time. `with...as` uses the same file open/close functions. Most likely it is a downstream error with the dataset object. They usually '_behave like_' numpy arrays, but sometimes you need an array, not an object. Examples: Image libraries (PIL and cv2) can only convert **an array** to an image. Also, you can't reshape datasets, only arrays. You can see that in example2 with `np.concatenate()`. I tested Example 1. It works as-is, w/out `[()]`. – kcw78 Jun 24 '22 at 14:55