A more efficient way of storing a large number of RGB images in an HDF5 file

Question

I have a few hundred large images (more than 100K x 200K pixels each). I am dividing each image into 256 x 256 patches and storing them all in an HDF5 file with the following structure:

Here is my code to recreate this HDF5 structure:

import os

import h5py
import numpy as np

def save_to_hdf5(slide_name, patches, coords, labels, db_name, db_location):
    # The patient and slide identifiers are parsed from the underscore-separated file name
    base_name = os.path.basename(slide_name)
    fields = base_name.split('.')[0].split('_')
    patient_index = "_".join(fields[:2])
    slide_index = fields[3]  # was "_".join(fields[3]), which puts underscores between the characters
    slide_label = labels[base_name]

    with h5py.File(db_location + f'training{db_name}.h5', 'a') as hf:
        grp = hf.require_group(patient_index)
        subgrp = grp.require_group('wsi_{}'.format(slide_index))
        # One group per patch: gzip-compressed image, slide label, and patch coordinates
        for i, patch in enumerate(patches):
            subsubgrp = subgrp.require_group('patch_{}'.format(i))
            subsubgrp.create_dataset('image', np.shape(patch), data=patch, compression="gzip", compression_opts=7)#, chunks=True)
            subsubgrp.create_dataset('label', np.shape(slide_label), data=slide_label)
            subsubgrp.attrs["patch_coords"] = (coords[i][0], coords[i][1])

For some large images, the resulting HDF5 file is even larger than the original image itself. Am I doing something wrong in the group and dataset creation steps of my code?
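
Every HDF5 group and dataset carries its own metadata, so one group with two datasets per 256 x 256 patch adds overhead that grows with the number of patches. Whether that is the main cause of the file size here is not established. As a minimal sketch for comparison, the patches of one slide could be stacked into a single chunked dataset, with the label stored once per slide. The function name `save_patches_stacked`, the dataset names `images` and `coords`, and the compression level are assumptions for illustration, not part of the original code, and the sketch assumes all patches of a slide have the same shape and fit in memory.

```python
import os
import tempfile

import h5py
import numpy as np


def save_patches_stacked(h5_path, group_name, patches, coords, slide_label):
    """Store all patches of one slide in a single chunked, compressed dataset.

    Hypothetical alternative layout: one 'images' dataset of shape
    (N, H, W, C) and one 'coords' dataset of shape (N, 2) per slide,
    instead of one group per patch.
    """
    images = np.stack(patches)        # requires equally shaped patches
    coords = np.asarray(coords)       # one (row, col) pair per patch
    with h5py.File(h5_path, "a") as hf:
        grp = hf.require_group(group_name)
        grp.create_dataset(
            "images",
            data=images,
            chunks=(1,) + images.shape[1:],  # one patch per chunk, so a single patch can be read alone
            compression="gzip",
            compression_opts=4,              # assumed level; the original code uses 7
        )
        grp.create_dataset("coords", data=coords)
        grp.attrs["label"] = slide_label     # label stored once per slide
    return images.shape
```

With this layout, patch `i` would be read as `hf[group_name]["images"][i]`. Comparing the resulting file size against the per-patch layout on one slide would show how much of the growth comes from the group structure and how much from the pixel data itself.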

It's not a surprise that the file is larger, is it? You're just adding overhead to the image data itself. Let's be honest, there is no efficient way to store 60GB images, especially when you have a hundred of them (6TB). Why wouldn't you just keep them as image files? What do you gain by making an HDF5? — Tim Roberts, Sep 22 '21 at 18:52
The images are passed as `patches` (as a list of arrays), right? How did you convert the images to arrays? If you really want to store the images in HDF5 format you should investigate external links to avoid immense files. This way, you create multiple HDF5 files (1 for each patient), plus a "master file" with links to the files. Then you work with the "master file" as if it has all the data, but don't have 1 very large file. See Method 1 here for an example - [How can I combine multiple .h5 file?](https://stackoverflow.com/a/58223603/10462884) — kcw78, Sep 22 '21 at 20:29
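
The external-link approach suggested in the comment above can be sketched roughly as follows, assuming one HDF5 file per patient already exists. `h5py.ExternalLink` is part of h5py; the function name `build_master_file` and the link names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def build_master_file(master_path, patient_files):
    """Create a master file whose top-level entries link to per-patient files.

    patient_files maps a link name (hypothetical, e.g. a patient ID) to the
    path of an existing HDF5 file.
    """
    with h5py.File(master_path, "w") as master:
        for link_name, file_path in patient_files.items():
            # The link points at the root group of the per-patient file;
            # no image data is copied into the master file.
            master[link_name] = h5py.ExternalLink(file_path, "/")
```

Opening the master file and indexing `master["<link name>/..."]` then resolves into the linked file, so the per-patient files must stay reachable at the stored paths.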
