Using the maxshape parameter allows you to resize the dataset later. Note that maxshape needs to match the number of dimensions of your image dataset. You entered 1 dimension, but you need 3 for all of the image data (1000, 2048, 2048). Also, the initial dataset shape in your code is taken from the data=img array, so it will be (2048, 2048). The dataset needs a third dimension to hold all of the images.
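For example, a dataset sized for all of your image data would be created with a 3-D shape and a matching 3-D maxshape. This is a minimal sketch; the file name, dataset name, and dtype are placeholders, so adjust them to match your data:

import h5py

with h5py.File('images.h5', 'w') as f:
    # 3 dimensions: the image counter first, then the 2 image axes
    img_ds = f.create_dataset('/array', shape=(1000, 2048, 2048),
                              maxshape=(None, 2048, 2048),
                              chunks=True, dtype='uint8')  # dtype is an assumption; match your image data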
There are 3 approaches to load all your image data:
1. Set shape=(nfiles,a1,a2) to initially size the dataset for all images. There is no need to resize unless you want to add more images later.
2. Initially set shape=(1,a1,a2) (for 1 image), then use .resize() to increase the size as you add images. This method is not very efficient as your datasets grow.
3. Initially set shape=(N,a1,a2) (for N images), then use .resize() to increase the size by N when the dataset is full. (N can be any number. I used 10 in the example below, but you might use 100 or 1000 in a real-world application.)
All 3 methods are in the example below for 30 images with a smaller image size. I create random integer data for the images. Replace np.random.randint() with np.array(Image.open(files[i])) for your files.
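For example, you could build the files list like this (a sketch; the glob pattern is an assumption, so adjust it to match your file names):

import glob
from PIL import Image
import numpy as np

files = sorted(glob.glob('*.tif'))   # assumed pattern; use your own
nfiles = len(files)
# then, inside the loop, replace the random data with:
# img_arr = np.array(Image.open(files[i]))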
The examples demonstrate the process. Note that Methods 1 and 2 only work when you create the HDF5 file and populate the image data in one pass (because the dataset index is the same as the image counter). Method 3 shows how to add data incrementally. It uses an attribute that counts the number of images loaded. The counter sets the position where the next image is written, and it is also used to check the current dataset size (and resize it as needed).
In production code you need additional checks that the image size and shape match the dataset size and shape.
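A minimal sketch of such a check (assuming img_ds and img_arr as defined in the examples below) could look like this:

# verify the new image matches the dataset's image axes before writing
if img_arr.shape != img_ds.shape[1:]:
    raise ValueError(f'image shape {img_arr.shape} does not match '
                     f'dataset image shape {img_ds.shape[1:]}')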
import h5py
import numpy as np

nfiles = 30
a0 = nfiles          # number of images
a1 = 256; a2 = 256   # image size

# Method 1: size the dataset for all images up front
with h5py.File('input_images1.h5', 'w') as f:
    for i in range(nfiles):
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if i == 0:
            img_ds = f.create_dataset('/array', shape=(a0, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True)
        f['/array'][i, :, :] = img_arr
        print(i)
# Method 2: start with room for 1 image and resize for each new image
with h5py.File('input_images2.h5', 'w') as f:
    for i in range(nfiles):
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if i == 0:
            img_ds = f.create_dataset('/array', shape=(1, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True)
        else:
            f['/array'].resize(i + 1, axis=0)
        f['/array'][i, :, :] = img_arr
        print(i)
# Method 3: allocate in blocks of 10 and track a counter attribute
with h5py.File('input_images3.h5', 'a') as f:
    for i in range(nfiles):
        img_arr = np.random.randint(0, 254, (a1, a2), int)
        if 'array' not in f.keys():
            img_ds = f.create_dataset('/array', shape=(10, a1, a2),
                                      maxshape=(None, a1, a2), chunks=True)
            img_ds.attrs['n_images'] = 0
        else:
            img_ds = f['/array']
        n_images = img_ds.attrs['n_images']
        if n_images == img_ds.shape[0]:
            print('adding 10 rows to /array')
            img_ds.resize(img_ds.shape[0] + 10, axis=0)
        img_ds[n_images, :, :] = img_arr
        img_ds.attrs['n_images'] = n_images + 1
        print(img_ds.attrs['n_images'])
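To confirm what Method 3 wrote, you can reopen the file in read mode and compare the counter attribute with the allocated dataset size (a small sketch using the file created above):

import h5py

with h5py.File('input_images3.h5', 'r') as f:
    img_ds = f['/array']
    print('images loaded :', img_ds.attrs['n_images'])  # 30 for this example
    print('allocated rows:', img_ds.shape[0])           # grows in blocks of 10
    first_img = img_ds[0, :, :]                         # read one image back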