
I'm converting image files to hdf5 files as follows:

import h5py
import io
import os
import cv2
import numpy as np
from PIL import Image

def convertJpgtoH5(input_dir, filename, output_dir):
    filepath = input_dir + '/' + filename
    print('image size: %d bytes' % os.path.getsize(filepath))
    with open(filepath, 'rb') as img_f:
        binary_data = img_f.read()
    binary_data_np = np.asarray(binary_data)
    new_filepath = output_dir + '/' + os.path.splitext(filename)[0] + '.hdf5'
    with h5py.File(new_filepath, 'w') as f:
        f.create_dataset('image', data=binary_data_np)
    print('hdf5 file size: %d bytes'%os.path.getsize(new_filepath))

pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5/files'
ext = [".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]

for img in os.listdir(pathImg):
    if img.endswith(tuple(ext)):
        convertJpgtoH5(pathImg, img, pathH5)

I later read these hdf5 files as follows:

for hf in os.listdir(pathH5):
    if hf.endswith(".hdf5"):
        hf = h5py.File(f"{pathH5}/{hf}", "r")
        key = list(hf.keys())[0]
        data = np.array(hf[key]) 
        img = Image.open(io.BytesIO(data))
        image = cv2.cvtColor(np.float32(img), cv2.COLOR_BGR2RGB)
        hf.close()

Is there a more efficient way to read the hdf5 files, rather than converting to a numpy array and opening with Pillow before using with OpenCV?

  • What do you mean by *"efficient"*? Do you want to minimise the disk space required? Or the time to read the file? Or reduce the number of library dependencies? – Mark Setchell Apr 20 '21 at 16:54
  • If I am following your code correctly, you are creating 1 HDF5 file for each image, right? If so, you will find with HDF5 that # of write calls is more important than the size of the data written. So, it will be faster to read all of the images, convert to a numpy array, add each to a larger array (sized to hold all images), then write the array to HDF5 as a single dataset once all images have been read and converted. 2 examples: 1) [Simple](https://stackoverflow.com/a/66823010/10462884) and 2) [Detailed](https://stackoverflow.com/a/66641176/10462884) – kcw78 Apr 20 '21 at 17:05
  • Also, why are you using Pillow and OpenCV? Either is sufficient. You don't need both. – kcw78 Apr 20 '21 at 17:07
  • Have you checked your code to read the H5 files? I get an error on `hf = h5py.File(f"data/{hf}", "r")`. It should be: `hf = h5py.File(f"{pathH5}/{hf}", "r")` – kcw78 Apr 20 '21 at 17:52
  • @MarkSetchell the most important factor for me is the read time. – Matthew-Sharp Apr 26 '21 at 14:10
  • @kcw78 yes I'm creating 1 HDF5 file for each image. The reason I do this is to mimic the way in which my end user has saved their files. I'll suggest to them to save multiple images in one file. I've updated my comment to correct the error you have identified with the group name. – Matthew-Sharp Apr 26 '21 at 14:11
  • @kcw78 The reason I chose to use Pillow to write the files as binary data is because this takes up much less space than using OpenCV. with Pillow: image size: 94229 bytes hdf5 file size: 2809277 bytes with OpenCV: image size: 94229 bytes hdf5 file size: 96277 bytes – Matthew-Sharp Apr 26 '21 at 14:45
  • The HDF5 files I get when reading .ppm files with PIL and OpenCV are approximately the same size as the image file. PPM file size: 23509 bytes; HDF5 from PIL: 25557 bytes; HDF5 file from OpenCV: 25544 bytes. (My files are smaller than yours. Not sure if that makes a difference.) It sounds like you have locked in your process. If you are happy with performance, go with it. If not, what alternatives you are willing to consider? – kcw78 Apr 26 '21 at 19:24

1 Answer


Ideally this should be closed as a duplicate because most of what you want to do is explained in the answers I referenced in my comments above. I am including those links here:

  • Simple example: https://stackoverflow.com/a/66823010/10462884
  • Detailed example: https://stackoverflow.com/a/66641176/10462884

There is one difference: my examples load all the image data into 1 HDF5 file, and you are creating 1 HDF5 file for each image. Frankly, I don't think there is much value in doing that. You wind up with twice as many files and nothing is gained. If you are still interested in doing that, here are 2 more answers that might help (and I updated your code at the end):

In the interest of addressing your specific question, I modified your code to use cv2 only (no need for PIL). I resized the images and saved them as 1 dataset in 1 file. If you are using the images for training and testing a CNN model, you need to do this anyway (it needs arrays of consistent size/shape). Also, you can save the data as uint8 -- no need for floats. See below.

import h5py
import glob
import os
import cv2
import numpy as np

def convertImagetoH5(imgfilename):
    print('image size: %d bytes' % os.path.getsize(imgfilename))
    img = cv2.imread(imgfilename)                 # OpenCV loads as BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # convert to RGB
    img_resize = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))
    return img_resize


pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5file'
ext_list = [".ppm", ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]
IMG_WIDTH = 120
IMG_HEIGHT = 120

#get list of all images and number of images
all_images = []
for ext in ext_list:
    all_images.extend(glob.glob(pathImg+"/*"+ext, recursive=True))
n_images = len(all_images)

ds_img_arr = np.zeros((n_images, IMG_HEIGHT, IMG_WIDTH, 3), dtype=np.uint8)

for cnt, img in enumerate(all_images):
    img_arr = convertImagetoH5(img)
    ds_img_arr[cnt] = img_arr
    
h5_filepath = pathH5 + '/all_image_data.hdf5'
with h5py.File(h5_filepath, 'w') as h5f:
    dset = h5f.create_dataset('images', data=ds_img_arr)

print('hdf5 file size: %d bytes'%os.path.getsize(h5_filepath))

with h5py.File(h5_filepath, "r") as h5r:
    key = list(h5r.keys())[0]
    print (key, h5r[key].shape, h5r[key].dtype)
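One advantage of the single-dataset layout when read time matters: h5py only reads the slice you request from disk, so you can pull out one image without loading the rest. A minimal sketch, assuming the all_image_data.hdf5 file created above:

with h5py.File(h5_filepath, "r") as h5r:
    # Reads only the i-th image from disk, not the whole dataset
    img_0 = h5r['images'][0]   # shape (IMG_HEIGHT, IMG_WIDTH, 3), dtype uint8
    print(img_0.shape, img_0.dtype)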

If you really want 1 HDF5 for each image, the code from your question is updated below. Again, only cv2 is used -- no need for PIL. Images are not resized. This is for completeness only (to demonstrate the process). It's not how you should manage your image data.

import h5py
import os
import cv2
import numpy as np

def convertImagetoH5(input_dir, filename, output_dir):
    filepath = input_dir + '/' + filename
    print('image size: %d bytes' % os.path.getsize(filepath))
    img = cv2.imread(filepath)                    # OpenCV loads as BGR
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)    # convert to RGB
    new_filepath = output_dir + '/' + os.path.splitext(filename)[0] + '.hdf5'
    with h5py.File(new_filepath, 'w') as h5f:
        h5f.create_dataset('image', data=img)
    print('hdf5 file size: %d bytes' % os.path.getsize(new_filepath))

pathImg = '/path/to/images'
pathH5 = '/path/to/hdf5file'
ext = [".ppm", ".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".tif"]

# Loop thru image files and create a matching HDF5 file
for img in os.listdir(pathImg):
    if img.endswith(tuple(ext)):
        convertImagetoH5(pathImg, img, pathH5)

# Loop thru HDF5 files and read image dataset (as an array)
for h5name in os.listdir(pathH5):
    if h5name.endswith(".hdf5"):
        with h5py.File(f"{pathH5}/{h5name}", "r") as h5f:
            key = list(h5f.keys())[0]
            image = h5f[key][:]
            print(f'{h5name}: {image.shape}, {image.dtype}')
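If you do stick with your original approach of storing the raw, compressed image bytes (which keeps each HDF5 file close to the source image size), you can still drop PIL: cv2.imdecode decodes a byte buffer directly. A minimal sketch, assuming a file written by the byte-blob convertJpgtoH5() in your question (the filename below is just a placeholder):

import h5py
import cv2
import numpy as np

# Hypothetical file written by the question's convertJpgtoH5()
with h5py.File('/path/to/hdf5/files/some_image.hdf5', 'r') as h5f:
    raw = h5f['image'][()]                      # the stored JPEG/PNG bytes
    buf = np.frombuffer(raw, dtype=np.uint8)    # 1-D uint8 buffer for OpenCV
    img = cv2.imdecode(buf, cv2.IMREAD_COLOR)   # decoded BGR array
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)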
– kcw78