My data is stored in .h5 format. I use a data generator to fit the model, and training is extremely slow. A snippet of my code is provided below.

def open_data_file(filename, readwrite="r"):
    return tables.open_file(filename, readwrite)

data_file_opened = open_data_file(os.path.abspath("../data/data.h5"))

train_generator, validation_generator, n_train_steps, n_validation_steps = get_training_and_validation_generators(
        data_file_opened,
        ......)

where:

def get_training_and_validation_generators(data_file, batch_size, ...):
    training_generator = data_generator(data_file, training_list,....)

data_generator function is as follows:

def data_generator(data_file, index_list,....):
    orig_index_list = index_list
    while True:
        x_list = list()
        y_list = list()
        if patch_shape:
            index_list = create_patch_index_list(orig_index_list, data_file, patch_shape,
                                                 patch_overlap, patch_start_offset, pred_specific=pred_specific)
        else:
            index_list = copy.copy(orig_index_list)

        while len(index_list) > 0:
            index = index_list.pop()
            add_data(x_list, y_list, data_file, index, augment=augment, augment_flip=augment_flip,
                     augment_distortion_factor=augment_distortion_factor, patch_shape=patch_shape,
                     skip_blank=skip_blank, permute=permute)
            if len(x_list) == batch_size or (len(index_list) == 0 and len(x_list) > 0):
                yield convert_data(x_list, y_list, n_labels=n_labels, labels=labels, num_model=num_model, overlap_label=overlap_label)
                x_list = list()
                y_list = list()

add_data() is as follows:

def add_data(x_list, y_list, data_file, index, augment=False, augment_flip=False, augment_distortion_factor=0.25,
             patch_shape=False, skip_blank=True, permute=False):
    '''
    add qualified x,y to the generator list
    '''
#     pdb.set_trace()
    data, truth = get_data_from_file(data_file, index, patch_shape=patch_shape)
    
    if np.sum(truth) == 0:
        return
    if augment:
        affine = np.load('affine.npy')
        data, truth = augment_data(data, truth, affine, flip=augment_flip, scale_deviation=augment_distortion_factor)

    if permute:
        if data.shape[-3] != data.shape[-2] or data.shape[-2] != data.shape[-1]:
            raise ValueError("To utilize permutations, data array must be in 3D cube shape with all dimensions having "
                             "the same length.")
        data, truth = random_permutation_x_y(data, truth[np.newaxis])
    else:
        truth = truth[np.newaxis]

    if not skip_blank or np.any(truth != 0):
        x_list.append(data)
        y_list.append(truth)

Model training:

def train_model(model, model_file,....):
    model.fit(training_generator,
              steps_per_epoch=steps_per_epoch,
              epochs=n_epochs,
              verbose=2,
              validation_data=validation_generator,
              validation_steps=validation_steps)

My dataset is large: data.h5 is 55GB. It takes around 7000s to complete one epoch, and I get a segmentation fault after about 6 epochs. The batch size is set to 1, because otherwise I get a resource exhausted error. Is there an efficient way to read data.h5 in the generator so that training is faster and doesn't lead to out-of-memory errors?

Dushi Fdz
  • How large is the .h5 file? – Tim Roberts Aug 09 '21 at 01:46
  • Looks like you are using pytables, not h5py. – hpaulj Aug 09 '21 at 01:47
  • Dataset size is 55GB. Data is stored in .h5 format as data.h5. I use pytables to open the file. – Dushi Fdz Aug 09 '21 at 02:12
  • How many times do you read data from the .h5 file in 1 epoch? (how many calls to read functions?) Speed decreases with number of I/O operations. Also, are you using fancy indexing? That is slower than simple slices. – kcw78 Aug 09 '21 at 23:38
  • @kcw78 Number of training steps in each epoch is 2268. My batch size is 1. If I increase batch size I get a resource exhausted error. Even with a batch size of 1, I get a segmentation fault in about 6 epochs. I am not using any fancy indexing. My data generator function is provided above. – Dushi Fdz Aug 09 '21 at 23:54
  • Except for the `open_data_file()` function, I don't see any `tables` code in your post. (Is it in the `add_data()` function?) Performance bottlenecks are hard to identify and resolve without seeing the code and understanding the .h5 file schema. If you don't want to share that info, you need to write code to mimic how `add_data()` reads your .h5 file. Then you can test file read performance to determine if that is the cause of performance and stability problems. – kcw78 Aug 11 '21 at 18:57
  • I edited the question with `add_data()` function. I have used tables when creating data as data.h5. – Dushi Fdz Aug 11 '21 at 19:04
  • Ok I think I got it. `data_generator()` loops on `while len(index_list) > 0:`, calling `add_data()` which calls `get_data_from_file()`. I assume this function calls the `tables` functions to read your .h5 data. How big is `index_list`? This is the # of times you access the file in each epoch. Multiply `len(index_list)` X epochs (2268) to get the total for an epoch. That could be a very big number, which would explain why your process is so slow. To improve performance, you need to reduce # of read calls by reading more data at one time. – kcw78 Aug 11 '21 at 22:09
  • The length of the index list is 3325. The number of training steps in each epoch is 2268. Can you please tell me what needs to be changed to read more data at one time? If I increase the batch size I get a resource exhausted error. – Dushi Fdz Aug 11 '21 at 23:52
  • The goal is to reduce the number of times you call a `tables` function to read data. It's hard to give specific advice without source for `get_data_from_file()`. What are you reading? Image data? Are you reading 1 image at a time? If so, you need to refactor your code to read all desired images for 1 epoch in 1 call. There are similar questions on SO. Read the comments in these for more ideas: https://stackoverflow.com/a/67655331/10462884 and https://stackoverflow.com/a/66681133/10462884 – kcw78 Aug 12 '21 at 14:10
  • Thanks for the links. I am reading image data. This is the repository I am following to generate data: https://github.com/woodywff/brats_2019/blob/60dc83169e29888983d3baf6ef23e6a1bb43a9ec/unet3d/generator.py – Dushi Fdz Aug 12 '21 at 15:18
  • As others have pointed out above, the core issue is likely inefficient data access patterns. HDF5 supports compression. Is your data file highly compressed? That could be one of many factors contributing to slow I/O. Also, on hardware: spinning disk or SSD? RAM capacity? If your RAM capacity is large (like 256 GB) and the uncompressed images are ~60GB, consider loading the entire input into memory for fast access. If it's still slow, then the data structures / algorithms are inefficient, or the images are too big for what the code was written for; perhaps downsample. – Salmonstrikes Aug 15 '21 at 12:08
  • @Salmonstrikes Yes, my data is highly compressed (compression level is 5 on a scale of 0-9). Should I reduce or increase the compression level? My RAM has 32 GB of memory, my GPU 10 GB, and my data is 55 GB (stored as data.h5). Data is created as: ```data_storage = hdf5_file.create_earray(hdf5_file.root, 'data', tables.Float32Atom(), shape=data_shape, filters=filters, expectedrows=n_samples)``` – Dushi Fdz Aug 15 '21 at 16:52
  • @Salmonstrikes makes a good point about compression - it slows I/O. Sometimes it can be significant (especially at higher compression levels - I only use level=1). It's easy enough to uncompress the file and compare performance. PyTables has a `ptrepack` utility that can do this. This is how to uncompress your data file to a new file: `ptrepack --complevel 0 data.h5 data_unc.h5`. Change the name of the data file in your code to `data_unc.h5` – kcw78 Aug 15 '21 at 17:15

1 Answer

This is the start of my answer. I looked at your code, and you have a lot of calls to read the .h5 data. By my count, the generator makes 6 read calls for every loop on training_list and validation_list. So, that's almost 20k calls on ONE training loop. It's not clear (to me) if the generators are called on every training loop. If they are, multiply by 2268 loops.

Efficiency of HDF5 file reads depends on the number of calls to read the data (not just the amount of data). In other words, it is faster to read 1GB of data in a single call than it is to read the same data with 1000 calls of 1MB each. So the first thing we need to determine is the amount of time spent reading data from the HDF5 file (to be compared to your 7000s).
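
As a self-contained illustration of that point, here is a small timing sketch that builds a throwaway PyTables file (the shape and node name are made up for the demo) and compares one large slice read against 1000 single-row reads:

import time

import numpy as np
import tables

# Build a small throwaway file so the comparison is self-contained.
with tables.open_file("timing_demo.h5", mode="w") as f:
    arr = f.create_earray(f.root, "data", tables.Float32Atom(),
                          shape=(0, 256, 256), expectedrows=1000)
    arr.append(np.random.rand(1000, 256, 256).astype(np.float32))

with tables.open_file("timing_demo.h5", mode="r") as f:
    node = f.root.data

    start = time.time()
    block = node[0:1000]                    # one large read call
    t_one_call = time.time() - start

    start = time.time()
    rows = [node[i] for i in range(1000)]   # 1000 small read calls
    t_many_calls = time.time() - start

print(f"one call: {t_one_call:.3f}s, 1000 calls: {t_many_calls:.3f}s")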

I isolated the PyTables calls that read the data file. From that, I built a simple program that mimics the behavior of your generator function. Currently it makes a single training loop over the entire sample list. Increase the n_train and n_epochs values if you want to run a longer test. (Note: the code syntax is correct, but without the file I can't verify the logic. I think it's correct, but you may have to fix small errors.)

See code below. It should run standalone (all dependencies are imported). It prints basic timing data. Run it to benchmark your generator.

import tables as tb
import numpy as np
from random import shuffle 
import time

with tb.open_file('../data/data.h5', 'r') as data_file:

    n_train = 1
    n_epochs = 1
    loops = n_train*n_epochs
    
    for e_cnt in range(loops):  
        nb_samples = data_file.root.truth.shape[0]
        sample_list = list(range(nb_samples))
        shuffle(sample_list)
        split = 0.80
        n_training = int(len(sample_list) * split)
        training_list = sample_list[:n_training]
        validation_list = sample_list[n_training:]
        
        start = time.time()
        for index_list in [ training_list, validation_list ]:
            shuffle(index_list)
            x_list = list()
            y_list = list()
            
            while len(index_list) > 0:
                index = index_list.pop() 
                
                brain_width = data_file.root.brain_width[index]
                x = np.array([modality_img[index,0,
                                           brain_width[0,0]:brain_width[1,0]+1,
                                           brain_width[0,1]:brain_width[1,1]+1,
                                           brain_width[0,2]:brain_width[1,2]+1] 
                              for modality_img in [data_file.root.t1,
                                                   data_file.root.t1ce,
                                                   data_file.root.flair,
                                                   data_file.root.t2]])
                y = data_file.root.truth[index, 0,
                                         brain_width[0,0]:brain_width[1,0]+1,
                                         brain_width[0,1]:brain_width[1,1]+1,
                                         brain_width[0,2]:brain_width[1,2]+1]    
                
                x_list.append(x)
                y_list.append(y)
    
        print(f'For loop:{e_cnt}')
        print(f'Time to read all data={time.time()-start:.2f}')
kcw78
  • Thanks a lot for the detailed answer. I will check it and see if I get any errors. Can you please explain a bit about setting `n_train = 1` and `n_epochs = 1`? When you said 'it makes a single training loop on the entire sample list', does it mean the data is read only once? If I train (model.fit) for 10 epochs, I don't have to change `n_epochs` here, do I? – Dushi Fdz Aug 16 '21 at 15:48
  • Correct. With `n_train = 1` and `n_epochs = 1`, the entire sample list is only read once. That will give you a feel for the time to read the data. I did it that way because I'm not sure when the generators are called. I don't think the generators are called for epoch loops. I'm not sure about training loops. Also, you can compare the time to read a compressed vs uncompressed file. – kcw78 Aug 16 '21 at 18:35
  • One more question, please. If the batch size is greater than the GPU memory (10GB), does it go into the CPU? In that case, can a segmentation fault occur? My data file size is 55GB. Apart from the issue with slow training, after about 6 epochs I get a segmentation fault. I am not sure if it's related to the memory shortage. – Dushi Fdz Aug 17 '21 at 00:36
  • How long does it take to read your data for 1 loop? If it's "fast enough" your problems are somewhere else. Your question goes beyond my knowledge of algorithms and memory usage. I'm 99% sure PyTables uses CPU (system) RAM (only). Segmentation fault at 6 epochs sounds like a memory problem in TF. I know it can use GPU memory, but don't know how to control GPU vs CPU memory usage. Here is an interesting SO question from 2018: https://stackoverflow.com/q/51343169/10462884. For more related questions/answers, search questions tagged `[tensorflow] [gpu]`. Good luck. – kcw78 Aug 17 '21 at 14:19
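
On the GPU-memory side of that last exchange, a minimal sketch (assuming TensorFlow 2.x is the backend) is to enable memory growth, so TF allocates GPU memory on demand instead of reserving the whole card up front. This may reduce out-of-memory aborts, but it is not a guaranteed fix for the segmentation fault.

import tensorflow as tf

# Allocate GPU memory on demand instead of pre-allocating it all.
# This must run before any tensors or models are placed on the GPU.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)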