
I am testing ways of saving and retrieving data efficiently using h5py, but I am having trouble with running time while trying not to use up all my memory.

In my first method I simply create a static h5py file:

with h5py.File(fileName, 'w') as f:
    f.create_dataset('data_X', data=X, dtype='float32')
    f.create_dataset('data_y', data=y, dtype='float32')

In the second method, I set the maxshape parameter so that I can append more training data in the future (see How to append data to one specific dataset in a hdf5 file with h5py); a sketch of that later append is shown after the snippet.

with h5py.File(fileName2, 'w') as f:
    f.create_dataset('data_X', data=X, dtype='float32', maxshape=(None, 4919))
    f.create_dataset('data_y', data=y, dtype='float32', maxshape=(None, 6))
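
For completeness, appending to the second file later would look roughly like this (a sketch of how I extend the datasets; `X_new` and `y_new` stand for the new block of rows, and this is never called in the timings reported below):

with h5py.File(fileName2, 'a') as f:
    # grow each dataset along axis 0, then write the new rows into the extended region
    f['data_X'].resize(f['data_X'].shape[0] + X_new.shape[0], axis=0)
    f['data_X'][-X_new.shape[0]:] = X_new
    f['data_y'].resize(f['data_y'].shape[0] + y_new.shape[0], axis=0)
    f['data_y'][-y_new.shape[0]:] = y_new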

I am using PyTorch and set up my data loader as follows:

class H5Dataset_all(torch.utils.data.Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self._h5_gen = None

    def __getitem__(self, index):
        # Lazily create a generator that keeps the HDF5 file open between calls
        if self._h5_gen is None:
            self._h5_gen = self._get_generator()
            next(self._h5_gen)  # prime the generator so it can receive indices
        return self._h5_gen.send(index)

    def _get_generator(self):
        # Holds the file open and yields one (X, y) sample per requested index
        with h5py.File(self.h5_path, 'r') as record:
            index = yield
            while True:
                X = record['data_X'][index]
                y = record['data_y'][index]
                index = yield X, y

    def __len__(self):
        with h5py.File(self.h5_path, 'r') as record:
            return record['data_X'].shape[0]

loader = Data.DataLoader(
        dataset=H5Dataset_all(filename), 
        batch_size=BATCH_SIZE, 
        shuffle=True, num_workers=0)

Having saved the same data with each of these methods, I would expect their running times to be similar, but that is not the case. The data I used has X.shape=(200722, 4919) and y.shape=(200722, 6), and the files are about 3.6 GB each. I test the running time using:

import time
t0 = time.time()
for i, (X_batch, y_batch) in enumerate(loader):
    # assign a dummy value
    a = 0 
t1 = time.time()-t0
print(f'time: {t1}')

For the first method the running time is 83 s and for the second it is 1216 s, which doesn't make sense to me. Can anyone help me figure out why?

Additionally, I tried saving/loading the data as a torch file using torch.save and torch.load and passing it to Data.TensorDataset before setting up the loader. This implementation runs significantly faster (about 3.7 s), but it has the disadvantage of having to load the files before training, which could quickly be capped by my memory.
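
A minimal sketch of that alternative, assuming the same X and y arrays and a made-up file name data.pt:

# torch.save / torch.load with a TensorDataset ('data.pt' is just an example name)
torch.save({'X': torch.from_numpy(X), 'y': torch.from_numpy(y)}, 'data.pt')

data = torch.load('data.pt')  # everything is loaded into memory up front
dataset = Data.TensorDataset(data['X'], data['y'])
loader = Data.DataLoader(dataset=dataset, batch_size=BATCH_SIZE,
                         shuffle=True, num_workers=0)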

Is there a better way in which I can train somewhat fast while not having to load all of the data before training?

  • It seems to me you have a fixed dataset size with the first method. (It's the size required to store `data = X` and `data = y`.) Also, how are you extending the size of the datasets? I would expect to see a call to `Dataset.resize()` to increase the size/shape from the original allocation. Take a look at this answer for an example of `Dataset.resize()`: [resizing-and-storing-dataset-using-h5py](https://stackoverflow.com/a/53853908/10462884) – kcw78 Mar 23 '20 at 19:18
  • Yes, the first method has a fixed size, and I'm extending the size of the dataset in the second like: ```python f['data_X'].resize(f['data_X'].shape[0]+X.shape[0], axis=0) ``` and then ```python f['data_X'][-X.shape[0]:] = X ```. But in my reported running times this was never called – fibbi Mar 24 '20 at 08:41
  • How many times do you call the loader? Does the loader write to the same HDF5 file and dataset with each call? If so, in method 1, the loader is simply overwriting the existing data with the new data. You will see this in the file and dataset size -- they won't change after multiple calls to the loader with method 1. – kcw78 Mar 24 '20 at 15:25
  • I'm not sure what you mean. I only call the loader once and use it to iterate over the saved data in *filename*. In method 1 the file is overwritten, and thus the maximum amount of data is capped by how much I can hold in memory at once. The idea in the second method is to be able to continuously add data to the file. The problem is that when each method is only called once (same amount of data), the loader for the second method takes much longer to iterate over than the first. – fibbi Mar 25 '20 at 10:39

1 Answer


This looks like an I/O performance issue. To test, I created a very simple example to compare your 2 methods. (My code is at the end of the post.) I found the exact opposite behavior (assuming my code mimics your process): writing the dataset is slower when I don't use the maxshape=() parameter, 62 sec to create without maxshape versus 16 sec to create with maxshape. To verify the operations aren't order dependent, I also ran the test creating file _2 first, then file _1, and got very similar results.
Here is the timing data:

create data_X time: 62.60318350791931  
create data_y time: 0.010000228881835  
** file 1 Done **   

create data_X time: 16.416041135787964  
create data_y time: 0.0199999809265136  
** file 2 Done ** 

Code to create the 2 files is below:

import h5py
import numpy as np
import time

n_rows = 200722
X_cols = 4919
y_cols = 6

X = np.random.rand(n_rows,X_cols).astype('float32')
y = np.random.rand(n_rows,y_cols).astype('float32')

t0 = time.time()
with h5py.File('SO_60818355_1.h5', 'w') as h5f:
    h5f.create_dataset('data_X', data=X)
    t1 = time.time()
    print(f'create data_X time: {t1-t0}')

    h5f.create_dataset('data_y', data=y)
    t2 = time.time()
    print(f'create data_y time: {t2-t1}')
print('** file 1 Done ** \n')

t0 = time.time()
with h5py.File('SO_60818355_2.h5', 'w') as h5f:
    h5f.create_dataset('data_X', data=X, maxshape=(None, X_cols))
    t1 = time.time()
    print(f'create data_X time: {t1-t0}')

    h5f.create_dataset('data_y', data=y, maxshape=(None, y_cols))
    t2 = time.time()
    print(f'create data_y time: {t2-t1}')
print('** file 2 Done ** \n')