I am testing efficient ways of saving and retrieving data using h5py, but I am having trouble with the running time while also trying not to use up all of my memory.
In my first method I simply create a static h5py file:
with h5py.File(fileName, 'w') as f:
    f.create_dataset('data_X', data=X, dtype='float32')
    f.create_dataset('data_y', data=y, dtype='float32')
In the second method, I set the parameter maxshape in order to be able to append more training data in the future (see How to append data to one specific dataset in a hdf5 file with h5py):
with h5py.File(fileName2, 'w') as f:
    f.create_dataset('data_X', data=X, dtype='float32', maxshape=(None, 4919))
    f.create_dataset('data_y', data=y, dtype='float32', maxshape=(None, 6))
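For context, the append pattern the linked answer describes would look roughly like this; a minimal sketch, where the file name, array sizes, and the data itself are placeholder assumptions for illustration:

```python
import h5py
import numpy as np

# Placeholder data; in my case X has 4919 columns.
X = np.random.rand(10, 4919).astype('float32')
X_new = np.random.rand(5, 4919).astype('float32')

# Create a resizable dataset (maxshape=None along axis 0).
with h5py.File('appendable.h5', 'w') as f:
    f.create_dataset('data_X', data=X, dtype='float32', maxshape=(None, 4919))

# Later: grow the dataset along axis 0 and write the new rows.
with h5py.File('appendable.h5', 'a') as f:
    dset = f['data_X']
    old_rows = dset.shape[0]
    dset.resize(old_rows + X_new.shape[0], axis=0)
    dset[old_rows:] = X_new

with h5py.File('appendable.h5', 'r') as f:
    print(f['data_X'].shape)  # (15, 4919)
```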
I am using PyTorch and set up my data loader as follows:
class H5Dataset_all(torch.utils.data.Dataset):
    def __init__(self, h5_path):
        # super(dataset_h5, self).__init__()
        self.h5_path = h5_path
        self._h5_gen = None

    def __getitem__(self, index):
        if self._h5_gen is None:
            self._h5_gen = self._get_generator()
            next(self._h5_gen)
        return self._h5_gen.send(index)

    def _get_generator(self):
        with h5py.File(self.h5_path, 'r') as record:
            index = yield
            while True:
                X = record['data_X'][index]
                y = record['data_y'][index]
                index = yield X, y

    def __len__(self):
        with h5py.File(self.h5_path, 'r') as record:
            length = record['data_X'].shape[0]
        return length
loader = Data.DataLoader(
    dataset=H5Dataset_all(filename),
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=0)
Having saved the same data with each of these methods, I would expect them to be similar in running time; however, that is not the case. The data I used has size X.shape=(200722, 4919) and y.shape=(200772, 6). The files are about 3.6 GB each.
I test the running time using:
import time

t0 = time.time()
for i, (X_batch, y_batch) in enumerate(loader):
    # assign a dummy value
    a = 0
t1 = time.time() - t0
print(f'time: {t1}')
For the first method the running time is 83 s, and for the second it is 1216 s, which in my mind doesn't make sense. Can anyone help me figure out why?
Additionally, I also tried saving/loading the data as a torch file using torch.save and torch.load, and passing it to Data.TensorDataset before setting up the loader. This implementation runs significantly faster (about 3.7 s), but has the disadvantage of having to load the files before training, which could quickly cap my memory.
Is there a better way in which I can train reasonably fast without having to load all of the data into memory before training?