I have 8 large h5 files (~100 GB each), each containing several datasets (say 'x', 'y', 'z', 'h'). I'd like to merge the 'x' and 'y' datasets from all 8 files into a test.h5 and a train.h5 file. Is there a fast way to do this? In total I have 800080 rows, so I first create my train file:
import os
import h5py
import numpy as np
from tqdm import tqdm

train_file = h5py.File(os.path.join(base_path, 'data/train.h5'), 'w', libver='latest')
and after calculating a random split I create the datasets:
train_file.create_dataset('x', (num_train, 256, 256, 1))
train_file.create_dataset('y', (num_train, 1))
[similarly for test_file]
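For completeness, the test file is set up the same way (sketched here, assuming num_test holds the remaining row count):

test_file = h5py.File(os.path.join(base_path, 'data/test.h5'), 'w', libver='latest')
test_file.create_dataset('x', (num_test, 256, 256, 1))
test_file.create_dataset('y', (num_test, 1))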
train_indeces = np.asarray([1]*num_train + [0]*num_test)
np.random.shuffle(train_indeces)
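For context, num_train and num_test come from splitting the 800080 total rows; the 90/10 fraction below is only a placeholder for illustration:

total_rows = 800080
train_fraction = 0.9  # placeholder split, the real fraction may differ
num_train = int(total_rows * train_fraction)
num_test = total_rows - num_train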
Then I iterate over each of my 8 files and copy every row into either the train or the test file:
indeces_index = 0
last_train_index = 0
last_test_index = 0
for e in files:
    print(f'FILE: {e}')
    rnd_file = h5py.File(f'{base_path}data/{e}', 'r', libver='latest')
    for j in tqdm(range(rnd_file['x'].shape[0])):
        # write this row to train or test according to the shuffled split
        if train_indeces[indeces_index] == 1:
            train_file['x'][last_train_index] = rnd_file['x'][j]
            train_file['y'][last_train_index] = rnd_file['y'][j]
            last_train_index += 1
        else:
            test_file['x'][last_test_index] = rnd_file['x'][j]
            test_file['y'][last_test_index] = rnd_file['y'][j]
            last_test_index += 1
        indeces_index += 1
    rnd_file.close()
But by my calculations this would take ~12 days to run. Is there a (much) faster way to do this? Thanks in advance.