03-22-2021: See the update about attributes below.
This is an interesting use case. My answer to a previous question touched on this issue (referenced in my first answer to this question). Clearly, when writing a large number of small objects, the overhead exceeds the time spent on the actual writes. I was curious, so I created a prototype to explore different processes for writing the data.
My starting scenario:
- I created a NumPy array of random integers with shape (NN,20,20,5).
- I then followed your logic to slice 1 row at a time and allocate it as a training, validation, or test sample.
- I wrote the slice as a new dataset in the appropriate group.
- I added attributes to the group to reference the slice # for each dataset.
Key findings:
- Time to write each array slice to a new dataset remains relatively constant throughout the process.
- However, write times grow exponentially as the number of attributes (NN) increases. I had not appreciated this in my initial post. For small values of NN (<2,000), adding attributes is relatively fast.
Table with incremental write times for each 1,000 slices (without and with attributes). (Multiply by NN/1000 for total time.)
Slice   | Time (sec)    | Time (sec)
Count   | (w/out attrs) | (with attrs)
--------|---------------|-------------
 1_000  |     0.34      |     2.4
 2_000  |     0.34      |    12.7
 5_000  |     0.33      |   111.7
10_000  |     0.34      |  1783.3
20_000  |     0.35      |     n/a
Obviously, using attributes is not an efficient way to save the slice indices. Instead, I captured the slice index as part of the dataset name. This is shown in the "original" code below. The code to add attributes is included (commented out) in case it is of interest.
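If you go that route, recovering the slice index later just means parsing the dataset name. A minimal sketch, assuming the `ID-####` naming and file created by the "original" code below:

import h5py

# Minimal sketch: recover the slice index from each dataset name.
# Assumes the file and 'ID-####' names created by the "original" code below.
with h5py.File('TrainingData_orig.hdf5', 'r') as h5f:
    for name in h5f['training_samples']:
        slice_idx = int(name.split('-')[1])        # 'ID-0042' -> 42
        arr = h5f['training_samples'][name][()]    # read the (20,20,5) slice
        print(name, slice_idx, arr.shape)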
I then created a new process that does all of the slicing first and writes all of the data in 3 steps (1 each for the training, validation, and test samples). Since you can't get the slice indices from the dataset names with this approach, I tested 2 different ways to save that data: 1) as a second "index" dataset paired with each "sample" dataset, and 2) as attributes on the "sample" dataset. Both methods are significantly faster than writing one slice at a time. Writing the indices as an index dataset has almost no impact on performance; writing them as attributes is much slower. The data:
Table with total write times for all slices (no indices, index dataset, and attributes).
Slice   | Time (secs)  | Time (secs)     | Time (secs)
Count   | (no indices) | (index dataset) | (with attrs)
--------|--------------|-----------------|-------------
10_000  |     0.43     |      0.57       |    141.05
20_000  |     1.17     |      1.27       |      n/a
This method looks like a promising way to slice and write your data to HDF5 in a reasonable amount of time. You will have to work out the indexing notation for your data.
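To give an idea of what that indexing looks like, here is a minimal read-back sketch. It assumes the layout written by the test scenario code further below, where each "samples" dataset has a matching "indices" dataset whose second column holds the original slice number:

import h5py

# Minimal read-back sketch (assumes the layout written by the test scenario below).
with h5py.File('TrainingData1_.hdf5', 'r') as h5f:
    samples = h5f['training']['training_samples']    # shape (n_train, 20, 20, 5)
    indices = h5f['training']['training_indices']    # shape (n_train, 2): [row, original slice]
    row = 0
    sample = samples[row, :, :, :]                   # one (20,20,5) training sample
    orig_slice = indices[row, 1]                     # original slice number in trarray
    print(f'row {row} came from original slice {orig_slice}, shape={sample.shape}')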
Code for starting scenario:
import numpy as np
import h5py
import random
import timeit

# define possible groups
groups = ['training', 'validation', 'test']

# one image may be (20000,20,20,5)
trarray = np.random.randint(1, 255, (20_000, 20, 20, 5))
label = np.array([1])

with h5py.File('TrainingData_orig.hdf5', 'w') as h5f:
    # At this point I run a sub-sampling function that returns a NumPy array,
    # trarray, of size (x,20,20,5).
    for group in groups:
        h5f.create_group(group + '_samples')
        h5f.create_group(group + '_labels')

    time0 = timeit.default_timer()
    for i in range(trarray.shape[0]):
        # randomly assign this slice to training/validation/test (65/15/20 split)
        group_choice = random.choices(groups, weights=[65, 15, 20])
        # the slice index is captured in the dataset name: 'ID-0000', 'ID-0001', ...
        h5f[group_choice[0] + '_samples'].create_dataset(f'ID-{i:04}', data=trarray[i, :, :, :])
        # h5f[group_choice[0]+'_labels'].create_dataset(f'ID-{i:04}', data=label)
        # h5f[group_choice[0]+'_samples'].attrs[f'ID-{i:04}'] = label
        if (i + 1) % 1000 == 0:
            exe_time = timeit.default_timer() - time0
            print(f'incremental time to write {i+1} datasets = {exe_time:.2f} secs')
            time0 = timeit.default_timer()
Code for test scenario:
Note: the calls that write the slice indices as attributes are commented out.
import numpy as np
import h5py
import random
import timeit

# define possible groups
groups = ['training_samples', 'validation_samples', 'test_samples']

# one image may be (20000,20,20,5)
trarray = np.random.randint(1, 255, (20_000, 20, 20, 5))

# pre-allocate arrays to hold the slices for each sample group
training = np.empty(trarray.shape, dtype=np.int32)
validation = np.empty(trarray.shape, dtype=np.int32)
test = np.empty(trarray.shape, dtype=np.int32)

indx1, indx2, indx3 = 0, 0, 0
training_list = []
validation_list = []
test_list = []
# index arrays map each row in a sample array back to the original slice number
training_idx = np.empty((trarray.shape[0], 2), dtype=np.int32)
validation_idx = np.empty((trarray.shape[0], 2), dtype=np.int32)
test_idx = np.empty((trarray.shape[0], 2), dtype=np.int32)

start = timeit.default_timer()
# At this point I run a sub-sampling function that returns a NumPy array,
# trarray, of size (x,20,20,5).
for i in range(trarray.shape[0]):
    group_choice = random.choices(groups, weights=[65, 15, 20])
    if group_choice[0] == 'training_samples':
        training[indx1, :, :, :] = trarray[i, :, :, :]
        training_list.append((f'ID-{indx1:04}', i))
        training_idx[indx1, :] = [indx1, i]
        indx1 += 1
    elif group_choice[0] == 'validation_samples':
        validation[indx2, :, :, :] = trarray[i, :, :, :]
        validation_list.append((f'ID-{indx2:04}', i))
        validation_idx[indx2, :] = [indx2, i]
        indx2 += 1
    else:
        test[indx3, :, :, :] = trarray[i, :, :, :]
        test_list.append((f'ID-{indx3:04}', i))
        test_idx[indx3, :] = [indx3, i]
        indx3 += 1

with h5py.File('TrainingData1_.hdf5', 'w') as h5f:
    h5f.create_group('training')
    h5f['training'].create_dataset('training_samples', data=training[0:indx1, :, :, :])
    h5f['training'].create_dataset('training_indices', data=training_idx[0:indx1, :])
    # for label, idx in training_list:
    #     h5f['training']['training_samples'].attrs[label] = idx

    h5f.create_group('validation')
    h5f['validation'].create_dataset('validation_samples', data=validation[0:indx2, :, :, :])
    h5f['validation'].create_dataset('validation_indices', data=validation_idx[0:indx2, :])
    # for label, idx in validation_list:
    #     h5f['validation']['validation_samples'].attrs[label] = idx

    h5f.create_group('test')
    h5f['test'].create_dataset('test_samples', data=test[0:indx3, :, :, :])
    h5f['test'].create_dataset('test_indices', data=test_idx[0:indx3, :])
    # for label, idx in test_list:
    #     h5f['test']['test_samples'].attrs[label] = idx

exe_time = timeit.default_timer() - start
print(f'Write time for {trarray.shape[0]} image slices = {exe_time:.2f} secs')
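If it helps, here is a quick sketch (just one way you might check the result, not part of the timing tests) that walks the new file with visititems() and prints the groups and datasets it contains:

import h5py

# Quick check of the file layout created by the test scenario above.
def describe(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f'{name}: dataset, shape={obj.shape}, dtype={obj.dtype}')
    else:
        print(f'{name}: group')

with h5py.File('TrainingData1_.hdf5', 'r') as h5f:
    h5f.visititems(describe)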