
I have to store sub-samples of large images as .npy arrays of size (20,20,5). I am looking for an efficient way to store nearly 10 million of these sub-samples in a way that allows uniform sampling when training a classification model.

If I store them as entire images, sampling during training wouldn't be representative of the distribution. I have the storage space, but I would run out of inodes trying to store that many "small" files. h5py / writing to an HDF5 file is a natural answer to my problem; however, the process has been very slow. Running a program for a day and a half was not enough time to write all the sub-samples. I am new to h5py and I am wondering if too many individual writes are the cause of this.

If so, I am unsure of how to chunk properly so as to avoid the problem of non-uniform sampling. Each image has a varying number of sub-samples (e.g. one image may be (20000,20,20,5) and another may be (32123,20,20,5)).

This is the code I use to write each sample to the .hdf5:

import random
import h5py
import numpy as np

# define possible groups
groups = ['training_samples', 'validation_samples', 'test_samples']

f = h5py.File('~/.../TrainingData_.hdf5', 'a', libver='latest')

At this point I run a sub-sampling function that returns a NumPy array, trarray, of size (x,20,20,5).

Then:

# (assumes the training_samples, training_labels, validation_samples, validation_labels,
#  test_samples, and test_labels groups were created earlier in the file)
label = np.array([1])
indx = 0
for i in range(trarray.shape[0]):
    group_choice = random.choices(groups, weights=[65, 15, 20])
    subarr = trarray[i, :, :, :]

    if group_choice[0] == 'training_samples':
        training_samples.create_dataset('ID-{}'.format(indx), data=subarr)
        training_labels.create_dataset('ID-{}'.format(indx), data=label)
        indx += 1
    elif group_choice[0] == 'validation_samples':
        validation_samples.create_dataset('ID-{}'.format(indx), data=subarr)
        validation_labels.create_dataset('ID-{}'.format(indx), data=label)
        indx += 1
    else:
        test_samples.create_dataset('ID-{}'.format(indx), data=subarr)
        test_labels.create_dataset('ID-{}'.format(indx), data=label)
        indx += 1

Is there something I could do to improve this, or is there something I am doing that is fundamentally wrong with regard to using h5py?

  • For the optimal chunk size and chunk shape it is very important to know the exact reading/writing pattern. You also have to set up the chunk cache correctly (the default of 1 MB is often far too small); example: https://stackoverflow.com/a/48405220/4045774. The chunk size also has a large influence on the write speed, depending on the storage system: https://stackoverflow.com/a/44961222/4045774 – max9111 Mar 17 '21 at 14:25
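
For illustration, here is a minimal sketch (not from the original post; the file name, cache size, and chunk shape are assumptions) of how the chunk cache and chunk shape mentioned in the comment above can be set with h5py:

import h5py

# Hypothetical example: enlarge the raw-data chunk cache (default is 1 MB) and
# create one resizable dataset whose chunk shape holds many (20,20,5) samples.
with h5py.File('TrainingData_chunked.hdf5', 'w',
               rdcc_nbytes=64 * 1024**2,   # 64 MiB chunk cache
               rdcc_nslots=12007) as f:    # number of cache hash slots (a prime is recommended)
    dset = f.create_dataset('training_samples',
                            shape=(0, 20, 20, 5),
                            maxshape=(None, 20, 20, 5),  # resizable along axis 0
                            chunks=(1024, 20, 20, 5),    # ~2 MB chunks of uint8 samples
                            dtype='uint8')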

2 Answers


03-22-2021: See the update about attributes noted below.
This is an interesting use case. My answer to a previous question touched on this issue (referenced in my first answer to this question). Clearly, the overhead of writing a large number of small objects exceeds the time of the actual writes. I was curious, so I created a prototype to explore different processes for writing the data.

My starting scenario:

  1. I created a NumPy array of random integers with shape (NN,20,20,5).
  2. I then followed your logic to slice 1 row at a time and allocate it as a training, validation, or test sample.
  3. I wrote the slice as a new dataset in the appropriate group.
  4. I added attributes to the group to reference the slice # for each dataset.

Key findings:

  1. Time to write each array slice to a new dataset remains relatively constant throughout the process.
  2. However, write times grow exponentially as the number of attributes (NN) increases. I did not realize this in my initial post. For small values of NN (<2,000), adding attributes is relatively fast.

Table with incremental write times for each 1,000 slices (without and with attributes). (Multiply by NN/1000 for total time.)

Slice count    Time (sec, w/out attrs)    Time (sec, with attrs)
 1_000         0.34                          2.4
 2_000         0.34                         12.7
 5_000         0.33                        111.7
10_000         0.34                       1783.3
20_000         0.35                          n/a

Obviously using attributes is not an efficient way to save the slice indices. Instead, I captured the index as part of the dataset name. This is shown in the "original" code below. The code to add attributes is included in case that is of interest.
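
As an aside, a tiny sketch (mine, not from the original answer) of how the slice index can be recovered later from the 'ID-nnnn' dataset names used in the code below:

import h5py

# Recover the slice number encoded in each dataset name, e.g. 'ID-0042' -> 42.
with h5py.File('TrainingData_orig.hdf5', 'r') as h5f:
    for name in h5f['training_samples']:
        slice_no = int(name.split('-')[1])
        sample = h5f['training_samples'][name][...]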

I created a new process to do all of the slicing first, then write all of the data in 3 steps (1 each for the training, validation, and test samples). Since you can't get the slice indices from the dataset names, I tested 2 different ways to save that data: 1) as a second "index" dataset for each "sample" dataset and 2) as group attributes. Both methods are significantly faster. Writing indices as an index dataset has almost no impact on performance. Writing them as attributes is much slower. The data:

Table with total write times for all slices (with no indices, with an index dataset, and with attributes).

Slice count    Time (sec, no indices)    Time (sec, index dataset)    Time (sec, with attrs)
10_000         0.43                      0.57                         141.05
20_000         1.17                      1.27                         n/a

This method looks like a promising way to slice and write your data to HDF5 in a reasonable amount of time. You will still have to work out the indexing notation.

Code for starting scenario:

import random
import timeit

import h5py
import numpy as np

# define possible groups
groups = ['training', 'validation', 'test']

# one image may be (20000,20,20,5)
trarray = np.random.randint(1, 255, (20_000, 20, 20, 5))
label = np.array([1])

with h5py.File('TrainingData_orig.hdf5', 'w') as h5f:
    # At this point I run a sub-sampling function that returns a NumPy array,
    # trarray, of size (x,20,20,5).
    for group in groups:
        h5f.create_group(group + '_samples')
        h5f.create_group(group + '_labels')

    time0 = timeit.default_timer()
    for i in range(trarray.shape[0]):
        group_choice = random.choices(groups, weights=[65, 15, 20])

        h5f[group_choice[0] + '_samples'].create_dataset(f'ID-{i:04}', data=trarray[i, :, :, :])
        #h5f[group_choice[0] + '_labels'].create_dataset(f'ID-{i:04}', data=label)
        #h5f[group_choice[0] + '_samples'].attrs[f'ID-{i:04}'] = label

        if (i + 1) % 1000 == 0:
            exe_time = timeit.default_timer() - time0
            print(f'incremental time to write {i+1} datasets = {exe_time:.2f} secs')
            time0 = timeit.default_timer()

Code for test scenario:
Note: calls to write attributes to groups are commented out.

import random
import timeit

import h5py
import numpy as np

# define possible groups
groups = ['training_samples', 'validation_samples', 'test_samples']

# one image may be (20000,20,20,5)
trarray = np.random.randint(1, 255, (20_000, 20, 20, 5))
training   = np.empty(trarray.shape, dtype=np.int32)
validation = np.empty(trarray.shape, dtype=np.int32)
test       = np.empty(trarray.shape, dtype=np.int32)

indx1, indx2, indx3 = 0, 0, 0
training_list = []
validation_list = []
test_list = []

training_idx   = np.empty((trarray.shape[0], 2), dtype=np.int32)
validation_idx = np.empty((trarray.shape[0], 2), dtype=np.int32)
test_idx       = np.empty((trarray.shape[0], 2), dtype=np.int32)

start = timeit.default_timer()

# At this point I run a sub-sampling function that returns a NumPy array,
# trarray, of size (x,20,20,5).
for i in range(trarray.shape[0]):
    group_choice = random.choices(groups, weights=[65, 15, 20])
    if group_choice[0] == 'training_samples':
        training[indx1, :, :, :] = trarray[i, :, :, :]
        training_list.append((f'ID-{indx1:04}', i))
        training_idx[indx1, :] = [indx1, i]
        indx1 += 1
    elif group_choice[0] == 'validation_samples':
        validation[indx2, :, :, :] = trarray[i, :, :, :]
        validation_list.append((f'ID-{indx2:04}', i))
        validation_idx[indx2, :] = [indx2, i]
        indx2 += 1
    else:
        test[indx3, :, :, :] = trarray[i, :, :, :]
        test_list.append((f'ID-{indx3:04}', i))
        test_idx[indx3, :] = [indx3, i]
        indx3 += 1


with h5py.File('TrainingData1_.hdf5', 'w') as h5f:

    h5f.create_group('training')
    h5f['training'].create_dataset('training_samples', data=training[0:indx1, :, :, :])
    h5f['training'].create_dataset('training_indices', data=training_idx[0:indx1, :])
    # for label, idx in training_list:
    #     h5f['training']['training_samples'].attrs[label] = idx

    h5f.create_group('validation')
    h5f['validation'].create_dataset('validation_samples', data=validation[0:indx2, :, :, :])
    h5f['validation'].create_dataset('validation_indices', data=validation_idx[0:indx2, :])
    # for label, idx in validation_list:
    #     h5f['validation']['validation_samples'].attrs[label] = idx

    h5f.create_group('test')
    h5f['test'].create_dataset('test_samples', data=test[0:indx3, :, :, :])
    h5f['test'].create_dataset('test_indices', data=test_idx[0:indx3, :])
    # for label, idx in test_list:
    #     h5f['test']['test_samples'].attrs[label] = idx

exe_time = timeit.default_timer() - start
print(f'Write time for {trarray.shape[0]} image slices = {exe_time:.2f} secs')
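
As a follow-up (my sketch, not part of the original answer), uniform random batches can then be drawn from the consolidated dataset during training; note that h5py's fancy indexing requires the index list to be in increasing order:

import h5py
import numpy as np

batch_size = 256
with h5py.File('TrainingData1_.hdf5', 'r') as h5f:
    dset = h5f['training']['training_samples']
    # sorted random indices (h5py needs increasing order for fancy indexing)
    idx = np.sort(np.random.choice(dset.shape[0], size=batch_size, replace=False))
    batch = dset[idx, :, :, :]                              # (batch_size, 20, 20, 5)
    indices = h5f['training']['training_indices'][idx, :]   # [local index, original slice]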
– kcw78

Chunked storage is designed to optimized I/O of very large datasets. Your datasets are (1,20,20,5), right? If so, that's pretty small (in the HDF5 world), so I don't think chunking will help.

If I understand, you are going to create a new dataset for each sub-sample based on the size of trarray.shape[0] (gives 20,000 to 32,123 sub-samples -- your loop length). That's a lot of individual writes.

I did some I/O testing a few years back, and discovered h5py (and PyTables) write performance is dominated by the number of I/O operations and NOT the size of the dataset being written. Take a look at this Answer: pytables writes much faster than h5py. Why? It compares I/O performance (for h5py and PyTables) when writing the same total amount of data using different sizes of I/O data blocks. The first key finding applies here: Total time to write all of the data was a linear function of the # of loops (for both PyTables and h5py).

The way to improve run time is to reduce the number of I/O loops. Some ideas:

  • Is there a way you can collect the training, validation, and test samples in NumPy arrays, then write all at once as a single dataset?
  • If not, can you size and create 3 empty datasets (for training, validation, test) then write the data in each loop to the appropriate dataset and index? This might save time since you are only writing and not allocating. (Need to test to be sure).
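
A rough sketch of the second idea (illustrative names and dummy data; not code from the original post): pre-create one fixed-size dataset per split, then write each sub-sample into its next free row instead of creating a new dataset per sub-sample:

import random

import h5py
import numpy as np

groups = ['training', 'validation', 'test']
trarray = np.random.randint(1, 255, (20_000, 20, 20, 5))   # dummy data

with h5py.File('TrainingData_prealloc.hdf5', 'w') as h5f:
    # oversize each dataset to the full sample count; track the rows actually used
    dsets = {g: h5f.create_dataset(g + '_samples', shape=trarray.shape, dtype=trarray.dtype)
             for g in groups}
    counts = {g: 0 for g in groups}
    for i in range(trarray.shape[0]):
        g = random.choices(groups, weights=[65, 15, 20])[0]
        dsets[g][counts[g], :, :, :] = trarray[i, :, :, :]
        counts[g] += 1
    for g in groups:
        dsets[g].attrs['n_samples'] = counts[g]   # number of valid rows in each dataset
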
– kcw78
  • Thank you - I think you're right re writing time being a linear function of the # of sub-samples. I cannot create a NumPy array larger than what fits in memory, which is why I was writing to a .hdf5. Unfortunately, by the 3rd image the operation time is around 30 minutes, so that isn't feasible. – Matt Plaudis Mar 16 '21 at 22:18
  • Here's another idea to minimize I/O operations. Create 3 datasets (training, validation, test) plus 1 additional dataset with HDF5 region references to the appropriate sets. You can allocate the datasets to hold all the data and avoid reallocation overhead. Then loop thru your data and stack up "a lot" of samples and write to the dataset (whatever you can hold in RAM). With some more details, I think I could create a prototype using dummy data (a random Numpy array). – kcw78 Mar 16 '21 at 23:55
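
A minimal sketch of the region-reference idea in the comment above (my illustration with dummy data; the file and dataset names are assumptions):

import h5py
import numpy as np

samples = np.random.randint(1, 255, (1_000, 20, 20, 5))   # dummy data

with h5py.File('region_refs_demo.hdf5', 'w') as h5f:
    dset = h5f.create_dataset('samples', data=samples)
    # one region reference per block of 100 samples
    refs = h5f.create_dataset('training_refs', (10,), dtype=h5py.regionref_dtype)
    for k in range(10):
        refs[k] = dset.regionref[k * 100:(k + 1) * 100, :, :, :]

with h5py.File('region_refs_demo.hdf5', 'r') as h5f:
    ref = h5f['training_refs'][0]
    block = h5f[ref][ref]   # dereference the dataset, then read just that region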