
I'm working in Python with a large dataset of images, each 3600x1800. There are ~360000 images in total. Each image is added to the stack one by one, since an initial processing step is run on each image. h5py has proven effective at building the stack image by image without filling the entire memory.

The analysis I am running is calculated on the grid cells, i.e. on 1x1x360000 slices of the stack. Since the analysis of each slice depends on the max and min values within that slice, I think it is necessary to hold the 360000-long array in memory. I have a fair bit of RAM to work with (~100 GB), but not enough to hold the entire 3600x1800x360000 stack in memory at once.

This means I need (or I think I need) a time-efficient way of accessing the 360000-long arrays. While h5py is efficient at adding each image to the stack, it seems that slicing perpendicular to the images is much, much slower (hours or more).

Am I missing an obvious method to slice the data perpendicular to the images?

The code below is a timing benchmark for 2 different slice directions:

import time
import h5py

file = "file/path/to/large/stack.h5"

# Read one full image (a slice in the plane of the images)
t0 = time.time()
with h5py.File(file, 'r') as f:
    dat = f['Merged_liqprec'][:, :, 1]

print('Time = ' + str(time.time() - t0))

# Read one grid-cell slice through the stack (perpendicular to the images)
t1 = time.time()
with h5py.File(file, 'r') as f:
    dat = f['Merged_liqprec'][500, 500, :]

print('Time = ' + str(time.time() - t1))

Output:

## time to read an image slice, e.g. [:,:,1]:
Time = 0.0701  

## time to read a slice through the image stack, e.g. [500,500,:]:
Time = multiple hours, server went offline for maintenance while running
  • The *seek time* of a hard disk drive is huge: about 4-10 ms on average, so non-contiguous reads are very expensive. The OS may read the whole file just to avoid seek time, even though 99% of the content is wasted in that case. 360000 seeks will likely take at least dozens of minutes. SSDs are far better since they can reach hundreds of thousands of fetches per second, but even on such a storage device, fetching 360000 individual values will still take a few seconds and be very inefficient. That being said, nearby values will likely be cached (in RAM) by the OS. Even random accesses in RAM are much slower than contiguous ones. – Jérôme Richard Sep 20 '22 at 21:36
  • The golden rule of storage devices is to read/write data in *large chunks* (e.g. at least 256 KiB on HDDs and at least 4 KiB on SSDs). One way to make that simpler is to do an optimized multi-pass tiled transposition, then do your computation, then transpose again (or write the transposed data to a new file if possible). This is not as good as a direct chunk-based implementation, though. – Jérôme Richard Sep 20 '22 at 21:43
  • @Robert: interesting use-case you have there! A bit off-track, but would you mind explaining the reasoning behind organizing a stack of images as a three-dimensional array of 3600x1800x360000 instead of 360000x3600x1800? Sometimes I see the last dimension being used to differentiate images, which seems less intuitive/natural than using the first dimension. Thanks! – user18371559 Sep 22 '22 at 07:25
  • @Jérôme Richard makes a good point about the I/O bottleneck and reading large chunks on each read (e.g., array slices larger than `[1,1,360000]`). Couple that with HDF5 chunked data storage, and my tests show a dramatic improvement (at least for the small 2.6 GB file I can create). I am going to add another answer that shows this approach. It looks very promising. – kcw78 Sep 22 '22 at 18:17
  • You have to set up a properly sized chunk cache (by default only 1 MB, but it can be set to some GB) and find the optimal chunk shape; see https://stackoverflow.com/a/48405220/4045774 and the sketch after these comments. Also avoid closing and reopening the file between the two `with ...` blocks. And remember that chunks are always read/written in their entirety, e.g. with a chunk shape of `1x1x360000` you would read the whole file in your first example. – max9111 Sep 30 '22 at 12:30
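
A minimal sketch of the chunk-cache tuning mentioned in the last comment, assuming the dataset name 'Merged_liqprec' from the question; the cache size and slot count are illustrative values rather than tuned recommendations:

import h5py

# HDF5/h5py use a 1 MiB chunk cache per open dataset by default; a larger cache
# lets recently read chunks be reused from RAM instead of re-read from disk.
with h5py.File("file/path/to/large/stack.h5", "r",
               rdcc_nbytes=2 * 1024**3,      # raw chunk-cache size in bytes (2 GiB here)
               rdcc_nslots=1_000_003) as f:  # hash-table slots; a large prime reduces collisions
    ds = f['Merged_liqprec']
    print(ds.shape, ds.chunks)

Note that the cache only helps when the same chunks are touched repeatedly; it does not remove the need for a sensible chunk shape, as the answers below show.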

2 Answers


You have the right approach. Using NumPy slicing notation to read only the slice of interest keeps the memory footprint small.

However, with this large dataset, I suspect I/O performance is going to depend on chunking: 1) did you define chunks= when you created the dataset, and 2) if so, what chunk shape did you use? The chunk shape is the shape of the data block used when reading or writing: whenever any element of a chunk is accessed, the entire chunk is read from disk. As a result, you will not be able to optimize the shape for both writing images (3600x1800x1) and reading stacked slices (1x1x360000). The optimal shape for writing an image is (shape[0], shape[1], 1) and for reading a stacked slice is (1, 1, shape[2]). You can check the current layout with the snippet below.
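
A quick way to check this on the existing file (a small sketch that only assumes the dataset name 'Merged_liqprec' from the question):

import h5py

with h5py.File("file/path/to/large/stack.h5", "r") as f:
    ds = f['Merged_liqprec']
    print("shape       :", ds.shape)
    print("chunks      :", ds.chunks)        # None means contiguous (not chunked) storage
    print("compression :", ds.compression)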

Tuning the chunk shape is not trivial. The h5py docs recommend a chunk size between 10 KiB and 1 MiB, larger for larger datasets. You could start with chunks=True (h5py determines the best shape for you) and see if that helps. Assuming you will create this file once and read it many times, I suggest optimizing for reading; however, creating the file may then take a long time. I wrote a simple example that you can use to "tinker" with the chunk shape on a small file and observe the behavior before you work on the large file. The tables below show the effect of different chunk shapes for 2 different file sizes. The code follows the tables.

Dataset shape=(36, 18, 360)

chunks        Write time (s)   Read time (s)
None          0.349            0.102
(36, 18, 1)   0.187            0.436
(1, 1, 360)   2.347            0.095

Dataset shape=(144, 72, 1440) (4x4x4 larger)

chunks               Write time (s)   Read time (s)
None                 59.963           1.248
(9, 9, 180) / True   11.334           1.588
(144, 72, 100)       79.844           2.637
(10, 10, 1440)       56.945           1.464

Code below:

import time
import h5py
import numpy as np

# Scale factor: f=1 gives the small dataset above, f=4 the larger one
f = 4
a0, a1, n = f*36, f*18, f*360
c0, c1, cn = a0, a1, 100
#c0, c1, cn = 10, 10, n

arr = np.random.randint(256, size=(a0, a1))

print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5', 'w') as h5f:
    # Use chunks=True for the default shape, or chunks=(c0,c1,cn) for a user-defined shape
    ds = h5f.create_dataset('images', dtype=int, shape=(a0, a1, n), chunks=True)
    print(f'chunks={ds.chunks}')
    for i in range(n):
        ds[:, :, i] = arr   # write one "image" at a time
print(f"Time to create file:{(time.time()-start): .3f}")

start = time.time()
with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    for i in range(ds.shape[0]):
        for j in range(ds.shape[1]):
            dat = ds[i, j, :]   # read one stacked slice at a time
print(f"Time to read file:{(time.time()-start): .3f}")
kcw78

HDF5 chunked storage improves I/O time, but reading small slices (e.g., [1,1,360000]) can still be very slow. To further improve performance, you need to read larger slices into an array, as described by @Jérôme Richard in the comments below your question. You can then quickly access any single slice from that array (because it is in memory).

This answer combines the 2 techniques: 1) HDF5 chunked storage (from my first answer), and 2) reading large slices into an array and then reading each [i,j] slice from that array. The code to create the file is very similar to the first answer. It is set up to create a dataset of shape [3600, 1800, 100] with the default chunk shape ([113, 57, 7] on my system). You can increase n to test with larger datasets.

When reading the file, the sizes of the large slice in the first two dimensions are set equal to the corresponding chunk dimensions (so each chunk is only accessed once). As a result, the process to read the file is "slightly more complicated" (but worth it). There are 2 loops: the 1st loop reads a large slice into an array 'dat_chunk', and the 2nd loop reads a [1,1,:] slice from 'dat_chunk' into a second array 'dat'.

The difference in timing data for 100 images is dramatic: it takes 8 seconds to read all of the data with the method below, versus 74 minutes with the first answer (reading each [i,j] pair directly). Clearly that is too slow. Just for fun, I increased the dataset size to 1000 images (shape=[3600, 1800, 1000]) in my test file and reran. It takes 4:33 (mm:ss) to read all slices with this method. I didn't even try with the previous method (for obvious reasons). Note: my computer is pretty old and slow with an HDD, so your timing data should be faster.

Code to create the file:

import time
import h5py
import numpy as np

a0, a1, n = 3600, 1800, 100
print(f'Writing dataset shape=({a0},{a1},{n})')
start = time.time()
with h5py.File('SO_73791464.h5', 'w') as h5f:
    # chunks=True lets h5py pick a default chunk shape
    ds = h5f.create_dataset('images', dtype=int, shape=(a0, a1, n), chunks=True)
    print(f'chunks={ds.chunks}')
    for i in range(n):
        arr = np.random.randint(256, size=(a0, a1))
        ds[:, :, i] = arr   # write one random "image" at a time
print(f"Time to create file:{(time.time()-start): .3f}")

Code to read the file using large array slices:

import time
import h5py

start = time.time()
with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    print(f'shape={ds.shape}')
    print(f'chunks={ds.chunks}')
    ds_i_max, ds_j_max, ds_k_max = ds.shape
    ch_i_max, ch_j_max, ch_k_max = ds.chunks

    i = 0
    while i < ds_i_max:
        i_stop = min(i+ch_i_max, ds_i_max)
        print(f'i_range: {i}:{i_stop}')
        j = 0
        while j < ds_j_max:
            j_stop = min(j+ch_j_max, ds_j_max)
            print(f'  j_range: {j}:{j_stop}')
            # Read one chunk-aligned block of the dataset into memory
            dat_chunk = ds[i:i_stop, j:j_stop, :]
            # print(dat_chunk.shape)
            for ic in range(dat_chunk.shape[0]):
                for jc in range(dat_chunk.shape[1]):
                    dat = dat_chunk[ic, jc, :]   # fast in-memory slice
            j = j_stop
        i = i_stop

print(f"Time to read file:{(time.time()-start): .3f}")
kcw78
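
As a possible follow-up (not part of the original answer): since the analysis in the question only needs the min and max within each [i,j,:] slice, those can be computed for a whole chunk-aligned block at once with NumPy reductions, avoiding the two inner Python loops entirely. A minimal sketch, reusing the test file and chunk-aligned loop idea from the answer above:

import numpy as np
import h5py

with h5py.File('SO_73791464.h5', 'r') as h5f:
    ds = h5f['images']
    ch_i, ch_j, _ = ds.chunks
    # Per-grid-cell minima/maxima over the stack dimension
    mins = np.empty(ds.shape[:2], dtype=ds.dtype)
    maxs = np.empty(ds.shape[:2], dtype=ds.dtype)
    for i in range(0, ds.shape[0], ch_i):
        for j in range(0, ds.shape[1], ch_j):
            block = ds[i:i+ch_i, j:j+ch_j, :]   # one chunk-aligned block in memory
            mins[i:i+block.shape[0], j:j+block.shape[1]] = block.min(axis=2)
            maxs[i:i+block.shape[0], j:j+block.shape[1]] = block.max(axis=2)

The mins/maxs arrays are only 2-D (one value per grid cell), so they fit comfortably in memory even for the full 3600x1800 grid.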