Your question is similar to a previous SO/h5py question I recently answered: h5py extremely slow writing. Apparently you are getting acceptable write performance, and want to improve read performance.
The 2 most important factors that affect h5py I/O performance are: 1) chunk size/shape, and 2) the size of the I/O data block. The h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. I have also found that write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. The size of your read data block is certainly large enough.
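As a quick sanity check against that 10 KB - 1 MB guideline, you can compute a chunk's size in bytes from its shape and dtype. A minimal sketch (the (40, 625) shape is the default chunk shape reported later in this answer; h5py uses float32 when no dtype is given, which is why these sizes assume 4 bytes/element):

import numpy as np

# bytes per chunk = elements per chunk * bytes per element
chunk_shape = (40, 625)              # example: default chunk shape seen below
itemsize = np.dtype('f4').itemsize   # h5py's default dtype is float32 (4 bytes)
chunk_bytes = np.prod(chunk_shape) * itemsize
print(f'chunk size: {chunk_bytes/1024:.0f} KiB')  # aim for ~10 KiB to 1 MiB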
So, my initial hunch was to investigate the influence of chunk size on I/O performance. Setting the optimal chunk size is a bit of an art. The best way to tune the value is to enable chunking, let h5py define the default size, and see if you get acceptable performance. You didn't define the chunks parameter. However, because you defined the maxshape parameter, chunking was automatically enabled with a default size (based on the dataset's initial size). (Without chunking, I/O on a file of this size would be painfully slow.) An additional consideration for your problem: the optimal chunk size has to balance the size of the write data blocks (5000 x 40_000) vs the read data blocks (1 x 30_000_000).
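If you want to verify that behavior, here is a minimal sketch (the filename is just for illustration) that creates an extendable dataset without a chunks parameter and inspects the chunk shape h5py picked:

import h5py

# A throwaway file just to show that maxshape auto-enables chunking:
with h5py.File('chunk_demo.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(5_000, 40_000),
                            maxshape=(5_000, None))  # no chunks= given
    print(dset.chunks)  # default chunk shape chosen by h5py, e.g. (40, 625)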
I parameterized your code so I could tinker with the dimensions. When I did, I discovered something interesting. Reading the data is much faster when I run it as a separate process after creating the file. And, the default chunk size seems to give adequate read performance. (Initially I was going to benchmark different chunk size values.)
Note: I only created a 78GB file (4_000_000 columns). That takes >13 minutes to run on my Windows system, and I didn't want to wait 90 minutes to create a 600GB file. You can set n_blocks=750 if you want to test 30_000_000 columns. :-) All code is at the end of this post.
Next I created a separate program to read the data. Read performance was fast with the default chunk size: (40, 625). Timing output below:
Time to read first row: 0.28 (in sec)
Time to read last row: 0.28
Interestingly, I did not get the same read times with every test. The values above were pretty consistent, but occasionally I would get a read time of 7-10 seconds. I'm not sure why that happens.
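If you want to factor that variation out, one option is to repeat each timed read and look at the spread. A sketch (assuming the test file created by the code at the end of this post; the cause of the slow outliers is not confirmed -- OS file caching is one possibility):

import time
import h5py

with h5py.File('SO_test.h5', 'r') as fin:     # file created by the code below
    times = []
    for _ in range(5):
        start = time.time()
        _ = fin['data'][0, :]                 # same read as the benchmark
        times.append(time.time() - start)
    print(f'read times (s): min={min(times):.2f}, max={max(times):.2f}')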
I ran 3 tests (in all cases block_to_write.shape=(5_000, 40_000)):
- default chunksize=(40,625) [95KB]; for the 5_000 x 40_000 dataset (resized)
- default chunksize=(10,15625) [596KB]; for the 5_000 x 4_000_000 dataset (not resized)
- user-defined chunksize=(10,40_000) [1.526MB]; for the 5_000 x 4_000_000 dataset (not resized); see the sketch after this list
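For that third test, the chunk shape is passed explicitly when the dataset is created (this matches the commented-out chunks= argument in the code at the end of this post). A minimal sketch, with an illustrative filename:

import h5py

with h5py.File('chunks_demo.h5', 'w') as f:
    # Each chunk holds 10 rows x 40_000 columns of float32:
    # 10 * 40_000 * 4 bytes = 1.526 MiB
    # (chunked storage is allocated lazily, so this file stays small)
    dset = f.create_dataset('data', shape=(5_000, 4_000_000),
                            maxshape=(5_000, None), chunks=(10, 40_000))
    print(dset.chunks)  # (10, 40000)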
Larger chunks improve read performance, but speed with the default values is pretty fast. (Chunk size has a very small effect on write performance.) Output for all 3 tests below.
dataset chunkshape: (40, 625)
Time to read first row: 0.28
Time to read last row: 0.28
dataset chunkshape: (10, 15625)
Time to read first row: 0.05
Time to read last row: 0.06
dataset chunkshape: (10, 40000)
Time to read first row: 0.00
Time to read last row: 0.02
Code to create my test file below:
import time
import h5py
import numpy as np

fname = 'SO_test.h5'  # filename added so the snippet runs; any name works

with h5py.File(fname, 'w') as fout:
    blocksize = 40_000
    n_blocks = 100    # use n_blocks=750 to test 30_000_000 columns
    n_rows = 5_000
    block_to_write = np.random.random((n_rows, blocksize))
    start = time.time()
    for cnt in range(n_blocks):
        incr = time.time()
        print(f'Working on loop: {cnt}', end='')
        if "data" not in fout:
            # First pass: create the dataset; maxshape=(n_rows, None)
            # enables chunking with h5py's default chunk shape
            fout.create_dataset("data", shape=(n_rows, blocksize),
                                maxshape=(n_rows, None))  # , chunks=(10, blocksize))
        else:
            # Later passes: extend the dataset by one block of columns
            fout["data"].resize((fout["data"].shape[1] + blocksize), axis=1)
        fout["data"][:, cnt*blocksize:(cnt+1)*blocksize] = block_to_write
        print(f' - Time to add block: {time.time()-incr:.2f}')

print(f'Done creating file: {fname}')
print(f'Time to create {n_blocks}x{blocksize:,} columns: {time.time()-start:.2f}\n')
Code to read 2 different arrays from the test file below:
import time
import h5py

fname = 'SO_test.h5'  # same filename used when creating the file

with h5py.File(fname, 'r') as fin:
    print(f'dataset shape: {fin["data"].shape}')
    print(f'dataset chunkshape: {fin["data"].chunks}')
    start = time.time()
    data = fin["data"][0, :]    # read the first row
    print(f'Time to read first row: {time.time()-start:.2f}')
    start = time.time()
    data = fin["data"][-1, :]   # read the last row
    print(f'Time to read last row: {time.time()-start:.2f}')