If you implement chunking correctly, writing this much data should be "relatively fast" (minutes, not hours). Without chunking, this could take a very long time.
To demonstrate how to use chunking (and provide timing benchmarks), I wrote a short code segment that populates a dataset with some random data of type np.float32. I create and write the data incrementally because I don't have enough RAM to store an array of shape (16_813, 60, 44, 257) in memory.
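For reference, a quick back-of-the-envelope calculation (a minimal sketch, nothing h5py-specific) shows why the full array won't fit in RAM:

import math
import numpy as np

shape = (16_813, 60, 44, 257)
nbytes = math.prod(shape) * np.dtype(np.float32).itemsize
print(f'Full array needs {nbytes/1024**3:.1f} GiB of RAM')   # ~42.5 GiB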
Answer updated on 2022-04-04: This update addresses the code posted in comments on 2022-04-02. I modified my example to write data with shape=(1,60,44,257) instead of shape=(16_813,60,44,1). I think this matches the array shape you are writing. I also modified the chunk shape to match, and added variables to define the data array and chunk shapes (to simplify benchmark runs for different chunk and data I/O sizes). I ran tests for 3 combinations:
- Test 1: arr shape=(1,60,44,257) and chunks=(1,60,44,257) [2.58 MiB]; runs in 379 sec (6m 19s)
- Test 2: arr shape=(1,60,44,257) and chunks=(100,60,44,257) [258 MiB]; runs in 949 sec (15m 49s)
- Test 3: arr shape=(43,60,44,257) with nloops=391 and chunks=(1,60,44,257); runs in 377 sec (6m 17s)
Tests 1 and 2 show the influence of chunk size on performance. The h5py docs recommend keeping the chunk size between 10 KiB and 1 MiB -- larger for larger datasets. Ref: h5py Chunked Storage. You can see performance degrades significantly in test 2 with the 258 MiB chunk size. This might account for some of your problem, but should not cause your system to freeze after writing 5 GB of data (IMHO).
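If you want to sanity-check a chunk shape against that guideline, the chunk size in bytes is simply the product of the chunk dimensions times the item size. A minimal sketch (the helper function is mine, not part of h5py):

import numpy as np

def chunk_size_mib(chunks, dtype=np.float32):
    # size of one chunk in MiB = product of chunk dims * bytes per element
    nbytes = np.dtype(dtype).itemsize
    for dim in chunks:
        nbytes *= dim
    return nbytes / 1024**2

print(chunk_size_mib((1, 60, 44, 257)))    # ~2.6 MiB (chunk shape in tests 1 and 3)
print(chunk_size_mib((100, 60, 44, 257)))  # ~259 MiB (chunk shape in test 2)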
Tests 1 and 3 show the influence of write array size on performance. I have found that write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. In this case, you can see performance is not affected by increasing the write array size. In other words, writing 1 row at a time does not hurt performance.
Note: I did not add compression. Compression reduces the on-disk file size, but increases I/O time to compress/uncompress the data on the fly. The created file size is 42.7 GB.
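For completeness, enabling compression in h5py is a one-line change to create_dataset(). A minimal sketch using the variable names from the code below (gzip level 4 is an arbitrary choice; I did not benchmark it):

ds = h5f.create_dataset('X_train', shape=(ds_a0, ds_a1, ds_a2, ds_a3),
                        chunks=(ch_a0, ch_a1, ch_a2, ch_a3), dtype=np.float32,
                        compression='gzip', compression_opts=4)  # compression is applied per chunk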
Tests were run on an old Windows system with 24 GB of RAM and a mechanical HDD (6 Gbps @ 7200 rpm). You should get much faster times with an SSD.
Updated code below:
import time
import numpy as np
import h5py

# dimensions of dataset
ds_a0, ds_a1, ds_a2, ds_a3 = 16_813, 60, 44, 257
# dimensions of chunk shape
ch_a0, ch_a1, ch_a2, ch_a3 = 1, ds_a1, ds_a2, ds_a3
# dimensions of data array written per loop
ar_a0, ar_a1, ar_a2, ar_a3 = 1, ds_a1, ds_a2, ds_a3
nloops = 16_813  # ar_a0 * nloops must equal ds_a0

with h5py.File('sequences.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(ds_a0, ds_a1, ds_a2, ds_a3),
                            chunks=(ch_a0, ch_a1, ch_a2, ch_a3), dtype=np.float32)
    start = time.time()
    r_cnt = 0
    incr = time.time()
    for i in range(nloops):
        # create a block of random data and write it to the next ar_a0 rows
        arr = np.random.random(ar_a0*ar_a1*ar_a2*ar_a3).astype(np.float32).reshape(ar_a0, ar_a1, ar_a2, ar_a3)
        ds[r_cnt:r_cnt+ar_a0, :, :, :] = arr
        r_cnt += ar_a0
        if (i+1) % 100 == 0 or i+1 == nloops:
            print(f'Time for 100 loops after loop {i+1}: {time.time()-incr:.3f}')
            incr = time.time()
    print(f'\nTotal time: {time.time()-start:.2f}')
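To reproduce test 3 above (43 rows written per loop), only the data array size and loop count change:

# test 3 configuration: 43 rows per write, 391 loops (43 * 391 = 16_813)
ar_a0, ar_a1, ar_a2, ar_a3 = 43, ds_a1, ds_a2, ds_a3
nloops = 391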