
After preparing data from a dataset, I want to save the prepared data using h5py. The data is a float32 numpy array of shape (16813, 60, 44, 257). Preparing the data is very fast, only a few seconds to prepare 13GB of data. But when I try to write the data to disk (500 MB/s SSD) using h5py it gets very slow (I waited for hours) and it even freezes/crashes the computer.

import h5py

# X_train and Y_train are float32 numpy arrays prepared in memory
hf = h5py.File('sequences.h5', 'a')
hf.create_dataset('X_train', data=X_train)
hf.create_dataset('Y_train', data=Y_train)
hf.close()

I calculated that the data in memory should be around 160GB. Why is it so slow? I tried multiple things like compression, chunking, predefining the shape, and writing while preparing.

frankatank
  • How big is the array (in GB)? First you say 13GB, then later 160GB. I estimate 42.5 GB, which is consistent with the error message I got when I tried to create an array with that size and type -- `arr = np.empty((16813, 60, 44, 257), dtype=np.float32)` Error message: `Unable to allocate 42.5 GiB for array` (I only have 24GB on my system) – kcw78 Mar 31 '22 at 18:05
  • Yeah, sorry, I calculated it wrong: before preprocessing it is 13GB, after it should be 42.5GB. I can create that array but not save it. – frankatank Apr 01 '22 at 19:04
  • Ok, thanks for confirming. Test my example below. It should run in 5 minutes (assuming your system is faster than mine). Assuming it is fast enough, test your code with the same chunk size I used. If that still doesn't help, also write slices of your large array (same as my procedure). – kcw78 Apr 01 '22 at 19:42
  • I will test your example later. I am astounded that I can even allocate that array at once. I have 8GB of RAM, and allocating that array takes about 30 seconds; that doesn't make sense. I load data from a .h5 file (13GB of audio data) and prepare it into the 42GB array. – frankatank Apr 01 '22 at 20:30
  • Good question about accessing the data. Hard to say without seeing the code. I suspect you are creating h5py dataset objects, and NOT numpy arrays. The objects behave like arrays, but have a much smaller memory footprint. BTW, how does 13GB of audio data grow to 42GB? – kcw78 Apr 02 '22 at 01:46
  • That's a good point, I'm creating an h5py object there; maybe that is why it takes so long to write? I tried out chunking, but the system still almost crashes (Windows and Linux). I slice the audio spectrogram into overlapping patches; that is how I get 42GB (16813 sequences of 60 patches, each patch of size 44x257). – frankatank Apr 02 '22 at 08:51
  • That makes sense. With an h5py object, your process has to read the slice of the audio data from the 1st file into memory, then write it to the 2nd file. And since they are overlapping slices, I suspect you read some data repeatedly. Do you know if the 1st file is chunked? If so, does the chunk shape "match" the slice shape? If it's not "yes" to both, your bottleneck could be reading the data. – kcw78 Apr 02 '22 at 12:53
  • The first file is not chunked. Reading and slicing takes about 20 seconds. I can also train on the files, but not write them. And no, I don't read repeatedly; I read once into memory and then slice. – frankatank Apr 02 '22 at 13:12
  • It is hard to comment further. Remember: you _**do not read**_ the data into memory when you create an h5py object (`ds = h5file['dataset']` -- `ds` is an object, not an array). That's why you can read a 13GB file when you only have 8GB RAM. The data is read when you access a slice, as `h5file['dataset'][a_slice]` or `ds[a_slice]` (see the short sketch after these comments). Clearly you have a bottleneck somewhere. You will have to benchmark each phase independently: reading, slicing, writing. Run my example to test write performance. – kcw78 Apr 02 '22 at 15:28
  • I tried your example, see:

    h5f = h5py.File('sequences.h5', 'w')
    X_seq = h5f.create_dataset('X_train', shape=(16813, 60, 44, 257), chunks=(100, 60, 44, 257), dtype=np.float32)
    for species in tqdm(list(labels)):
        S_db = prepared_set.get(species)
        seq = getSequences(S_db)
        for i, s in enumerate(seq):
            X_seq[i] = s
    h5f.close()

    – frankatank Apr 02 '22 at 16:57
  • But still, after 5GB of writing it totally freezes (I assume RAM is full). – frankatank Apr 02 '22 at 17:00
  • Chunk shape is large (258MB). That _**might**_ be part of your problem (but the system shouldn't freeze). I updated my answer to resemble your process. I suggest running Test 1 to confirm it works (or not). With that info, you can determine where to focus your attention. – kcw78 Apr 04 '22 at 14:06
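To make the dataset-object vs. numpy-array distinction from the comments concrete, here is a minimal sketch. The file name audio.h5 and the dataset name spectrograms are placeholders, not names from the question:

import h5py

# Opening a dataset returns a lightweight h5py Dataset object; no data is read yet.
with h5py.File('audio.h5', 'r') as h5f:   # placeholder name for the 13GB source file
    ds = h5f['spectrograms']              # placeholder dataset name; this is an object, not an array
    print(type(ds), ds.shape, ds.dtype)   # metadata only, no bulk I/O
    first_rows = ds[0:10]                 # slicing triggers the actual disk read (only these rows)
    # full_arr = ds[:]                    # this would read the whole dataset into RAM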

1 Answer


If you implement chunking correctly, writing this much data should be "relatively fast" (minutes, not hours). Without chunking, this could take a very long time.

To demonstrate how to use chunking (and provide timing benchmarks), I wrote a short code segment that populates a dataset with some random data (of type np.float32). I create and write the data incrementally because I don't have enough RAM to store an array of size (16_813, 60, 44, 257) in memory.

Answer updated on 2022-04-04: This update addresses the code posted in comments on 2022-04-02. I modified my example to write data with shape=(1,60,44,257) instead of shape=(16_813,60,44,1). I think this matches the array shape you are writing. I also modified the chunk shape to match, and added variables to define the data array and chunk shapes (to simplify benchmarking runs for different chunk and data I/O sizes). I ran tests for 3 combinations:

  1. arr shape=(1,60,44,257) and chunks=(1,60,44,257) [2.58MB]; runs in 379 sec (6m 19s)
  2. arr shape=(1,60,44,257) and chunks=(100,60,44,257) [258MB]; runs in 949 sec (15m 49s)
  3. arr shape=(43,60,44,257); nloops=391 and chunks=(1,60,44,257); runs in 377 sec (6m 17s)

Tests 1 and 2 show the influence of chunk size on performance. The h5py docs recommend keeping chunk size between 10 KB and 1 MB -- larger for larger datasets. Ref: h5py Chunked Storage. You can see that performance degrades significantly in Test 2 with the 258MB chunk size. This might account for some of your problem, but should not cause your system to freeze after writing 5GB of data (IMHO).
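If you are unsure what chunk shape to pick, one option (not part of the timed tests above) is to let h5py choose automatically with chunks=True and inspect the result. A minimal sketch with a throwaway file name:

import h5py
import numpy as np

# Let h5py pick a chunk shape automatically, then inspect it.
# No data is written, so this runs quickly and the file stays small.
with h5py.File('chunk_probe.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(16_813, 60, 44, 257),
                            dtype=np.float32, chunks=True)   # auto-chunking
    chunk_mb = np.prod(ds.chunks) * np.dtype(np.float32).itemsize / 1e6
    print(f'auto chunk shape: {ds.chunks}, ~{chunk_mb:.2f} MB per chunk')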

Tests 1 and 3 show the influence of write array size on performance. I have found that write performance degrades when I/O data blocks are "too small". Ref: pytables writes much faster than h5py. In this case, you can see that performance is not affected by increasing the write array size. In other words, writing 1 row at a time does not hurt performance.

Note: I did not add compression. Compression reduces the on-disk file size, but increases the I/O time to compress/uncompress the data on the fly. The created (uncompressed) file size is 42.7 GB.
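If a smaller file matters more than write speed, compression is set per dataset at creation time. A minimal sketch (gzip level 4 and the shuffle filter are just reasonable starting points, not tuned values):

import h5py
import numpy as np

# Same dataset layout as the benchmark below, with per-chunk gzip compression enabled.
with h5py.File('sequences_gzip.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(16_813, 60, 44, 257),
                            chunks=(1, 60, 44, 257), dtype=np.float32,
                            compression='gzip', compression_opts=4,
                            shuffle=True)   # byte-shuffle often helps float data compress better
    ds[0] = np.random.random((60, 44, 257)).astype(np.float32)   # write one row, same pattern as the benchmark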

Tests were run on an old Windows system with 24GB RAM and a mechanical HDD (6 Gbps @ 7200 rpm). You should get much faster times with an SSD.

Updated code below:

import time

import h5py
import numpy as np

# dimensions of dataset
ds_a0, ds_a1, ds_a2, ds_a3 = 16_813, 60, 44, 257
# dimensions of chunk shape 
ch_a0, ch_a1, ch_a2, ch_a3 = 1, ds_a1, ds_a2, ds_a3
# dimensions of data array 
ar_a0, ar_a1, ar_a2, ar_a3 = 1, ds_a1, ds_a2, ds_a3
nloops = 16_813

with h5py.File('sequences.h5', 'w') as h5f:
    ds = h5f.create_dataset('X_train', shape=(ds_a0,ds_a1,ds_a2,ds_a3),
                            chunks=(ch_a0,ch_a1,ch_a2,ch_a3), dtype=np.float32)   
    start = time.time()
    r_cnt = 0
    incr = time.time()
    for i in range(nloops):
        arr = np.random.random(ar_a0*ar_a1*ar_a2*ar_a3).astype(np.float32).reshape(ar_a0,ar_a1,ar_a2,ar_a3)            
        ds[r_cnt:r_cnt+ar_a0,:,:,:] = arr
        r_cnt += ar_a0
        if (i+1)%100 == 0 or i+1 == nloops:
            print(f'Time for 100 loops after loop {i+1}: {time.time()-incr:.3f}')
            incr = time.time()            
        
    print(f'\nTotal time: {time.time()-start:.2f}')
kcw78