
I'm trying to save bottleneck values to a newly created HDF5 file. The bottleneck values come in batches of shape (120, 10, 10, 2048). Saving a single batch alone is taking up more than 16 gigs, and Python seems to freeze at that one batch. Based on recent findings (see the update below), it seems that HDF5 taking up large amounts of memory is okay, but the freezing part seems to be a glitch.

I'm only trying to save the first 2 batches for test purposes, and only for the training data set (once again, this is a test run), but I can't even get past the first batch. It just stalls at the first batch and doesn't loop to the next iteration. If I try to check the HDF5 file, Explorer gets sluggish and Python freezes. If I try to kill Python (even without checking the HDF5 file), Python doesn't close properly and forces a restart.

Here is the relevant code and data:

There are about 90,000 data points in total, delivered in batches of 120.

Bottleneck shape is (120, 10, 10, 2048)

So the first batch I'm trying to save is (120, 10, 10, 2048)
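As a quick sanity check on the sizes involved (my own back-of-the-envelope math, assuming float32 at 4 bytes per value):

# Sanity math only -- not part of the saving code
batch_bytes = 120 * 10 * 10 * 2048 * 4       # one batch: ~94 MB
total_bytes = 90827 * 10 * 10 * 2048 * 4     # full train_bottle dataset: ~69 GiB
print(batch_bytes / 2**20, 'MiB per batch')
print(total_bytes / 2**30, 'GiB total')

(That ~69 GiB figure matches the memory usage I describe in the update below.)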

Here is how I tried to save the dataset:

with h5py.File(hdf5_path, mode='w') as hdf5:
    hdf5.create_dataset("train_bottle", train_shape, np.float32)
    hdf5.create_dataset("train_labels", (len(train.filenames), params['bottle_labels']), np.uint8)
    hdf5.create_dataset("validation_bottle", validation_shape, np.float32)
    hdf5.create_dataset("validation_labels", (len(valid.filenames), params['bottle_labels']), np.uint8)

    # this first part above works fine

    current_iteration = 0
    print('created_datasets')
    for x, y in train:
        number_of_examples = len(train.filenames)  # number of images
        prediction = model.predict(x)
        labels = y
        print(prediction.shape)  # (120, 10, 10, 2048)
        print(y.shape)  # (120, 12)
        print('start', current_iteration * params['batch_size'])  # 0
        print('end', (current_iteration + 1) * params['batch_size'])  # 120

        hdf5["train_bottle"][current_iteration * params['batch_size']:(current_iteration + 1) * params['batch_size'], ...] = prediction
        hdf5["train_labels"][current_iteration * params['batch_size']:(current_iteration + 1) * params['batch_size'], ...] = labels
        current_iteration += 1
        print(current_iteration)
        if current_iteration == 3:
            break

This is the output of the print statements:

(90827, 10, 10, 2048) # print(train_shape)

(6831, 10, 10, 2048)  # print(validation_shape)
created_datasets
(120, 10, 10, 2048)  # print(prediction.shape)
(120, 12)           #label.shape
start 0             #start of batch
end 120             #end of batch

# Just stalls here instead of printing `print(current_iteration)`

It just stalls here for a while (20+ minutes), and the HDF5 file slowly grows in size (around 20 gigs now, before I force-kill). Actually, I can't even force-kill it with Task Manager; I have to restart the OS to actually kill Python in this case.

Update

After playing around with my code for a bit, there seems to be a strange bug/behavior.

The relevant part is here:

          hdf5["train_bottle"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = prediction
                hdf5["train_labels"][current_iteration*params['batch_size']: (current_iteration+1) * params['batch_size'],...] = labels

If I run either of these lines on its own, my script goes through the iterations and automatically breaks as expected. So there is no freeze if I run either one alone. It also happens fairly quickly -- in less than a minute.

If I run only the first line ('train_bottle'), my memory usage sits at about 69-72 gigs, even if it's only a couple of batches. If I try more batches, the memory usage is the same. So I'm assuming train_bottle allocates storage based on the size parameters I'm assigning to the dataset, not based on when it actually gets filled. So despite the 72 gigs, it runs fairly quickly (about one minute).

If I run only the second line, train_labels, my memory usage is just a few megabytes. There is no problem with the iterations, and the break statement is executed.

However, here is the problem: if I try to run both lines (which in my case is necessary, as I need to save both 'train_bottle' and 'train_labels'), I experience a freeze on the first iteration, and it doesn't continue to the second iteration, even after 20 minutes. The HDF5 file is slowly growing, but if I try to access it, Windows Explorer slows to a snail's pace and I can't close Python -- I have to restart the OS.

So I'm not sure what the problem is when running both lines -- whereas if I run only the memory-hungry train_bottle line, it works perfectly and finishes within a minute.

Moondra
  • I don't know where the estimate of `16GB` comes from, but I think it's a wrong assumption. A single batch needs `120 * 10 * 10 * 2048 * 4 bytes`, which is approximately `94MB`. So a full dataset which you want to save has `94 * 90000 MB`, which is approximately `9TB`. This is where your error comes from. – Marcin Możejko Feb 07 '18 at 20:35
  • Thanks for the reply. It's actually 90,000 images in total, so the batches would be (90000/120) = 750 * 94 MB. Which should be 7.5 gigs? However, I'm only trying to save the first two batches, which should be 94 * 2. As for the estimates, I'm actually checking the file every 30 seconds or so manually, and I keep seeing it increase to those gigs. I can't figure out if there is a bug in my code that is causing this. I am using an external hard drive and wonder if that is causing the problem (too slow?). My internal hard drive is nearly full, and I would have to find things to delete to test it. – Moondra Feb 07 '18 at 22:04
  • Dude - `750 * 94MB = 69 GB`, not `6.9GB` :D – Marcin Możejko Feb 07 '18 at 22:18
  • After how long do you reach the 20GB? – Pierre de Buyl Feb 08 '18 at 08:53
  • @MarcinMożejko Ah, you're right, but I'm only trying to save the first two-three batches. And it doesn't even get through the first batch, accumulating around 20 gigs. The strange thing is, if I omit the `['train_bottle']` line and just run the `['train_labels']` line, it will get through the first few batches and break as intended (pretty quickly as well). – Moondra Feb 08 '18 at 17:03
  • @PierredeBuyl It takes around 30 min or so. – Moondra Feb 08 '18 at 17:04
  • At least this means that the process is not limited by the write speed (unless your disk is limited to about 11MB/s which is not much). – Pierre de Buyl Feb 08 '18 at 19:12
  • Have a look at https://stackoverflow.com/a/48954998/4045774 . You have to think about chunkshape, chunk cache and compression here. Your access pattern (writing AND reading) is also of high importance for giving a correct and performant answer. – max9111 Feb 26 '18 at 15:46

3 Answers


Writing Data to HDF5

If you write to a chunked dataset without specifying a chunk shape, h5py will choose one automatically for you. Since h5py can't know how you want to write or read the data in the dataset, this will often result in bad performance.
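For example, you can let h5py auto-chunk and inspect what it picked (a minimal sketch; the file name is arbitrary):

import numpy as np
import h5py as h5

with h5.File('auto_chunks.h5', 'w') as f:
    dset = f.create_dataset('x', shape=(90827, 10, 10, 2048),
                            dtype=np.float32, chunks=True)  # chunks=True -> h5py picks a chunk shape
    print(dset.chunks)  # the auto-chosen shape is rarely ideal for batch-wise access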

You are also using the default chunk-cache size of 1 MB. If you only write to part of a chunk and the chunk doesn't fit in the cache (which is very likely with a 1 MB chunk-cache size), the whole chunk will be read into memory, modified and written back to disk. If that happens multiple times, you will see performance far below the sequential IO speed of your HDD/SSD.
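As a side note, newer h5py versions (2.9 and later) let you set the chunk cache directly when opening a file, without the h5py_cache package used below (a sketch, assuming such a version is available):

import h5py as h5

f = h5.File('Test.h5', 'w',
            rdcc_nbytes=1024**2 * 200,  # 200 MB chunk cache
            rdcc_nslots=100003)         # number of hash slots; a large prime number works well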

In the following example I assume that you only read and write along the first dimension. If not, this has to be adapted to your needs.

import numpy as np
import tables  # not used directly, but importing it registers the blosc filter
import h5py as h5
import h5py_cache as h5c
import time

batch_size = 120
train_shape = (90827, 10, 10, 2048)
hdf5_path = 'Test.h5'
# As we are writing whole chunks here, this isn't really needed, but if you
# forget to set a large enough chunk-cache-size when not writing or reading
# whole chunks, the performance will be extremely bad (chunks can only be
# read or written as a whole).
f = h5c.File(hdf5_path, 'w', chunk_cache_mem_size=1024**2 * 200)  # 200 MB cache size
dset_train_bottle = f.create_dataset("train_bottle", shape=train_shape, dtype=np.float32,
                                     chunks=(10, 10, 10, 2048),  # one chunk spans 10 images
                                     compression=32001,  # blosc filter ID
                                     compression_opts=(0, 0, 0, 0, 9, 1, 1),  # blosc: level 9, shuffle, lz4
                                     shuffle=False)
prediction = np.array(np.arange(120 * 10 * 10 * 2048), np.float32).reshape(120, 10, 10, 2048)
t1 = time.time()
# Testing with 2 GB of data
for i in range(20):
    # prediction=np.array(np.arange(120*10*10*2048),np.float32).reshape(120,10,10,2048)
    dset_train_bottle[i * batch_size:(i + 1) * batch_size, :, :, :] = prediction

f.close()
elapsed = time.time() - t1
print(elapsed)
print("MB/s: " + str(2000 / elapsed))

Edit: The data creation in the loop took quite a lot of time, so now I create the data before the time measurement.

This should give at least 900 MB/s throughput (CPU limited). With real data and lower compression ratios, you should easily reach the sequential IO speed of your hard disk.

Opening an HDF5 file with the with statement can also lead to bad performance if you make the mistake of calling such a block multiple times. That would close and reopen the file, deleting the chunk cache.
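A sketch of the pattern to avoid (the loop bound and get_batch helper are placeholders):

import h5py as h5

# Anti-pattern: the file is closed and reopened on every iteration,
# so the chunk cache is thrown away each time
for i in range(number_of_batches):
    with h5.File('Test.h5', 'a') as f:
        f["train_bottle"][i * 120:(i + 1) * 120, ...] = get_batch(i)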

For determining the right chunk size, I would also recommend: https://stackoverflow.com/a/48405220/4045774 https://stackoverflow.com/a/44961222/4045774

max9111
  • It seems to be working. I need to run a few more tests just to make sure (hopefully by tomorrow). Thank you so much for your detailed post. This is the first time I'm reading about chunked datasets. I found this link which explains what `chunks` are: https://support.hdfgroup.org/HDF5/doc/_topic/Chunking/ I will try to read up on it after I take care of a few things. If you have any other links you recommend, I would appreciate it, as chunking is something I'm not too familiar with. – Moondra Feb 27 '18 at 02:38
  • Thank you. It works after running a few different tests. – Moondra Feb 27 '18 at 21:03
  • Do you reach the sequential IO speed of your storage device? If not, the solution isn't optimal. – max9111 Feb 27 '18 at 21:09
  • I will have to test again via time.time on the dummy set, but it was pretty quick for 4 gigs. I have to check the specs of my external storage, but I think it's 7200 rpm, so 80-160 MB/s should be the norm? You feel I could get around 500 MB/s? – Moondra Feb 27 '18 at 21:29
  • That depends on the compression ratio of the actual data. I think you should get 80-160 MB/s * compression ratio. I ran into a processor limit at about 500 MB/s using the example above, but this was on a Core i5 3210M. – max9111 Feb 27 '18 at 21:35
  • Please note also that this isn't the best you can get. The compression filter is only single-threaded, and maybe pytables and the blosc filter aren't compiled with AVX2 enabled. The HDF5 filter pipeline can also be a bit slow. For compression and decompression speeds that are achievable, take a look at https://github.com/Blosc/python-blosc – max9111 Feb 27 '18 at 21:47
  • max9111, thanks for your test code here, but I found this line couldn't run on my platform: dset_train_bottle = f.create_dataset("train_bottle", shape=train_shape, dtype=np.float32, chunks=(10, 10, 10, 2048), compression=32001, compression_opts=(0, 0, 0, 0, 9, 1, 1), shuffle=False). After I changed "compression=32001" to "compression=32000", it could work; I don't know why. – Clock ZHONG Mar 02 '18 at 03:19
  • You have tested with the much slower lzf filter. https://support.hdfgroup.org/services/filters.html The updated version gives me 190 MB/s with lzf and 950 MB/s with blosc/lz4. The `import tables` is important even if it is not directly used in the code; it registers the compression filter. – max9111 Mar 02 '18 at 08:59

If you have enough DDR memory and want extremely fast data loading and saving performance, use np.load() and np.save() directly: https://stackoverflow.com/a/49046312/2018567 np.load() and np.save() give you the fastest data loading and saving performance I have found; so far, I couldn't find any other tool or framework that could compete with them -- even HDF5's performance is only 1/5 to 1/7 of it.
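A minimal sketch of that route (the array contents and file name are placeholders):

import numpy as np

prediction = np.zeros((120, 10, 10, 2048), dtype=np.float32)  # placeholder batch
np.save('train_bottle_batch_0.npy', prediction)   # one .npy file per batch
restored = np.load('train_bottle_batch_0.npy')
assert restored.shape == (120, 10, 10, 2048)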

Clock ZHONG
  • Please note that the only way to outperform the solution shown above is to use a PCIe SSD. Even then you have to compare a compact dataset with np.save and np.load to be consistent, e.g. f = h5.File(hdf5_path, 'w'); f.create_dataset("my_dataset", data=numpy_array); f.close(). With this example I get the full bandwidth of my SATA3 SSD (about MB/s) with almost no CPU usage. But most of the time this isn't recommendable, because you lose almost all the advantages of HDF5 (writing or reading only parts of a file, compression). – max9111 Mar 05 '18 at 12:44
  • max9111, we needn't argue about which is faster, np.load()/np.save() or HDF5; just replace your HDF5 calls with np.save(). My test result shows 2.3 GB/s (18 Gbps) bandwidth with it, which is more than 8 times the HDF5 performance. I believe your computer is much faster than mine, so 4-5 GB/s is possible. Please try it: just replace dset_train_bottle() with np.save() and let us know your test result. It's not a big work effort. – Clock ZHONG Mar 06 '18 at 16:30

This answer is more of a comment on the argument between @max9111 and @Clock ZHONG. I wrote it to help other people wondering which is faster, HDF5 or np.save().

I used the code provided by @max9111 and modified it as suggested by @Clock ZHONG. The exact Jupyter notebook can be found at https://github.com/wornbb/save_speed_test.

In short, with my specs:

  • SSD: Samsung 960 EVO
  • CPU: i7-7700K
  • RAM: 2133 MHz 16GB
  • OS: Win 10

HDF5 achieves 1339.5 MB/s while np.save is only 924.9 MB/s (without compression).

Also, as noted by @Clock ZHONG, he/she had a problem with the lzf filter. If you also have this problem, the posted Jupyter notebook can be run with the conda distribution of Python 3 with pip-installed packages on Windows 10.

Yi Shen
  • The best way to save/load large arrays depends on various factors (most importantly achievable compression ratios). In many cases it is possible to outperform HDF5 (which has only single-threaded compression filters) by a large margin, e.g. https://stackoverflow.com/a/56761075/4045774 . Also, the throughput can vary quite a bit. Which SSD is used? Is it full or empty? How large is the array? (Many SSDs have a fast SLC cache.) – max9111 Aug 13 '19 at 14:18