
I have a (22500, 516, 516) uint16 h5py dataset whose contents I want to replace piece by piece after processing the data.

To do that, I load blocks of the data in the following way (the dataset's chunk shape is (1, 129, 129)):

chunk = data[:,
         i1*129:(i1+1)*129,
         i2*129:(i2+1)*129].astype(pl.float32)

where data is the dataset and i1, i2 are indices that both run from 0 to 3 in a nested loop.

Later in the loop I write the processed data:

data[:,
     i1*129:(i1+1)*129,
     i2*129:(i2+1)*129] = chunk.astype(pl.uint16)

Here I experience a very long delay: the process becomes uninterruptible (state D) with 0% CPU load, while memory usage stays around 1%. What's more, other ssh sessions to this PC, or to servers that have the same drive mounted, barely respond; everything seems frozen for some time.
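For reference, the whole loop looks roughly like this (process stands in for the actual processing, the file name and dataset path are placeholders, and pl is assumed to be a numpy-compatible alias such as pylab):

import h5py
import numpy as pl   # placeholder: pl is assumed to behave like numpy

def process(chunk):
    # stand-in for the actual processing
    return chunk

with h5py.File("scan.h5", "r+") as f:    # file name is a placeholder
    data = f["entry/data"]               # dataset path is a placeholder
    for i1 in range(4):
        for i2 in range(4):
            # load a (22500, 129, 129) block and convert to float32
            chunk = data[:,
                         i1*129:(i1+1)*129,
                         i2*129:(i2+1)*129].astype(pl.float32)
            chunk = process(chunk)
            # write the processed block back as uint16
            data[:,
                 i1*129:(i1+1)*129,
                 i2*129:(i2+1)*129] = chunk.astype(pl.uint16)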

However, if I create a new dataset before the loop

datanew = entry.create_dataset("data_new",
                             shape=data.shape,
                             chunks=data.chunks,
                             dtype=data.dtype,
                             compression="gzip",
                             compression_opts=4)

and write to this dataset instead, I don't experience any problems and the performance is quite good.
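i.e. the only change inside the loop is the write target:

datanew[:,
        i1*129:(i1+1)*129,
        i2*129:(i2+1)*129] = chunk.astype(pl.uint16)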

The only difference is the compression: the original dataset uses lzf, while the new one uses gzip.

Is there any way to understand what is wrong here?

Thanks

dnalow

1 Answer


On which storage device is your HDF5 file (local SSD/hard disk or NAS)?

Maybe you are running into problems due to file fragmentation. Chunks are normally read and written sequentially.

If you overwrite a compressed chunk with a bigger compressed chunk, which can happen with compressed datasets, chunks may end up fragmented on disk. The performance effect will depend on the latency of your storage device (NAS >> local hard drive >> SSD).

If you see this effect, I would recommend the following: write the processed data to a temporary dataset (or a separate file), delete the original dataset, and then copy the temporary data back, as sketched below. That way the data has to be compressed only once, and the chunks of the copied dataset are written into one contiguous free region, so they don't end up fragmented.
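A rough sketch of that delete-and-copy step, assuming the processed data was already written to a dataset "data_new" in a temporary file tmp.h5 (the file names and the group/dataset names taken from the question are placeholders):

import h5py

# assumption: the processed, compressed data already lives in tmp.h5 as "data_new"
with h5py.File("tmp.h5", "r") as ftmp, h5py.File("original.h5", "r+") as f:
    entry = f["entry"]      # group name taken from the question
    del entry["data"]       # unlink the original, fragmented dataset
    # copy the temporary dataset over in one pass; it is compressed only once
    ftmp.copy("data_new", entry, name="data")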

You may also want to increase your chunk size to get better performance when accessing a file on a storage device with high latency. If you access your dataset only in the way shown above, you could increase the chunk size, for example to (50, 129, 129) or even more. Some simple benchmarks regarding chunk size on different storage devices: https://stackoverflow.com/a/44961222/4045774
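For example, reusing the create_dataset call from the question and only enlarging the chunks (all other arguments unchanged; whether (50, 129, 129) is optimal depends on your access pattern and storage):

# assumption: same layout as in the question, only with larger chunks
datanew = entry.create_dataset("data_new",
                               shape=data.shape,
                               chunks=(50, 129, 129),
                               dtype=data.dtype,
                               compression="gzip",
                               compression_opts=4)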

max9111
  • The system is an HPC node with some kind of high-throughput connection to a shared volume. So the transfer rate is quite high, but I actually don't know about the latency. I guess it's somewhere between NAS and local drive in your categorization. I think you are pointing in the right direction: as the data becomes more 'dense' after processing, it takes more space after compression. I have now written everything to a separate file and later replaced the whole data in the original file like: data[:] = datanew.value. I hope this is a valid approach; I don't want the files to become larger than necessary. – dnalow Jul 21 '17 at 08:20
  • I would really do it in the way I described above (deleting the original dataset and copying the temporary file). You also have to compress the data only once if you copy the dataset. – max9111 Jul 21 '17 at 10:48
  • But this is at the expense of disk space (which will not be freed by deleting, as you pointed out). This way it will take roughly double the space in the end. – dnalow Jul 21 '17 at 15:18
  • If you delete the dataset the space is actually freed, but not given back to the file system. If you create a new dataset, that free space in the file is reused. And if you delete a whole dataset, you get a large contiguous free block in the file, so the chunks should not end up fragmented (or much less so). – max9111 Jul 21 '17 at 15:39
  • Was the result satisfying (is the performance of the dataset copy as fast as your sequential I/O speed)? – max9111 Jul 25 '17 at 11:49
  • I tried different things: writing to a separate file or to a separate dataset. Both were faster, so I guess fragmentation due to the higher data density after processing really is the issue. BTW: when writing to a separate file and subsequently replacing the original data with `data[:] = newdata.value`, I bumped into another issue similar to this one: https://stackoverflow.com/questions/25182199/pytables-writing-error – dnalow Jul 25 '17 at 13:19
  • Some more feedback: Deleting the dataset and writing a new one with comparable size finally resulted in a file roughly 2x as big. So the space was not reused :/ – dnalow Jul 26 '17 at 09:49
  • Thanks for your feedback. Actually the HDF5 library has the ability to reuse file space: https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesFileSpaceMgmtDocs.html Also try the latest lib version: f = h5py.File('name.hdf5', libver='latest') # most modern. Note that H5 files stored with this parameter aren't compatible with older HDF5 libs... – max9111 Jul 26 '17 at 11:25
  • Closing and re-opening the h5 file between delete and write helped!! – dnalow Jul 26 '17 at 18:27