I have a (22500, 516, 516), uint16 h5py dataset that I want to overwrite piece by piece after processing the data.
To do that, I load pieces of the data as follows (the dataset's HDF5 chunk shape is (1, 129, 129)):
    chunk = data[:,
                 i1*129:(i1+1)*129,
                 i2*129:(i2+1)*129].astype(pl.float32)
where data is the dataset and i1, i2 are indices that both run from 0 to 3 in a nested loop.
Later in the loop I write the processed data back:
    data[:,
         i1*129:(i1+1)*129,
         i2*129:(i2+1)*129] = chunk.astype(pl.uint16)
At this point I experience a very long delay: the process becomes uninterruptible (state D) with 0% CPU load, while memory usage stays around 1%. What's more, other SSH sessions to this PC, or to servers that have the same drive mounted, barely respond; everything appears frozen for a while.
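Putting the pieces together, the loop looks roughly like this (a minimal sketch: the file name "scan.h5", the group path "entry/data" and process() are placeholders for the real ones, and np.float32 / np.uint16 stand in for the pl dtypes above):

    import h5py
    import numpy as np

    def process(block):
        # placeholder for the actual processing step
        return block

    # "scan.h5" and "entry/data" are placeholder names
    with h5py.File("scan.h5", "r+") as f:
        data = f["entry/data"]                  # (22500, 516, 516), uint16, lzf
        for i1 in range(4):                     # 4 * 129 = 516
            for i2 in range(4):
                chunk = data[:,
                             i1*129:(i1+1)*129,
                             i2*129:(i2+1)*129].astype(np.float32)
                chunk = process(chunk)
                # this is the write that stalls
                data[:,
                     i1*129:(i1+1)*129,
                     i2*129:(i2+1)*129] = chunk.astype(np.uint16)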
However, if I create a new dataset before the loop
    datanew = entry.create_dataset("data_new",
                                   shape=data.shape,
                                   chunks=data.chunks,
                                   dtype=data.dtype,
                                   compression="gzip",
                                   compression_opts=4)
and write to this dataset instead, I don't experience any problems and the performance is quite good.
The only difference between the two datasets is the compression filter: the original one uses lzf, while the new one uses gzip.
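For reference, the storage properties of the two datasets can be compared directly through h5py (a small sketch; the file and group names are again placeholders):

    import h5py

    # dump the storage layout and filter settings of both datasets;
    # "scan.h5" and "entry" are placeholder names
    with h5py.File("scan.h5", "r") as f:
        for name in ("entry/data", "entry/data_new"):
            ds = f[name]
            print(name, ds.shape, ds.dtype, ds.chunks,
                  ds.compression, ds.compression_opts, ds.shuffle)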
Is there any way to understand what is wrong here?
Thanks