The key to reading data streaming over a high-latency, high-bandwidth network connection is to reduce the number of calls to `read(n)` on the file, because these calls are sequential. HDF5 supports chunked storage: the chunk shape (often loosely called the block size) is fixed when a dataset is created, and an existing file can be rewritten with a different chunk shape using the `h5repack` tool.
The block size is described in the SO question below. To summarize it here: data is stored in chunks of a user-specified shape. For example, a table of shape 1M x 128 could use a chunk shape of 10k x 1, which stores the data in pieces of 10k rows, one column per chunk.
What is the block size in HDF5?
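For instance, here is a minimal sketch of creating such a dataset with h5py (the file name, dataset name, dtype, and shapes are only illustrative):

```python
import h5py

# Create a 1M x 128 dataset stored in chunks of 10_000 rows x 1 column.
# Each chunk can later be fetched with a single read(n) on the file.
with h5py.File("data.h5", "w") as f:
    f.create_dataset(
        "table",
        shape=(1_000_000, 128),
        dtype="f4",
        chunks=(10_000, 1),   # the chunk / block shape
    )
```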
When reading through a Python file-like object (which is typical if the file is accessed over the network), any access to the data triggers roughly half a dozen small header reads, after which the data itself is fetched with one `read(n)` per chunk. Calls to `read(n)` are (unfortunately) sequential, so many small reads are slow over the network. Choosing a chunk size that is reasonable for your use case therefore reduces the number of `read(n)` calls.
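As a sketch of the remote case, assuming h5py 2.9+ (which accepts Python file-like objects) and fsspec to provide one; the URL and dataset name are hypothetical:

```python
import fsspec
import h5py

URL = "https://example.com/data.h5"  # hypothetical remote file

with fsspec.open(URL, "rb") as remote_f:      # Python file-like object over HTTP
    with h5py.File(remote_f, "r") as f:
        dset = f["table"]
        # With (10_000, 1) chunks, reading one column of 10k rows touches
        # a single chunk, i.e. roughly one read(n) for the data itself.
        col = dset[:10_000, 0]
```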
Note that there is often a trade-off here. A chunk shape of 10k x 128 forces all 128 columns to be read; you cannot read just one column with that chunk shape. But a chunk shape of 10k x 1 means that reading all 128 columns results in 128 `read(n)` calls per 10k rows.
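A rough way to reason about the trade-off (this helper is only an illustration: it counts one read per touched chunk and ignores header reads and caching):

```python
import math

def reads_per_query(query_rows, query_cols, chunk_rows, chunk_cols):
    """Approximate number of read(n) calls to cover a contiguous query,
    assuming one read per chunk that the query touches."""
    return math.ceil(query_rows / chunk_rows) * math.ceil(query_cols / chunk_cols)

# 10k x 128 chunks: one read per 10k-row slab, but it always carries all columns.
print(reads_per_query(10_000, 1,   10_000, 128))   # 1
print(reads_per_query(10_000, 128, 10_000, 128))   # 1

# 10k x 1 chunks: one read per column per 10k rows.
print(reads_per_query(10_000, 1,   10_000, 1))     # 1
print(reads_per_query(10_000, 128, 10_000, 1))     # 128
```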
If your data is not chunked efficiently for your purpose, you can repack it (a slow, one-time process that does not change the data, only the storage layout) using `h5repack`.
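For example, a sketch of the `h5repack` invocation (the dataset path `/table` and the chunk shape are placeholders):

```bash
# Rewrite data.h5 so that /table uses 10000 x 1 chunks.
h5repack -l /table:CHUNK=10000x1 data.h5 repacked.h5
```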