
We have access to a multi-gigabyte HDF5 file as it's being written over the course of many minutes. We would like to pull the most recent data written to the file as it becomes available (on a sub-second time frame).

Is there any way to read an HDF5 file as a stream of bytes as they are written?

I see this question (Read HDF5 in streaming in java) w.r.t. Java, which seems to suggest streaming might be possible with lower-level HDF5 tools, but isn't available in that particular Java package.

Of particular note, the h5py Python package has a set of low-level APIs which I'm not familiar enough with to know whether they offer a solution.

https://api.h5py.org/

David Parks
  • Requesting software or libraries is off-topic on StackOverflow – DisappointedByUnaccountableMod May 12 '21 at 21:50
  • I'm not asking for a software library. I'm asking whether it is possible to read the HDF5 format in a real-time streaming manner, in particular in the Python environment, but an answer in any environment would be useful as well. – David Parks May 12 '21 at 21:52
  • Read [ask] - you don’t reference any research or searching you’ve done, and you don’t show code of an honest attempt to solve your problem. As you know, StackOverflow isn’t a code-writing service. – DisappointedByUnaccountableMod May 12 '21 at 21:55
  • I've edited the question to reference a low-level library in h5py that I've been trying to understand well enough to know if it solves my problem. My core problem is that I don't know how to go about reading HDF5 as a real-time stream, nor can I find questions other than the one I have already referenced. I am trying to understand whether it is simply possible to achieve real-time streaming with HDF5 or not. – David Parks May 12 '21 at 21:58
  • StackOverflow is not a free coding service. You're expected to try to solve the problem first. Please update your question to show what you have already tried in a minimal reproducible example. For further information, please see How to Ask, and take the tour – DisappointedByUnaccountableMod May 12 '21 at 21:59
  • In practical terms, the answer is "no". HDF5 is somewhat similar to a mini file system, with subdirectories and files. The directories don't get fixed up until the files are complete. – Tim Roberts May 12 '21 at 22:16
  • @TimRoberts That's a good answer; as long as you're confident in it, I'll accept it. I suspected that would be the case, but I wanted to be sure, especially since the question I referenced seemed to suggest that the maintainer of the HDF5 Java package wanted to make it a feature in the future. – David Parks May 12 '21 at 22:20
  • Is it impossible? No, it's not impossible. Is it practical today? No, it's just not. – Tim Roberts May 12 '21 at 22:21
  • @David Parks, I would check with the developers: The HDF Group is the best source to ask about capabilities for accessing a file simultaneously. They have a forum, with an h5py-specific channel. – kcw78 May 13 '21 at 13:15
  • SWMR (https://docs.h5py.org/en/stable/swmr.html) might be what you want, and can be used from h5py. Though IMO it's a bit of an awkward addition to HDF5, so it might not be what you want, as well. – Thomas K May 28 '21 at 17:38
  • @ThomasK thanks for the reference, that's interesting to see. It looks like it's primarily focused on ensuring a consistent state and synchronization between the producers and consumers. It doesn't look like it explicitly supports streaming read operations though. – David Parks May 28 '21 at 17:53
  • You can have the writer extend a dataset, write some new data into it and flush; then the reader updates, sees there's new data, and reads it. Which is kind of streaming. AFAIK, it doesn't include a way to notify the reader of new data, though - you either have to check on a timer, or implement a notification some other way. – Thomas K May 30 '21 at 17:20

2 Answers


The key to reading data streamed over a high-latency, high-bandwidth network connection is to reduce the number of read(n) calls on the file, since these calls happen sequentially. The relevant HDF5 feature is the chunk size (sometimes called the block size), which is set when the file is created and can be changed later using the h5repack tool.

The chunk size is described in the SO question below. To summarize: data is stored in chunks of a user-specified shape. For example, a table with shape 1M x 128 could have a chunk size of 10k x 1, which stores the data in chunks of 10k values from a single column.

What is the block size in HDF5?
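
For illustration, here is a minimal h5py sketch of the layout described above (the file name, dataset name, and dtype are made up for this example):

    import h5py
    import numpy as np

    # A 1M x 128 table stored in 10k x 1 chunks, so each chunk holds
    # 10,000 rows of a single column.
    with h5py.File("table.h5", "w") as f:
        dset = f.create_dataset(
            "table",
            shape=(1_000_000, 128),
            dtype="f4",
            chunks=(10_000, 1),  # the chunk ("block") size discussed in this answer
        )
        dset[:, 0] = np.arange(1_000_000, dtype="f4")  # fills the 100 chunks of column 0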

When reading data through a Python file-like object (which is typical if you have a network-accessed file), any access to the data will result in about a half dozen small header reads, and then the data reads will be one read(n) call per chunk. Calls to read(n) are (unfortunately) sequential, so many small reads will be slow over the network. Setting the chunk size to something reasonable for your use case therefore reduces the number of read(n) calls.
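
One way to observe this is to hand h5py a Python file-like object that counts the low-level reads. The CountingReader wrapper below is a hypothetical helper, not part of h5py; h5py accepts any file-like object with read/seek/tell methods:

    import io
    import h5py

    class CountingReader(io.RawIOBase):
        """Hypothetical wrapper that counts low-level reads on a local file."""
        def __init__(self, path):
            self._f = open(path, "rb")
            self.read_calls = 0

        def readinto(self, b):
            self.read_calls += 1
            return self._f.readinto(b)

        def readable(self):
            return True

        def seekable(self):
            return True

        def seek(self, offset, whence=io.SEEK_SET):
            return self._f.seek(offset, whence)

        def tell(self):
            return self._f.tell()

    src = CountingReader("table.h5")   # the file from the sketch above
    with h5py.File(src, "r") as f:
        _ = f["table"][:10_000, 0]     # one column, 10k rows = one chunk
    print(src.read_calls)              # a few metadata reads plus ~1 chunk read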

Note that there is often a tradeoff here. Setting a chunk size of 10k x 128 forces all 128 columns to be read together; you can't read just one column with that layout. But setting a chunk size of 10k x 1 means that a read of all 128 columns will result in 128 read(n) calls per 10k rows.
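
Continuing the sketch above, the tradeoff looks roughly like this (chunk counts assume the 1M x 128 table with 10k x 1 chunks):

    import h5py

    with h5py.File("table.h5", "r") as f:
        dset = f["table"]
        one_column = dset[:, 0]        # touches 100 chunks (100 reads)
        full_width = dset[:10_000, :]  # touches 128 chunks (128 reads)

    # chunks=(10_000, 128) would invert the tradeoff: a full-width read of
    # 10k rows touches one chunk, but a single column drags in all 128 columns.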

If your data is not packed efficiently for your purpose, you can repack it using h5repack (a slow, one-time process that doesn't change the data, just the storage layout).
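
For example, repacking to a 10k x 128 chunk layout might look like the following sketch; the input/output paths and dataset name are placeholders, and the -l/--layout option is described in the h5repack documentation:

    import subprocess

    # Rewrite the file with a new chunk layout for the "table" dataset.
    # h5repack writes a new file; the original is left untouched.
    subprocess.run(
        ["h5repack", "-l", "table:CHUNK=10000x128", "in.h5", "out.h5"],
        check=True,
    )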

David Parks

I think what you are asking for is possible with HDF5 SWMR (Single-Writer/Multiple-Reader). The user guide describes how it works, and there is now support in h5py with examples.
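
A minimal sketch of the pattern, assuming one writer process and one reader process sharing a file (the file name, dataset name, and shapes are made up; SWMR requires libver="latest" and a chunked, resizable dataset):

    import time
    import h5py
    import numpy as np

    def writer(path="stream.h5"):
        # Create the file, enable SWMR, then append and flush in a loop.
        with h5py.File(path, "w", libver="latest") as f:
            dset = f.create_dataset("values", shape=(0,), maxshape=(None,),
                                    dtype="f8", chunks=(1024,))
            f.swmr_mode = True             # readers may open the file from here on
            for _ in range(100):
                n = dset.shape[0]
                dset.resize((n + 10,))
                dset[n:] = np.random.random(10)
                dset.flush()               # make the new rows visible to readers
                time.sleep(0.1)

    def reader(path="stream.h5"):
        # Poll for new rows; SWMR itself has no notification mechanism.
        with h5py.File(path, "r", libver="latest", swmr=True) as f:
            dset = f["values"]
            seen = 0
            while seen < 1000:             # the writer above produces 1000 values
                dset.refresh()             # pick up the writer's latest flush
                if dset.shape[0] > seen:
                    print(f"got {dset.shape[0] - seen} new values")
                    seen = dset.shape[0]
                time.sleep(0.1)

As the comments above note, the reader has to poll (or use some out-of-band notification); HDF5 provides no push notification when new data is flushed.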

James Mudd