
I am sampling time series data off various machines, and every so often need to collect a large high frequency burst of data from another device and append it to the time series data.

Imagine I am measuring temperature over time, and on every 10 degree increase in temperature I sample a microcontroller at 200 kHz. I want to be able to tag the large burst of micro data to a timestamp in the time-series data, maybe even in the form of a figure.

I was trying to do this with region references, but am struggling to find an elegant solution. I find myself juggling between pandas HDFStore and h5py, and it just feels messy.

Initially I thought I would be able to make separate datasets from the burst data, then use references or links to timestamps in the time-series data. But no luck so far.

Any way to reference a large packet of data to a timestamp in another pile of data would be appreciated!


Vitalizzare

1 Answer


How did you use region references? I assume you had an array of references, alternating between ranges of "standard rate" and "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.

Virtual Datasets might be a more elegant solution... but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual dataset, you can read it with typical slice notation. HDF5/h5py handles everything under the covers.
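As a quick taste of that, here is a minimal self-contained sketch (the file and dataset names `vds_demo.h5`, `a`, `b`, and `stitched` are invented for illustration): two small source datasets are stitched into one virtual dataset, then read back with ordinary slice notation.

```python
import numpy as np
import h5py

# Create two small source datasets (names invented for the demo)
with h5py.File('vds_demo.h5', 'w') as h5f:
    h5f.create_dataset('a', data=np.arange(4.))
    h5f.create_dataset('b', data=np.arange(4., 8.))

# Map both sources into one 8-row virtual layout
layout = h5py.VirtualLayout(shape=(8,), dtype=float)
layout[0:4] = h5py.VirtualSource('vds_demo.h5', 'a', shape=(4,))
layout[4:8] = h5py.VirtualSource('vds_demo.h5', 'b', shape=(4,))

with h5py.File('vds_demo.h5', 'a') as h5f:
    h5f.create_virtual_dataset('stitched', layout)
    # One slice spanning both sources -- no reference juggling required
    print(h5f['stitched'][2:6])   # -> [2. 3. 4. 5.]
```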

To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:

  1. Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to other datasets.
  2. Define the virtual sources. Each is a reference to an HDF5 file and dataset (1 virtual source per file/dataset combination.)
  3. Map virtual source data to the virtual layout (you can use slice notation, which is shown in my example).
  4. Repeat steps 2 and 3 for all sources (or slices of sources)

Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)

There are at least 3 other SO questions and answers on this topic.

Example follows:
Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log' and each burst is stored in a separate dataset named: 'burst_log_##'.

import numpy as np
import h5py

log_ntimes = 31
log_inc = 1e-3

arr = np.zeros((log_ntimes,2))
for i in range(log_ntimes):
    time = i*log_inc
    arr[i,0] = time
    #temp = 70.+ 100.*time
    #print(f'For Time = {time:.5f} ; Temp= {temp:.4f}')

arr[:,1] = 70.+ 100.*arr[:,0]
#print(arr)

with h5py.File('SO_72654160.h5','w') as h5f:
    h5f.create_dataset('data_log',data=arr)

n_bursts = 3
burst_ntimes = 11
burst_inc = 5e-5

for n in range(1, n_bursts+1):
    arr = np.zeros((burst_ntimes-1,2))
    for i in range(1,burst_ntimes):
        burst_time = 0.01*(n)
        time = burst_time + i*burst_inc
        arr[i-1,0] = time
        #temp = 70.+ 100.*time
    arr[:,1] = 70.+ 100.*arr[:,0]    
    
    with h5py.File('SO_72654160.h5','a') as h5f:
        h5f.create_dataset(f'burst_log_{n:02}',data=arr)

Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates one virtual dataset in a new file, and one in the existing file. (The statements are identical except for the file name and mode.)

source_file = 'SO_72654160.h5'

a0 = 0
with h5py.File(source_file, 'r') as h5f:
    for ds_name in h5f:
        a0 += h5f[ds_name].shape[0]

print(f'Total data rows in source = {a0}')

# alternate getting data from:
#   dataset: data_log, rows 0-11, 11-21, 21-31
#   datasets: burst_log_01, burst_log_02, burst_log_03 (each has 10 rows)

# Define virtual dataset layout
layout = h5py.VirtualLayout(shape=(a0, 2),dtype=float)

# Map virtual dataset to logged data
vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(31,2))
layout[0:11,:] = vsource1[0:11,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10,2))
layout[11:21,:] = vsource2

layout[21:31,:] = vsource1[11:21,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10,2))
layout[31:41,:] = vsource2

layout[41:51,:] = vsource1[21:31,:]
vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10,2))
layout[51:61,:] = vsource2
   
# Create NEW file, then add virtual dataset
with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')

# Open EXISTING file, then add virtual dataset 
with h5py.File('SO_72654160.h5', 'a') as h5vds:
    h5vds.create_virtual_dataset("vdata", layout)
    print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')
kcw78
  • Very clever! Thanks. I'll have to add some magic sauce to work out which indexes to insert each 'burst' at in the virtual dataset, as they will be at odd intervals and most likely different shapes too! But this should keep me busy for a bit, thanks a lot – Ricky Millar Jun 18 '22 at 19:49
  • Yeah, my data was simple, and I "cheated" by hard coding the indices. I think it will be pretty simple to compare times in the standard rate and burst rate files, then interleave as appropriate. Also, I had another idea using a heterogeneous data structure: you could have standard log time and temps in columns 1&2, and object data in column 3 with region references to burst rate data. Not sure if that's better since you still have to bounce between 2 datasets to get all the Time/Temp values. – kcw78 Jun 19 '22 at 16:53
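Following up on the "compare times, then interleave" idea from the comments, here is a sketch that computes the layout mapping automatically with `np.searchsorted` instead of hard-coded indices, so bursts can arrive at odd intervals and with different lengths. The file and dataset names (`interleave_demo.h5`, etc.) and the synthetic data are invented for illustration.

```python
import numpy as np
import h5py

# Synthetic standard-rate log: (time, temp) rows
log = np.array([[t, 70. + 100.*t] for t in np.arange(31) * 1e-3])

# Bursts triggered at arbitrary times, with different lengths
bursts = {0.0105: 7, 0.0204: 12}
bursts = {t0: np.array([[t0 + i*5e-5, 70. + 100.*(t0 + i*5e-5)]
                        for i in range(n)])
          for t0, n in bursts.items()}

with h5py.File('interleave_demo.h5', 'w') as h5f:
    h5f.create_dataset('data_log', data=log)
    for k, (t0, arr) in enumerate(sorted(bursts.items()), start=1):
        h5f.create_dataset(f'burst_log_{k:02}', data=arr)

# Walk the bursts in time order; searchsorted finds how many log rows
# precede each burst, so no indices are hard coded.
n_total = log.shape[0] + sum(a.shape[0] for a in bursts.values())
layout = h5py.VirtualLayout(shape=(n_total, 2), dtype=float)
log_src = h5py.VirtualSource('interleave_demo.h5', 'data_log', shape=log.shape)

out = 0        # next free row in the virtual dataset
log_row = 0    # next unconsumed row of the standard-rate log
for k, (t0, arr) in enumerate(sorted(bursts.items()), start=1):
    cut = np.searchsorted(log[:, 0], t0)   # log rows that precede this burst
    layout[out:out + (cut - log_row), :] = log_src[log_row:cut, :]
    out += cut - log_row
    log_row = cut
    layout[out:out + arr.shape[0], :] = h5py.VirtualSource(
        'interleave_demo.h5', f'burst_log_{k:02}', shape=arr.shape)
    out += arr.shape[0]
layout[out:, :] = log_src[log_row:, :]     # trailing log rows

with h5py.File('interleave_demo.h5', 'a') as h5f:
    h5f.create_virtual_dataset('vdata', layout)
    times = h5f['vdata'][:, 0]
    # All rows present, and timestamps strictly increasing end to end
    print(times.shape[0], bool(np.all(np.diff(times) > 0)))   # -> 50 True
```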