
I currently run a simulation several times and want to save the results of these simulations so that they can be used for visualizations.

The simulation is run 100 times, and each run generates about 1 million data points (i.e. 1 million values for 1 million episodes), which I now want to store efficiently. The goal is then, for each episode, to compute the average value across all 100 simulations.

My main file looks like this:

# Defining the test simulation environment
def test_simulation():
    environment = environment(
        periods=1000000,
        parameter_x=...,
        parameter_y=...,
    )

    # Defining the simulation
    environment.simulation()

    # Save simulation data
    hf = h5py.File('runs/simulation_runs.h5', 'a')
    hf.create_dataset('data', data=environment.value_history, compression='gzip', chunks=True)
    hf.close()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()

The value_history is generated within simulation(), i.e. the values are continuously appended to an initially empty list according to:

def simulation(self):
    for episode in range(self.periods):
        value = doSomething()
        self.value_history.append(value)

Now I get the following error message when going to the next simulation:

ValueError: Unable to create dataset (name already exists)

I am aware that the current code keeps trying to create a new file and generates an error because it already exists. Now I am looking to reopen the file created in the first simulation, append the data from the next simulation and save it again.

  • What other methods does `hf` have which are not named `create_dataset` but similar to `update_dataset`? – mkrieger1 Jul 29 '21 at 11:38
  • Or alternatively, why do you not use a different name each time, instead of always `'data'`? – mkrieger1 Jul 29 '21 at 11:39
  • Thanks for your comment and the edit! Indeed, I already had the idea of using different names (but I haven't yet figured out how best to do that in the main file). Since I will run the set of 100 simulations several times (with different parameters), this could also become a bit messy. – jey-ronimo Jul 29 '21 at 11:45
  • You could use a resizable dataset (add the `maxshape` parameter when you create it, and use the `Dataset.resize()` method). However, given the cumulative size of your data (100 x 1M x 1M) you might prefer one dataset for each simulation run. Name them `data_###`, where `###` is `i` in the simulation `for` loop. – kcw78 Jul 29 '21 at 13:41

1 Answer


The example below shows how to pull all these ideas together. It creates 2 files:

  1. A single resizable dataset, created with the `maxshape` parameter on the first loop and extended with `Dataset.resize()` on subsequent loops -- output is simulation_runs1.h5.
  2. A unique dataset for each simulation -- output is simulation_runs2.h5.

I created a simple 100x100 NumPy array of random values as the "simulation data", and ran the simulation 10 times. Both sizes are variables, so you can increase them to larger values to determine which method is faster for your data. You may also run into memory limits when saving 1M data points for 1M time periods.
Note 1: If you can't save all the data in system memory, you can incrementally save simulation results to the H5 file. It's just a little more complicated.
Note 2: I added a mode variable to control whether a new file is created for the first simulation (i==0) or the existing file is opened in append mode for subsequent simulations.
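Note 1 can be sketched roughly as follows. This is an assumption about how the incremental writing might look, not code from the question: the dataset for one run is preallocated, and episode values are written in blocks so the full value_history never has to sit in memory. The file name and block size are illustrative.

```python
import os
import h5py
import numpy as np

# Hypothetical sketch: preallocate one run's dataset, then write
# episode values in blocks instead of appending to a Python list.
periods = 1_000_000
block = 100_000  # episodes per write; tune to available memory

os.makedirs('runs', exist_ok=True)
with h5py.File('runs/incremental_demo.h5', 'w') as hf:
    dset = hf.create_dataset('data', shape=(periods,),
                             compression='gzip', chunks=True)
    for start in range(0, periods, block):
        stop = min(start + block, periods)
        # stand-in for the real per-episode doSomething() values
        dset[start:stop] = np.random.random(stop - start)
```

With chunked storage, each block assignment only touches the chunks it overlaps, so peak memory stays around one block rather than one full run.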

import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    periods = 100
    times = 100

    # Define the simulation with some random data
    val_hist = np.random.random(periods*times).reshape(periods,times)    
    a0, a1 = val_hist.shape[0], val_hist.shape[1]
    
    if i == 0:
        mode='w'
    else:
        mode='a'
        
    # Save simulation data (resize dataset)
    with h5py.File('runs/simulation_runs1.h5', mode) as hf:
        if 'data' not in hf:
            print('create new dataset')
            hf.create_dataset('data', shape=(1,a0,a1), maxshape=(None,a0,a1), data=val_hist, 
                              compression='gzip', chunks=True)
        else:
            print('resize existing dataset')
            d0 = hf['data'].shape[0]
            hf['data'].resize( (d0+1,a0,a1) )
            hf['data'][d0:d0+1,:,:] = val_hist
 
    # Save simulation data (unique datasets)
    with h5py.File('runs/simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist, 
                          compression='gzip', chunks=True)

# Run the simulation 10 times (increase to 100 for the real data)
for i in range(10):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
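To get the per-episode averages the question asks for, both file layouts can be read back and reduced over the run axis. The sketch below is self-contained, so it builds two small demo files with the same layouts first; the demo file names and array sizes are assumptions for illustration.

```python
import os
import h5py
import numpy as np

os.makedirs('runs', exist_ok=True)

# Build two small demo files mirroring the layouts above
rng = np.random.default_rng(0)
runs = [rng.random((5, 4)) for _ in range(3)]

with h5py.File('runs/demo_runs1.h5', 'w') as hf:
    hf.create_dataset('data', data=np.stack(runs))   # (n_runs, a0, a1)
with h5py.File('runs/demo_runs2.h5', 'w') as hf:
    for i, r in enumerate(runs):
        hf.create_dataset(f'data_{i:03}', data=r)    # one dataset per run

# Layout 1: average over the run axis of the single 3-D dataset
with h5py.File('runs/demo_runs1.h5', 'r') as hf:
    mean1 = hf['data'][:].mean(axis=0)

# Layout 2: stack the per-run datasets, then average
with h5py.File('runs/demo_runs2.h5', 'r') as hf:
    mean2 = np.mean([hf[name][:] for name in sorted(hf.keys())], axis=0)

print(mean1.shape)  # (5, 4)
```

Both reductions give the same (periods, times) array of averages; for the resizable layout you can also slice one run axis index at a time if the full 3-D array is too large for memory.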
kcw78