
I am currently running 100 simulations, each of which computes 1M values (i.e. one value per episode/iteration).

Main Routine

My main file looks like this:

# Defining the test simulation environment
def test_simulation():
    environment = Environment(
        periods=1000000,
        parameter_x=...,
        parameter_y=...,
    )

    # Run the simulation
    environment.simulation()

# Run the simulation 100 times
for i in range(100):
    print(f'--- Iteration {i} ---')
    test_simulation()

The simulation procedure is as follows: within simulation() I generate a value_history that is continuously appended to:

def simulation(self):
    for episode in range(self.periods):
        value = doSomething()
        self.value_history.append(value)

Hence, for each episode/iteration I compute one value, which is an array, e.g. [1.4 1.9] (1.4 for player 1 and 1.9 for player 2 in the current episode/iteration).
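A runnable sketch of that pattern (the Environment class and the random draw standing in for doSomething() are assumptions, since the full code isn't shown):

```python
import numpy as np

class Environment:
    """Hypothetical stand-in for the simulation environment."""
    def __init__(self, periods):
        self.periods = periods
        self.value_history = []

    def simulation(self):
        for episode in range(self.periods):
            # stand-in for doSomething(): one value per player
            value = np.random.random(2)
            self.value_history.append(value)

env = Environment(periods=5)
env.simulation()
# one 2-element array per episode
print(len(env.value_history), env.value_history[0].shape)
```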

Storing of Simulation Data

To store the data, I use the approach proposed in Append simulation data using HDF5, which works perfectly fine.

After running the simulations, I receive the following Keys structure:

Keys: <KeysViewHDF5 ['data_000', 'data_001', 'data_002', ..., 'data_099']>

Computing Statistics for Files

Now the goal is to compute averages and standard deviations for each value across the 100 data sets, so that the final_data set consists of 1M averages and 1M standard deviations (one average and one standard deviation per row, for each player, across the 100 simulations).

The goal would thus be to get something like the following structure [average_player1, average_player2], [std_player1, std_player2]:

episode == 1: [1.5, 1.5], [0.1, 0.2]
episode == 2: [1.4, 1.6], [0.2, 0.3]
...
episode == 1000000: [1.7, 1.6], [0.1, 0.3] 

I currently use the following code to extract the data and store it in a list:

def ExtractSimData(name, simulation_runs, length):
    # Create empty list
    result = []

    # Open the simulation run file
    filename = f"runs/{length}/{name}_simulation_runs2.h5"

    with h5py.File(filename, "r") as hf:

        # List all groups
        print("Keys: %s" % hf.keys())

        for i in range(simulation_runs):
            a_group_key = list(hf.keys())[i]
            data = list(hf[a_group_key])

            for element in data:
                result.append(element)

    return result

The data structure of result looks something like this:

[array([1.9, 1.7]), array([1.4, 1.9]), array([1.6, 1.5]), ...]

First Attempt to Compute Means

I tried to use the following code to come up with a mean score for the first element (the array consists of two elements since there are two players in the simulation):

mean_result = [np.mean(k) for k in zip(*list(result))]

However, this computes the average of each element position across the whole list, since I appended every data set to the same list. My goal, however, is to compute an average/standard deviation across the 100 data sets defined above (i.e. one value per episode that is the average/standard deviation across all 100 data sets).
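To illustrate the problem on toy data (simulation/episode counts shrunk; the reshape-based fix shown here is one possibility, assuming the rows were appended data set by data set, in order):

```python
import numpy as np

sims, episodes = 3, 4
# result as built by the extraction loop: all rows from all
# datasets appended into one flat list of 2-element arrays
result = [np.random.random(2) for _ in range(sims * episodes)]

# zip(*result) mixes every row of every simulation together,
# yielding just 2 numbers overall (one per player):
wrong = [np.mean(k) for k in zip(*result)]

# Reshaping to (simulations, episodes, players) lets axis=0
# average each episode across the simulations instead:
arr = np.asarray(result).reshape(sims, episodes, 2)
means = arr.mean(axis=0)  # shape (episodes, 2)
stds = arr.std(axis=0)    # shape (episodes, 2)
print(len(wrong), means.shape, stds.shape)
```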

Is there any way to efficiently accomplish this?

  • The way to do this efficiently is to load your data into NumPy arrays then use the NumPy methods (which are super-fast). To help, I need to understand your data schema. What is the shape of `element`? I'm guessing `(2,1M)` - where the 1st index is player 1 or 2, and the 2nd index is a value? How are the 1M episodes saved? Do you want to calculate statistics for players 1 and 2 for all rows across all 100 datasets in 1 file? Note: Your function opens a filename built from `name` and `length` parameters. – kcw78 Aug 04 '21 at 14:17
  • Thank you very much for your comment. The shape of `element` is always like this: `[0.24 0.32]`, with the first index being player 1's value and the second index being player 2's value. Hence, each element has a length of 2 (the number of players) for the 1M episodes in one simulation. Ideally, I would like to compute statistics for both players and store them in one final file. I saved the files according to your solution in https://stackoverflow.com/questions/68574916/append-simulation-data-using-hdf5. However, I was uncertain whether I open the files correctly to create a list with the values. – jey-ronimo Aug 04 '21 at 14:35
  • Still trying to understand how simulations, episodes, and values map to your data schema. Please verify: 100 simulations == 100 datasets (from `hf.keys()`) ; 1M values == number of rows in each simulation dataset `data_nnn`; and 1M episodes = 1M HDF5 files. Is that right? If not, please clarify mapping of simulations, episodes, and values to files, datasets and dataset shape. Understanding the data schema is critical. Coding is easy once that is defined. – kcw78 Aug 04 '21 at 18:20
  • Thank you for the follow-up questions. I made some additional edits in the text, which should clarify how the components relate. The data scheme is almost correct: 100 simulations == 100 datasets from ```hf.keys()```; 1M values == number of rows (==episodes) in each simulation dataset ```data_nnn``` (array with a value for each player). Hence, the episodes correspond to the rows, i.e. in each episode/iteration a value is computed, which looks like this: ```in episode 1: [0.24 0.32]```, ```in episode 2: [0.34 0.42]```. – jey-ronimo Aug 04 '21 at 18:46

1 Answer


This calculates mean and standard deviation of episode/player values across multiple datasets in 1 file. I think it's what you want to do. If not, I can modify as needed. (Note: I created a small pseudo-data HDF5 file to replicate what you describe. For completeness, that code is at the end of this post.)

Outline of steps in the procedure summarized below (after opening the file):

  1. Get basic size info from the file: dataset count and number of dataset rows.
  2. Use the values above to size arrays for player 1 and 2 values (variables p1_arr and p2_arr). shape[0] is the episode (row) count, and shape[1] is the simulation (dataset) count.
  3. Loop over all datasets. I used hf.keys() (which iterates over the dataset names). You could also iterate over the names in the list ds_names created earlier. (I created it to simplify the size calculations in step 2.) The enumerate() counter i is used to index the episode values for each simulation into the correct column of each player array.
  4. To get the mean and standard deviation for each row, use the np.mean() and np.std() functions with the axis=1 parameter. That calculates the mean across each row of simulation results.
  5. Next, load the data into the result datasets. I created 2 datasets (same data, different dtypes) as described below:
    a. The 'final_data' dataset is a simple float array of shape=(# of episodes, 4), where you need to know what value is in each column. (I suggest adding an attribute to document it.)
    b. The 'final_data_named' dataset uses a NumPy recarray so you can name the fields (columns). It has shape=(# of episodes,). You access each column by name.

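For example, named-field access works like this (a minimal sketch with a plain NumPy structured array standing in for the HDF5 dataset, using made-up values):

```python
import numpy as np

# Structured dtype with one named field per statistic
dt = np.dtype([('average_player1', float), ('std_player1', float),
               ('average_player2', float), ('std_player2', float)])

# 3 episodes standing in for the 1M-row 'final_data_named' dataset
final = np.zeros(3, dtype=dt)
final['average_player1'] = [1.5, 1.4, 1.7]
final['average_player2'] = [1.5, 1.6, 1.6]

# Columns are read back by field name, not by position
combined = list(zip(final['average_player1'], final['average_player2']))
print(combined)
```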
A note on statistics: calculations are sensitive to the sum() operator's behavior over the range of values. If your data is well defined, the NumPy functions are appropriate. I investigated this a few years ago. See this discussion for all the details: when to use numpy vs statistics modules
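As a quick sanity check of that point: for well-scaled double-precision data, NumPy's pairwise summation and the exact rational arithmetic in the statistics module agree to near machine precision (a sketch, not a benchmark):

```python
import statistics
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(10_000)  # well-scaled values in [0, 1)

np_mean = np.mean(data)                      # fast, pairwise summation
exact_mean = statistics.mean(data.tolist())  # exact Fraction-based arithmetic, slow

# difference is on the order of machine epsilon for data like this
print(abs(np_mean - exact_mean))
```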

Code to read and calculate statistics below.

import h5py
import numpy as np

def ExtractSimData(name, simulation_runs, length):

    # Call the simulation run file
    filename = f"runs/{length}/{name}simulation_runs2.h5"
    with h5py.File(filename, "a") as hf:
        # List all dataset names
        ds_names = list(hf.keys())
        print(f'Dataset names (keys): {ds_names}')

        # Create empty arrays for player1 and player2 episode values
        sim_cnt = len(ds_names)
        print(f'# of simulation runs (dataset count) = {sim_cnt}')
        ep_cnt = hf[ ds_names[0] ].shape[0]
        print(f'# of episodes (rows) in each dataset = {ep_cnt}')
        p1_arr = np.empty((ep_cnt,sim_cnt))
        p2_arr = np.empty((ep_cnt,sim_cnt))
        
        for i, ds in enumerate(hf.keys()): # each dataset is 1 simulation               
            p1_arr[:,i] = hf[ds][:,0]
            p2_arr[:,i] = hf[ds][:,1]
                
        ds1 = hf.create_dataset('final_data', shape=(ep_cnt,4), 
                          compression='gzip', chunks=True)   
        ds1[:,0] = np.mean(p1_arr, axis=1)
        ds1[:,1] = np.std(p1_arr, axis=1)
        ds1[:,2] = np.mean(p2_arr, axis=1)
        ds1[:,3] = np.std(p2_arr, axis=1)        

        dt = np.dtype([ ('average_player1',float), ('average_player2',float), 
                        ('std_player1',float), ('std_player2',float) ] )
        ds2 = hf.create_dataset('final_data_named', shape=(ep_cnt,), dtype=dt, 
                          compression='gzip', chunks=True)   
        ds2['average_player1'] = np.mean(p1_arr, axis=1)
        ds2['std_player1'] = np.std(p1_arr, axis=1)
        ds2['average_player2'] = np.mean(p2_arr, axis=1)
        ds2['std_player2'] = np.std(p2_arr, axis=1)        

### main ###
simulation_runs = 10
length='01'
name='test_'
ExtractSimData(name, simulation_runs, length)  

Code to create pseudo-data HDF5 file below.

import h5py
import numpy as np

# Create some pseudo-test data
def test_simulation(i):
    players = 2
    periods = 1000

    # Define the simulation with some random data
    val_hist = np.random.random(periods*players).reshape(periods,players)    
    
    if i == 0:
        mode='w'
    else:
        mode='a'
    # Save simulation data (unique datasets)
    with h5py.File('runs/01/test_simulation_runs2.h5', mode) as hf:
        hf.create_dataset(f'data_{i:03}', data=val_hist, 
                          compression='gzip', chunks=True)

# Run the simulation N times
simulations = 10
for i in range(simulations):
    print(f'--- Iteration {i} ---')
    test_simulation(i)
kcw78
  • Thank you very much for the suggestion and detailed documentation! This one works perfectly. I just combined the two single averages together with `combined_averages = list(zip(ds2['average_player1'] , ds2['average_player2']))` so I just have them in one list. – jey-ronimo Aug 05 '21 at 08:05