
Creating a dataset from multiple HDF5 groups

I read the data from my groups with:

np.array(hdf.get('all my groups'))

I then added code to create a single dataset from the groups:

with h5py.File('/train.h5', 'w') as hdf:
    hdf.create_dataset('train', data=one_T+two_T+three_T+four_T+five_T)

The error message is:

ValueError: operands could not be broadcast together with shapes (534456,4) (534456,14)

The number of rows in each group is the same; only the number of columns varies. I want to combine 5 separate groups into one dataset.
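
For context, `+` on NumPy arrays performs element-wise addition, which broadcasts the operands and fails when the column counts differ; what is wanted here is concatenation along the column axis. A minimal sketch with small dummy arrays (the shapes are illustrative only):

import numpy as np

a = np.ones((6, 4))
b = np.ones((6, 14))

# a + b raises:
# ValueError: operands could not be broadcast together with shapes (6,4) (6,14)

combined = np.hstack([a, b])  # concatenate along axis 1 (columns)
print(combined.shape)         # (6, 18)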

  • When you say "all my groups", do you mean datasets or groups? HDF5 stores data in datasets. Groups are similar to folders. I assume you want to concatenate data from multiple datasets (in 1 file) into 1 dataset in another file. If so, this can be done by looping over the group keys (dataset names), then copying each dataset to a numpy array, writing the array to the new file/dataset, and repeating for each dataset. – kcw78 Feb 04 '21 at 20:20
  • Yes, that's it. – David Johnson Feb 04 '21 at 21:15
  • You have datasets with different shapes: (534456,4) and (534456,14). Are the other datasets of compatible shape (534456, #)? If so, I assume the new dataset will append along the 1-axis with a resulting shape of (534456,n1+n2+n3+n4+n5). Correct? Also, all datasets need to have the same dtype (all floats or ints, etc.). Do you need an example of how to do this? – kcw78 Feb 04 '21 at 23:16
  • Yes, please, they are the same. You are correct about appending across the 1-axis. Yes, they do have the same dtype. – David Johnson Feb 05 '21 at 10:42
  • An example would be great thank you. – David Johnson Feb 05 '21 at 14:29

2 Answers


Here you go; a simple example to copy values from 3 datasets in file1 to a single dataset in file2. I included some tests to verify compatible dtype and shape. The code to create file1 is included at the top. Comments in the code should explain the process. I have another post that shows multiple ways to copy data between 2 HDF5 files. See this post: How can I combine multiple .h5 file?

import h5py
import numpy as np
import sys

# Data for file1
arr1 = np.random.random(80).reshape(20,4)
arr2 = np.random.random(40).reshape(20,2)
arr3 = np.random.random(60).reshape(20,3)

#Create file1 with 3 datasets
with h5py.File('file1.h5','w') as h5f :
    h5f.create_dataset('ds_1',data=arr1)
    h5f.create_dataset('ds_2',data=arr2)
    h5f.create_dataset('ds_3',data=arr3)
 
# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1 , \
     h5py.File('file2.h5','w') as h5f2 :

# Loop over datasets in file1 and check data compatibility
    for i, ds in enumerate(h5f1.keys()) :
        if i == 0:
            ds_0 = ds
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            n_cols = h5f1[ds].shape[1]
        else:
            if h5f1[ds].dtype != ds_0_dtype :
                print(f'Dset 0:{ds_0}: dtype:{ds_0_dtype}')
                print(f'Dset {i}:{ds}: dtype:{h5f1[ds].dtype}')
                sys.exit('Error: incompatible dataset dtypes')

            if h5f1[ds].shape[0] != n_rows :
                print(f'Dset 0:{ds_0}: shape[0]:{n_rows}')
                print(f'Dset {i}:{ds}: shape[0]:{h5f1[ds].shape[0]}')
                sys.exit('Error: incompatible dataset shape')

            n_cols += h5f1[ds].shape[1]

# Create new empty dataset with appropriate dtype and size
# Using maxshape parameter to make resizable in the future
    h5f2.create_dataset('ds_123', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))
    
# Loop over datasets in file1, read data into xfer_arr, and write to file2        
    first = 0
    for ds in h5f1.keys() :
        xfer_arr = h5f1[ds][:]
        last = first + xfer_arr.shape[1]
        h5f2['ds_123'][:, first:last] = xfer_arr[:]
        first = last
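
As a quick sanity check (a sketch, not part of the answer itself), read file2 back and confirm the combined shape, here (20, 4+2+3) = (20, 9):

with h5py.File('file2.h5','r') as h5f2 :
    print(h5f2['ds_123'].shape)  # expect (20, 9)
    print(h5f2['ds_123'].dtype)  # expect float64 (from np.random.random)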
kcw78
  • Thank you. One slight problem was the order of the datasets when combining. Like you have ordered them: ds_1, ds_2 and ds_3. Combining them worked, but the created ds_123 dataset was in what looks to be a random order: ds_2, ds_1, ds_3. Any ideas? – David Johnson Feb 08 '21 at 17:19
  • I didn't define the read order. The datasets are processed in the order `h5f1.keys()` produces the names/keys. (In my test, they are processed in 1,2,3 order; but that could be dumb luck.) If you know the names a priori, you can handle this with a list: `ds_list = ['ds_1', 'ds_2', 'ds_3']`, then replace `h5f1.keys()` with `ds_list` and you're set. It's harder to control the order if you don't know the names. To process in alphabetical order, create a list from the keys, then use `.sort()` on the list (see the sketch after these comments). – kcw78 Feb 08 '21 at 21:43
  • Thank you, I do know the list. Is it possible to drop columns from specific lists? – David Johnson Feb 09 '21 at 09:59
  • Please clarify. Do you want to copy a subset of fields (columns) from each dataset? For example, Col_1 and Col_2 from ds_1, Col_2 from ds_2, and Col_3 and Col_4 from ds_3? If so, you can do that by modifying the `xfer_arr` slice. Using `[:]` reads the entire dataset. To read only the first 2 columns of a 2d array, change it to `[:, 0:2]`. If you do this, you need to be precise with slice notation (reading and writing). – kcw78 Feb 09 '21 at 15:29
  • an example would be ds_1 all columns, ds_2 first two columns, ds_3 column 4 and 6, ds_4 all columns. – David Johnson Feb 10 '21 at 10:57
  • I created a new answer (below) that copies some columns of each dataset (to avoid confusion with "simple" answer above). – kcw78 Feb 10 '21 at 15:05
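
A minimal sketch of the ordering approach described in the comments above, assuming the dataset names are known a priori:

import h5py

with h5py.File('file1.h5','r') as h5f1 :
    ds_list = ['ds_1', 'ds_2', 'ds_3']   # explicit copy order
    # or, alphabetical order when the names are not known:
    # ds_list = sorted(h5f1.keys())
    for ds in ds_list :
        print(ds, h5f1[ds].shape)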

This answer addresses the OP's request in the comments to my first answer ("an example would be ds_1 all columns, ds_2 first two columns, ds_3 column 4 and 6, ds_4 all columns"). The process is very similar, but the input is "slightly more complicated" than in the first answer. As a result, I used a different approach to define the dataset names and columns to be copied. Differences:

  • The first solution iterates over the dataset names from the "keys()" (copying each dataset completely, appending to a dataset in the new file). The size of the new dataset is calculated by summing sizes of all datasets.
  • The second solution uses 2 lists to define 1) the dataset names (ds_list) and 2) the associated columns to copy from each dataset (col_list is a list of lists). The size of the new dataset is calculated by summing the number of columns in col_list. I used "fancy indexing" to extract the columns using col_list (see the short demo after this list).
  • How you decide to do this depends on your data.
  • Note: for simplicity, I deleted the dtype and shape tests. You should include these to avoid errors with "real world" problems.
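
For readers unfamiliar with fancy indexing, a short NumPy demo (illustrative only); h5py datasets accept the same syntax as long as the index list is in increasing order:

import numpy as np

arr = np.arange(12).reshape(2,6)
print(arr[:, [3,5]])   # selects columns 3 and 5 -> shape (2, 2)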

Code below:

import h5py
import numpy as np

# Data for file1
arr1 = np.random.random(120).reshape(20,6)
arr2 = np.random.random(120).reshape(20,6)
arr3 = np.random.random(120).reshape(20,6)
arr4 = np.random.random(120).reshape(20,6)

# Create file1 with 4 datasets
with h5py.File('file1.h5','w') as h5f :
    h5f.create_dataset('ds_1',data=arr1)
    h5f.create_dataset('ds_2',data=arr2)
    h5f.create_dataset('ds_3',data=arr3)
    h5f.create_dataset('ds_4',data=arr4)
 
# Open file1 for reading and file2 for writing
with h5py.File('file1.h5','r') as h5f1 , \
     h5py.File('file2.h5','w') as h5f2 :

# Loop over datasets in file1 to get dtype and rows (should test compatibility)        
    for i, ds in enumerate(h5f1.keys()) :
        if i == 0:
            ds_0_dtype = h5f1[ds].dtype
            n_rows = h5f1[ds].shape[0]
            break

# Create new empty dataset with appropriate dtype and size
# Use maxshape parameter to make resizable in the future

    ds_list = ['ds_1','ds_2','ds_3','ds_4']
    col_list = [ [0,1,2,3,4,5], [0,1], [3,5], [0,1,2,3,4,5] ]
    n_cols = sum( [ len(c) for c in col_list])
    h5f2.create_dataset('combined', dtype=ds_0_dtype, shape=(n_rows,n_cols), maxshape=(n_rows,None))
    
# Loop over datasets in file1, read data into xfer_arr, and write to file2        
    first = 0  
    for ds, cols in zip(ds_list, col_list) :
        xfer_arr = h5f1[ds][:,cols]
        last = first + xfer_arr.shape[1]
        h5f2['combined'][:, first:last] = xfer_arr[:]
        first = last
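
As before, a quick sanity check (a sketch): the combined dataset should have 6+2+2+6 = 16 columns.

with h5py.File('file2.h5','r') as h5f2 :
    print(h5f2['combined'].shape)  # expect (20, 16)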
kcw78