
Hi everyone, I'm trying to append data to a dataset with h5py, and it doesn't seem to work; I'm trying to find out why. `numpy_arr` is a generator that yields structured numpy arrays built like this:


import h5py
import numpy as np

dt_vlstr = h5py.string_dtype(encoding='utf-8')
dt_vlstr_arr = h5py.vlen_dtype(dt_vlstr)
dt_int = np.dtype('i4')

constructor = {
    'names': ['ark', 'dates', 'events', 'iso639', 'views', 'dublincore', 'catch_word', 'colors', 'bw'],
    'formats': [
        dt_vlstr,     #ark
        dt_vlstr_arr, #dates
        dt_vlstr_arr, #events
        dt_vlstr_arr, #iso639
        dt_int,       #lecture
        dt_vlstr,     #dublincore
        dt_vlstr_arr, #catch_word
        dt_int,       #colors
        dt_int,       #bw
    ]}

compound = np.dtype(constructor)

def mapping2numpy(generators):
    for i in (generators):
        numpy_arr = np.array([(
                            i['ark'],
                            i['dates'].astype(dt_vlstr), 
                            i['events'].astype(dt_vlstr), 
                            i['iso639'].astype(dt_vlstr), 
                            i['views'], 
                            i['dublincore'], 
                            i['catch_word'].astype(dt_vlstr),
                            i['colors'],
                            i["bw"])],dtype=compound)

        yield numpy_arr

numpy_arr = mapping2numpy(data)

with h5py.File('file.h5', 'w') as h5f:
    group = h5f.create_group('metadata')
    dataset = group.create_dataset('records', (1,1), maxshape=(None,1),
                                   compression="lzf",
                                   dtype=compound,
                                   fletcher32=True,
                                   chunks=(1,1))

with h5py.File('file.h5', 'a') as h5f: 
    dset = h5f['metadata/records']
    for data in numpy_arr:
        dset.resize( (dset.shape[0]+1, 1) )
        dset[-1,:] = data


  • `doesn't work` doesn't tell us anything useful! Do you want me to vote to close for lack of debugging details? – hpaulj Mar 18 '23 at 22:28
  • Just to be clear, is `numpy_arr` a structured array, for which you showed one record? What is `numpy_arr.dtype` then? – hpaulj Mar 18 '23 at 23:48
  • Please add the error message. Your process "should" work. However adding data 1 row at a time to a compound dataset **can be very slow**. If you need to add a lot of rows, it would be much better/faster to create large np.arrays for loading (max you can hold in memory). – kcw78 Mar 19 '23 at 01:35
  • There is no traceback and no errors; the dataset is just empty at the end. – Gustave Turrell Mar 19 '23 at 10:32

1 Answer


It's hard to completely diagnose your process without your data. That said, I see several little things that could cause problems. Here's what I noticed:

  1. When you create the dataset, you set shape=(1,1) and maxshape=(None,1). There are 2 issues with this:
    a. For a compound dataset, shape (and maxshape) should be a 1-D tuple, e.g.: shape=(1,) and maxshape=(None,).
    b. You created the dataset with an empty row, then add a new row each time you add data. So, the 1st row is always empty. Not an error, but it would be better to set shape=(0,) when you create the dataset.
  2. chunks needs to match the dataset's dimensionality, but don't set chunks=(1,). I left it out and let h5py set the default chunk size. Chunk size controls I/O performance, and (1,) is the smallest possible chunk size you could request; it could create a severe I/O bottleneck. (See the short sketch right after this list.)
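
For reference, here is a minimal sketch of just the corrected create_dataset() call. The 2-field dtype and file name are made up for illustration; the point is the 1-D shape/maxshape and letting h5py pick the chunk size:

import h5py
import numpy as np

# hypothetical 2-field compound dtype, only to illustrate the shape/maxshape fix
dt = np.dtype([('ark', h5py.string_dtype(encoding='utf-8')),
               ('views', 'i4')])

with h5py.File('sketch.h5', 'w') as h5f:
    grp = h5f.create_group('metadata')
    # 1-D shape/maxshape for a compound dataset; start empty; no chunks= argument
    dset = grp.create_dataset('records', shape=(0,), maxshape=(None,),
                              dtype=dt, compression="lzf", fletcher32=True)
    print(dset.chunks)   # chunk size chosen automatically by h5py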

Here is an example using your generator. I simplified your example to only have 3 fields (1 each with dtype=dt_vlstr, dt_vlstr_arr, and int).

import h5py
import numpy as np

def mapping2numpy(generators):
    for i in (generators):
        numpy_arr = np.array([(
                            i['ark'],
                            i['dates'].astype(dt_vlstr), 
                            i['views'])], 
                            dtype=compound)

        yield numpy_arr


dt_vlstr = h5py.string_dtype(encoding='utf-8')
dt_vlstr_arr = h5py.vlen_dtype(dt_vlstr) 
dt_int = np.dtype('i4')

constructor = {'names':['ark','dates','views'],
'formats': [dt_vlstr, #ark
            dt_vlstr_arr, #dates
            dt_int, #lecture / views
            ]}            
compound = np.dtype(constructor)

# using generator to convert data:
data_arr1 = np.empty((3,), dtype=compound)
data_arr1[:]['ark'] = ['row_0', 'row_1_long', 'row_2_longest']
data_arr1[:]['dates'] = [np.array([['a', 'bbb'],['cc', 'ddd']]),
                        np.array([['i', 'jjj'],['kk', 'lll']]),
                        np.array([['w', 'xxx'],['yy', 'zzz']]) ]
data_arr1[:]['views'] = [i for i in range(1,4)]

numpy_arr = mapping2numpy(data_arr1)

with h5py.File('file.h5', 'w') as h5f:
    group = h5f.create_group('metadata')
    group.create_dataset('records1', (0,), maxshape=(None,), dtype=compound)
with h5py.File('file.h5', 'a') as h5f: 
    dset = h5f['metadata/records1']
    for i, data in enumerate(numpy_arr):
        dset.resize((dset.shape[0]+1,))
        dset[i] = data
    print(dset[:])
    print(dset.chunks)

I don't know why you wrote the generator. I suspect you need to convert some data object types to the variable-length string and array types. As I mentioned in my comment, loading a dataset row-by-row is the slowest way you can do this. It won't matter if you are only loading 1,000 rows. However, the process will be very slow if you need to load a lot of rows (say, 10e6). See this Q&A for details: pytables writes much faster than h5py. Why? Ignore the part about PyTables and focus on the performance issue when you write frequently with a small number of rows.
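
If you want to keep the generator but avoid the row-by-row penalty, a middle ground is to buffer rows and write them in blocks. Here is a minimal sketch of that idea, continuing from the example above (the write_in_batches() helper and the batch size of 1000 are my own additions, not part of the original code):

def write_in_batches(dset, row_generator, batch_size=1000):
    # collect rows from the generator, then do one resize + slice assignment per block
    buffer = []
    for row in row_generator:
        buffer.append(row)
        if len(buffer) == batch_size:
            block = np.concatenate(buffer)      # (batch_size,) structured array
            start = dset.shape[0]
            dset.resize((start + block.shape[0],))
            dset[start:] = block
            buffer = []
    if buffer:                                  # flush the partial last block
        block = np.concatenate(buffer)
        start = dset.shape[0]
        dset.resize((start + block.shape[0],))
        dset[start:] = block

with h5py.File('file.h5', 'a') as h5f:
    write_in_batches(h5f['metadata/records1'], mapping2numpy(data_arr1), batch_size=1000)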

Here is my example modified to show writing directly from a recarray, both row-by-row (to records2) and all at once (to records3). I highly recommend the last method (but with more than 3 rows at a time). It continues from the code above.

# loading data directly from numpy recarray:
data_arr2 = np.empty((3,), dtype=compound)
data_arr2[:]['ark'] = np.array(['row_0', 'row_1_long', 'row_2_longest'], dtype=dt_vlstr)
data_arr2[:]['dates'] = [np.array([['a', 'bbb'],['cc', 'ddd']]).astype(dt_vlstr),
                        np.array([['i', 'jjj'],['kk', 'lll']]).astype(dt_vlstr),
                        np.array([['w', 'xxx'],['yy', 'zzz']]).astype(dt_vlstr) ]
data_arr2[:]['views'] = [i for i in range(1,4)]
    
# loading row-by-row -- NOT recommended
with h5py.File('file.h5', 'a') as h5f: 
    h5f['metadata'].create_dataset('records2', (0,), maxshape=(None,), dtype=compound)
    dset = h5f['metadata/records2']
    for i, data in enumerate(data_arr2):
        dset.resize((dset.shape[0]+1,))
        dset[i] = data
    print(dset[:])
    print(dset.chunks)   
    
# loading all at once -- preferred method
with h5py.File('file.h5', 'a') as h5f:     
    h5f['metadata'].create_dataset('records3', data=data_arr2, maxshape=(None,))
    print(h5f['metadata/records3'][:])   
    print(h5f['metadata/records3'].chunks)   
kcw78