Storing a list of strings to a HDF5 Dataset from Python using VL format

Question

I expected the following code to work, but it doesn't.

import h5py
import numpy as np

with h5py.File('file.hdf5','w') as hf:
    dt = h5py.special_dtype(vlen=str)
    feature_names = np.array(['a', 'b', 'c'])
    hf.create_dataset('feature names', data=feature_names, dtype=dt)

I get the error message TypeError: No conversion path for dtype: dtype('<U1'). The following code does work, but using a for loop to copy the data seems a bit clunky to me. Is there a more straightforward way to do this? I would prefer to be able to pass the sequence of strings directly into the create_dataset function.

import h5py
import numpy as np

with h5py.File('file.hdf5','w') as hf:
    dt = h5py.special_dtype(vlen=str)
    feature_names = np.array(['a', 'b', 'c'])
    ds = hf.create_dataset('feature names', (len(feature_names),), dtype=dt)

    for i in range(len(feature_names)):
        ds[i] = feature_names[i]

Note: My question follows from this answer to Storing a list of strings to a HDF5 Dataset from Python, but I don't consider it a duplicate of that question.

Define "straightforward." Your loop that works is about as "straightforward" as it gets. — Robert Harvey, Mar 21 '19 at 14:22
@RobertHarvey I was hoping that there was a Python type that I could use for my sequence/list/vector of variable-length strings and that could be used directly by `h5py`. — mhwombat, Mar 21 '19 at 14:41
Does `ds[:] = feature_names` work? Or `data=feature_names.astype(object)`? — hpaulj, Mar 21 '19 at 16:07
@hpaulj `ds[:] = feature_names` works, but your second option doesn't. If you want to turn that into an answer, I'll vote it up. Also, I'll accept it unless someone comes up with a way to pass the list into the `create_dataset` function. — mhwombat, Mar 21 '19 at 16:45
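
For reference, the slice-assignment variant discussed in these comments might look like the sketch below. It assumes h5py and NumPy are installed; the temporary path and the names `path`, `stored_shape`, and `read_back` are illustrative and not from the original question. Depending on the h5py version, elements may be read back as `bytes` or `str`, so the sketch decodes defensively.

```python
import os
import tempfile

import h5py
import numpy as np

feature_names = np.array(['a', 'b', 'c'])

# Hypothetical scratch location so the example does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), 'file.hdf5')

with h5py.File(path, 'w') as hf:
    dt = h5py.special_dtype(vlen=str)
    # Create an empty variable-length string dataset of the right shape...
    ds = hf.create_dataset('feature names', (len(feature_names),), dtype=dt)
    # ...then fill it with one slice assignment instead of a Python loop.
    ds[:] = feature_names
    stored_shape = ds.shape

with h5py.File(path, 'r') as hf:
    raw = hf['feature names'][:]
    # Elements may be bytes or str depending on the h5py version.
    read_back = [s.decode('utf-8') if isinstance(s, bytes) else s for s in raw]
```

This still takes two steps (create, then assign), so it does not pass the strings directly into `create_dataset`, but it avoids the element-by-element loop.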

score 10 · Accepted Answer · answered Jul 03 '19 at 13:23

You almost had it. The missing detail was to pass the dtype to np.array:

import h5py
import numpy as np

with h5py.File('file.hdf5','w') as hf:
    dt = h5py.special_dtype(vlen=str)
    feature_names = np.array(['a', 'b', 'c'], dtype=dt)
    hf.create_dataset('feature names', data=feature_names)

PS: This looks like a bug to me: create_dataset does not apply the given dtype to the given data.
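
A quick way to check the approach above is a round trip: write the array, reopen the file, and inspect what was stored. This is only a sketch, assuming h5py and NumPy are installed; the temporary path and the names `path`, `vlen_base`, and `names` are illustrative additions. Older and newer h5py versions differ on whether strings come back as `str` or `bytes`, so the sketch handles both.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch location so the example does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), 'file.hdf5')

with h5py.File(path, 'w') as hf:
    dt = h5py.special_dtype(vlen=str)
    # Build the array with the variable-length string dtype up front.
    feature_names = np.array(['a', 'b', 'c'], dtype=dt)
    hf.create_dataset('feature names', data=feature_names)

with h5py.File(path, 'r') as hf:
    ds = hf['feature names']
    # check_dtype reports the base type of a variable-length dtype, if any.
    vlen_base = h5py.check_dtype(vlen=ds.dtype)
    # Elements may be bytes or str depending on the h5py version.
    names = [s.decode('utf-8') if isinstance(s, bytes) else s for s in ds[:]]

print(vlen_base, names)
```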

Thanks for this answer, it saved me several hours! — Li Wang, Nov 12 '21 at 07:14
