I expected the following code to work, but it doesn't.

import h5py
import numpy as np

with h5py.File('file.hdf5','w') as hf:
    dt = h5py.special_dtype(vlen=str)
    feature_names = np.array(['a', 'b', 'c'])
    hf.create_dataset('feature names', data=feature_names, dtype=dt)

I get the error message TypeError: No conversion path for dtype: dtype('<U1'). The following code does work, but using a for loop to copy the data seems a bit clunky to me. Is there a more straightforward way to do this? I would prefer to be able to pass the sequence of strings directly into the create_dataset function.

import h5py
import numpy as np

with h5py.File('file.hdf5','w') as hf:
    dt = h5py.special_dtype(vlen=str)
    feature_names = np.array(['a', 'b', 'c'])
    ds = hf.create_dataset('feature names', (len(feature_names),), dtype=dt)

    for i in range(len(feature_names)):
        ds[i] = feature_names[i]

Note: My question follows from this answer to Storing a list of strings to a HDF5 Dataset from Python, but I don't consider it a duplicate of that question.

mhwombat
  • Define "straightforward." Your loop that works is about as "straightforward" as it gets. – Robert Harvey Mar 21 '19 at 14:22
  • @RobertHarvey I was hoping there was a Python type I could use for my sequence/list/vector of variable-length strings that `h5py` could use directly. – mhwombat Mar 21 '19 at 14:41
  • Does `ds[:] = feature_names` work? Or `data=feature_names.astype(object)`? – hpaulj Mar 21 '19 at 16:07
  • @hpaulj `ds[:] = feature_names` works, but your second option doesn't. If you want to turn that into an answer, I'll vote it up. Also, I'll accept it unless someone comes up with a way to pass the list into the `create_dataset` function. – mhwombat Mar 21 '19 at 16:45

1 Answer

You almost had it. The missing detail is to pass dtype to np.array:

import h5py
import numpy as np

with h5py.File('file.hdf5','w') as hf:
    dt = h5py.special_dtype(vlen=str)
    feature_names = np.array(['a', 'b', 'c'], dtype=dt)
    hf.create_dataset('feature names', data=feature_names)

PS: This looks like a bug to me: create_dataset seems to ignore the given dtype and doesn't apply it to the given data.
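
To check that the strings actually round-trip, here is a minimal sketch that writes the names with the approach above and reads them back. The helper name write_and_read_names and the temporary file location are made up for illustration. The decode step is an assumption to cover h5py versions that return bytes rather than str when reading variable-length strings.

```python
import os
import tempfile

import h5py
import numpy as np


def write_and_read_names(path, names):
    """Store a sequence of str as a variable-length string dataset and read it back.

    Hypothetical helper for illustration; not part of h5py.
    """
    dt = h5py.special_dtype(vlen=str)
    # Building the array with the vlen dtype yields an object array tagged for
    # h5py, instead of the fixed-width '<U1' array that raised the TypeError.
    data = np.array(names, dtype=dt)

    with h5py.File(path, 'w') as hf:
        hf.create_dataset('feature names', data=data)

    with h5py.File(path, 'r') as hf:
        raw = hf['feature names'][:]

    # Assumption: depending on the h5py version, elements come back as str or
    # as UTF-8 encoded bytes, so normalise both cases to str.
    return [x.decode('utf-8') if isinstance(x, bytes) else x for x in raw]


with tempfile.TemporaryDirectory() as tmp:
    result = write_and_read_names(os.path.join(tmp, 'file.hdf5'), ['a', 'bb', 'ccc'])
    print(result)
```

The strings have different lengths on purpose, to show that the dataset really is variable-length and not a fixed-width type sized to the first element.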

teegaar