I need to store a list/array of strings in an HDF5 file using h5py
. These strings are variable length. Following the examples I find online, I have a script that works.
import h5py
h5File=h5py.File('outfile.h5','w')
data=['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
dset[i] = word
h5File.flush()
h5File.close()
However, when data
gets very large, the write takes a long time as it's looping over each entry and inserting it into the file.
I thought I could do it all in one line, just as I would with ints or floats. But the following script fails. Note that I added some code to test that int
works.
import h5py
h5File=h5py.File('outfile.h5','w')
data_numbers = [0, 1, 2, 3, 4]
data = ['this','is','a','sentence']
dt = h5py.special_dtype(vlen=str)
dset_num = h5File.create_dataset('numbers',(len(data_numbers),1),dtype=int,data=data_numbers)
print("Created the dataset with numbers!\n")
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
print("Created the dataset with strings!\n")
h5File.flush()
h5File.close()
That script gives the following output.
Created the dataset with numbers!
Traceback (most recent call last):
File "write_strings_to_HDF5_file.py", line 32, in <module>
dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 136, in create_dataset
dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 170, in make_new_dset
dset_id.write(h5s.ALL, h5s.ALL, data)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U8')
I've read the documentation about UTF-8 encoding and tried a number of variations on the above syntax but I seem to be missing some key point. Maybe it can't be done?
Thanks to anyone who has a suggestion!
If anyone wants to see the slowdown on the example that works, here's a test case.
import h5py
h5File=h5py.File('outfile.h5','w')
sentence=['this','is','a','sentence']
data = []
for i in range(10000):
data += sentence
print(len(data))
dt = h5py.special_dtype(vlen=str)
dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
dset[i] = word
h5File.flush()
h5File.close()