Can I use h5py to write strings to an HDF5 file in one line, rather than looping over entries?

Question

I need to store a list/array of strings in an HDF5 file using h5py. These strings are variable length. Following the examples I find online, I have a script that works.

import h5py
  
h5File=h5py.File('outfile.h5','w')

data=['this','is','a','sentence']

dt = h5py.special_dtype(vlen=str)

dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
    dset[i] = word

h5File.flush()
h5File.close()

However, when data gets very large, the write takes a long time as it's looping over each entry and inserting it into the file.

I thought I could do it all in one line, just as I would with ints or floats. But the following script fails. Note that I added some code to test that int works.

import h5py

h5File=h5py.File('outfile.h5','w')

data_numbers = [0, 1, 2, 3, 4]
data = ['this','is','a','sentence']

dt = h5py.special_dtype(vlen=str)

dset_num = h5File.create_dataset('numbers',(len(data_numbers),1),dtype=int,data=data_numbers)
print("Created the dataset with numbers!\n")

dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
print("Created the dataset with strings!\n")

h5File.flush()
h5File.close()

That script gives the following output.

Created the dataset with numbers!

Traceback (most recent call last):
  File "write_strings_to_HDF5_file.py", line 32, in <module>
    dset_str = h5File.create_dataset('words',(len(data),1),dtype=dt,data=data)
  File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/dataset.py", line 170, in make_new_dset
    dset_id.write(h5s.ALL, h5s.ALL, data)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 211, in h5py.h5d.DatasetID.write
  File "h5py/h5t.pyx", line 1652, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1713, in h5py.h5t.py_create
TypeError: No conversion path for dtype: dtype('<U8')

I've read the documentation about UTF-8 encoding and tried a number of variations on the above syntax but I seem to be missing some key point. Maybe it can't be done?

Thanks to anyone who has a suggestion!

If anyone wants to see the slowdown on the example that works, here's a test case.

import h5py

h5File=h5py.File('outfile.h5','w')

sentence=['this','is','a','sentence']
data = []

for i in range(10000):
    data += sentence

print(len(data))

dt = h5py.special_dtype(vlen=str)

dset = h5File.create_dataset('words',(len(data),1),dtype=dt)
for i,word in enumerate(data):
    dset[i] = word

h5File.flush()
h5File.close()

Another recent string writing question, https://stackoverflow.com/q/68430470/901925 — hpaulj, Jul 23 '21 at 14:50

score 2 · Accepted Answer · answered Jul 23 '21 at 15:27

Writing data 1 row at a time is the slowest way to write to an HDF5 file. You won't notice the performance issue when you write 100 rows, but you will see it as the number of rows increases. There is another answer that discusses that issue. See this: pytables writes much faster than h5py. Why? (Note: I am NOT suggesting you use PyTables. The linked answer shows performance for both h5py and PyTables.) As you can see, it takes a lot longer to write the same amount of data when you write it as many small chunks.

To improve performance, you need to write more data each time. Since you have all the data loaded in the list data, you can do it in one shot. It will be nearly instantaneous for 10,000 rows. The answer referenced in the comments touches on this technique (creating a np.array() from the list data). However, it works from small lists (1/row)...so not exactly the same. You have to take care when you create the array. You can't use NumPy's default Unicode dtype -- it isn't supported by h5py. Instead, you need dtype='S#', where # is the length of the longest string.
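To make the dtype difference concrete, here is a minimal NumPy-only sketch. The variable names unicode_arr and bytes_arr are mine, and it assumes the strings are ASCII-only, so that the character count equals the byte count.

```python
import numpy as np

data = ['this', 'is', 'a', 'sentence']

# With no dtype given, NumPy infers a fixed-width Unicode dtype (kind 'U')
# from a list of Python str. This is the dtype h5py rejected in the traceback.
unicode_arr = np.array(data)

# Asking for 'S<n>' gives fixed-width byte strings (kind 'S') instead.
# n is the length of the longest word; this assumes ASCII-only text.
longest_word = max(len(word) for word in data)
bytes_arr = np.array(data, dtype=f'S{longest_word}')

print(unicode_arr.dtype.kind, unicode_arr.dtype.itemsize)
print(bytes_arr.dtype.kind, bytes_arr.dtype.itemsize)
```

The Unicode array stores 4 bytes per character, while the 'S' array stores 1 byte per character, so the byte-string version is also smaller in memory.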

Also, I highly recommend you use Python's with/as context manager to open the file. This avoids situations where the file is accidentally left open due to an unexpected exit (a crash or logic error).

The code below shows how to convert your list of strings to a np.array() of byte strings and write it in one call:

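import h5py
import numpy as np

# Build the same 40,000-word test list used in the question.
sentence=['this','is','a','sentence']
data = []

for i in range(10_000):
    data += sentence
print(len(data))

# The fixed-width 'S#' dtype needs the length of the longest string.
longest_word=len(max(data, key=len))
print('longest_word=',longest_word)

# Variable-length string dtype for the dataset stored in the file.
dt = h5py.special_dtype(vlen=str)

# Convert the list to a byte-string array (not NumPy's default Unicode dtype).
arr = np.array(data,dtype='S'+str(longest_word))
# The context manager closes the file even if an error occurs.
with h5py.File('outfile.h5','w') as h5File:
    # Write the whole array in a single call instead of looping.
    dset = h5File.create_dataset('words',data=arr,dtype=dt)
    print(dset.shape, dset.dtype)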
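
To check that a one-shot write round-trips, the data can be read back and compared with the original list. This is a rough sketch, not part of the original answer: the temporary path is a hypothetical scratch location, and the decode step is there because, as far as I know, the element type returned for variable-length strings (str or bytes) depends on the h5py version.

```python
import os
import tempfile

import h5py
import numpy as np

data = ['this', 'is', 'a', 'sentence'] * 10_000

# Hypothetical scratch location so the sketch does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), 'outfile.h5')

dt = h5py.special_dtype(vlen=str)
longest_word = max(len(word) for word in data)
arr = np.array(data, dtype=f'S{longest_word}')

# Write everything in one call, as in the answer's code.
with h5py.File(path, 'w') as h5file:
    h5file.create_dataset('words', data=arr, dtype=dt)

# Read the whole dataset back into memory.
with h5py.File(path, 'r') as h5file:
    raw = h5file['words'][:]

# Elements may come back as str or bytes; normalize to str either way.
words = [w.decode('utf-8') if isinstance(w, bytes) else w for w in raw]

print(len(words), words == data)
```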

I think the significant change is that you created a `S8`, bytestring array, whereas the OP's attempt using `data=data` was implicitly trying to save `data=np.array(data)`, resulting in a 'U8' array. `h5py` can't (at least for now) convert `numpy` unicode dtype to `vlen` strings. — hpaulj, Jul 23 '21 at 16:01
Ahhh...I started from the working example at the end, and just looked at the problem in the non-working example. Yes, the error is due to the 'auto-magic' internal conversion of his list of strings to a numpy array of unsupported Unicode data. The h5py limitation comes from HDF5. It doesn't support the wide characters. So, it applies to all string data -- both fixed and variable length -- the dtype has to be `"S#"` and not `" — kcw78, Jul 23 '21 at 16:21
Ah, this is great @kcw78!!!! Thank you!!! This is great! And thank you for the clear explanation of why the original code was failing. That was super helpful. — Matt Bellis, Jul 23 '21 at 18:09
