
I'm trying to set up and write to an HDF5 dataset using h5py (Python 3) that contains a one-dimensional array of compound objects. Each compound object is made up of three variable-length string properties.

    import h5py
    import numpy as np

    with h5py.File("myfile.hdf5", "a") as file:
        dt = np.dtype([
            ("label", h5py.string_dtype(encoding='utf-8')),
            ("name", h5py.string_dtype(encoding='utf-8')),
            ("id", h5py.string_dtype(encoding='utf-8'))])
        dset = file.require_dataset("initial_data", (50000,), dtype=dt)
        dset[0, "label"] = "foo"  # <-- this line raises

When I run the example above, the last line of code causes h5py (or more accurately numpy) to throw an error saying:

"Cannot change data-type for object array."

Do I understand correctly that the type for "foo" is not h5py.string_dtype(encoding='utf-8')?

How come? And how can I fix this?

UPDATE 1: Stepping into the stack trace, I can see that the error is thrown from an internal numpy function called `_view_is_safe(oldtype, newtype)`. In my case `oldtype` is `dtype('O')` but `newtype` is `dtype([('label', 'O')])`, which causes the error to be thrown.
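
To illustrate, here is a minimal numpy-only sketch of what appears to happen (an assumption based on the stack trace: the field assignment internally views the object array with the structured dtype):

    import numpy as np
    import h5py

    # h5py's variable-length string dtype is numpy's object dtype ('O')
    # with extra metadata attached:
    print(h5py.string_dtype(encoding='utf-8'))  # object

    # Viewing a plain object array with a structured dtype trips numpy's
    # internal _view_is_safe check, reproducing the error from the question:
    arr = np.empty(3, dtype=object)
    arr.view(np.dtype([('label', object)]))  # TypeError: Cannot change data-type for object array.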

UPDATE 2: My question has been answered successfully below but for completeness I'm linking to a GH issue that might be related: https://github.com/h5py/h5py/issues/1921

    "foo" is a Python string (unicode). So it will require some conversion, which presumably is what the error, with **traceback**, is telling us. Writing strings to `dataset` is an evolving feature, so I'm not familiar with the current details. I'd have to got to the docs, and work from their examples toward your case. – hpaulj Jul 18 '21 at 15:50

1 Answer


You've defined the dtype as a compound of three variable-length strings, so each element has to be assigned as a complete tuple. By setting only the label field, the other two fields are left unset, so they are not string types.

Example:

import h5py
import numpy as np

with h5py.File("myfile.hdf5", "a") as file:
    dt = np.dtype([
        ("label", h5py.string_dtype(encoding='utf-8')),
        ("name", h5py.string_dtype(encoding='utf-8')),
        ("id", h5py.string_dtype(encoding='utf-8'))])
    dset = file.require_dataset("initial_data", (50000,), dtype=dt)

    # Add a row of data with a tuple:
    dset[0] = "foo", "bar", "baz"

    # Add another row of data with a np recarray (1 row):
    npdt = np.dtype([
        ("label", 'S4'),
        ("name", 'S4'),
        ("id", 'S4')])
    dset[1] = np.array(("foo1", "bar1", "baz1"), dtype=npdt)

    # Add 3 rows of data with a np recarray (3 rows built from a list of arrays):
    s1 = np.array(("A", "B", "C"), dtype='S4')
    s2 = np.array(("a", "b", "c"), dtype='S4')
    s3 = np.array(("X", "Y", "Z"), dtype='S4')
    recarr = np.rec.fromarrays([s1, s2, s3], dtype=npdt)
    dset[2:5] = recarr

Result #1: (image of the output)

Result using all 3 methods: (image of the final dataset contents)
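
To address the follow-up asked in the comments below (setting one property without rewriting the whole object): since assigning `dset[0, "label"]` directly is exactly what raised the error, a read-modify-write of the full compound element is a workaround. A minimal sketch, assuming the dataset created above:

    import h5py

    with h5py.File("myfile.hdf5", "a") as file:
        dset = file["initial_data"]

        # Read the whole compound element, change one field in memory,
        # then write the whole element back:
        row = dset[0]
        row["label"] = "foo2"
        dset[0] = row

        # Read back a slice to verify the writes above:
        print(dset[0:5])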

  • Thank you. That works. Follow-up question, if I may: is a tuple the right way to define my compound object here? Or is there a different way that might enable me to set specific properties w/o having to set the entire object? – urig Jul 18 '21 at 16:17
  • @urig I'm interested in this myself, so I'll look into it. That said, in this case `None` is a valid type, so you can absolutely use something like `dset[0] = "foo", None, "baz"` to partially fill the set. The basic idea is that you're telling `h5py` to expect a 3-element tuple as the input, and anything besides a 3-element tuple breaks the `dtype` rule being set. – Abstract Jul 18 '21 at 16:20
  • @urig, @Abstract, you can also add data with numpy recarrays (which have several creation methods). "Best approach" really depends on your starting data structure. In other words, pick the simplest coding path to success. :-) Rather than add a new answer, I extended Abstract's example to show 2 more methods. Let me know if you prefer it in a new answer. Also, I deleted the duplicate `h5py.File()` entries. – kcw78 Jul 18 '21 at 21:19
  • @kcw78 thank you. It's also very kind of you to augment the accepted answer. My use case is mapping from a python object to a dataset compound object entry. Would you agree that of the options you and @Abstract have shown, `dset[0] = "foo", "bar", "baz"` is the best fit? – urig Jul 19 '21 at 07:03
  • Note: this procedure is "complicated" because you have compound data with variable length strings. It is simpler when you have compound data with "typical" Python/Numpy types (ints, floats, fixed length strings). – kcw78 Jul 19 '21 at 13:09
  • Regarding your question, I'm inclined to say yes based on your data (1d array of compound objects, each with 3 variable length strings). **However**, loading data 1 row at a time is the slowest possible way to do it. Performance might be acceptable for 50_000 rows. Check out this answer that shows I/O performance decreases as the size of the written data block gets smaller (and # of write calls increases): [Pytables writes faster than h5py](https://stackoverflow.com/a/57963340/10462884). (See the batched-write sketch after these comments.) – kcw78 Jul 19 '21 at 13:11
  • @kcw78 tx again. My use case is similar to a regular db so inserts are one at a time. Luckily the rate of writes will not be high and reads will not be one at a time. I looked at PyTables but they do not support variable length strings as far as I could tell...? – urig Jul 19 '21 at 13:27
  • Sorry for the confusion. I'm not suggesting you use PyTables. That post was a question about PyTables vs h5py performance. I suspected the root cause was write block size (the user was writing 64 rows at a time). So I studied speed vs write block performance, and created the graph. I don't know if PyTables has variable length strings. I don't use them in my code. I only work with them when answering SO questions. :-) – kcw78 Jul 19 '21 at 14:08
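
Building on the performance point in the last comments, here is a hedged sketch of batching writes (the `BATCH` size and row contents are invented for illustration): build a block of rows in memory, then write it with a single slice assignment instead of one write call per row.

    import numpy as np
    import h5py

    BATCH = 1000  # hypothetical batch size; tune for your workload

    with h5py.File("myfile.hdf5", "a") as file:
        dt = np.dtype([
            ("label", h5py.string_dtype(encoding='utf-8')),
            ("name", h5py.string_dtype(encoding='utf-8')),
            ("id", h5py.string_dtype(encoding='utf-8'))])
        dset = file.require_dataset("initial_data", (50000,), dtype=dt)

        # Build one block of rows in memory...
        rows = [("label%d" % i, "name%d" % i, "id%d" % i) for i in range(BATCH)]
        block = np.array(rows, dtype=dt)

        # ...and write it with a single call:
        dset[0:BATCH] = block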