
I'm attempting the following:

    import numpy as np
    import h5py

    f = h5py.File('data.h5', 'w')   # example file handle (name not in the original post)
    N = 10                          # example number of records

    SPECIAL_TYPE = np.dtype([("arr", h5py.special_dtype(vlen=np.uint8)),
                                 ("int1", np.uint8),
                                 ("str", h5py.special_dtype(vlen=str)),
                                 ("int2", np.uint8),
                                 ("int3", np.uint8),
                                 ("list", h5py.special_dtype(vlen=np.uint8)),
                                 ("int4", np.uint8)])
    db = f.create_dataset("db", (1,1), chunks=True, maxshape=(None, 1), dtype=SPECIAL_TYPE)

    db.resize((N,1))

    for i in range(N):
        arr = np.zeros((3,3), dtype=np.uint8)

        db[i] = (arr,i, 'a', i, i, [0,1,2,3,4,5,6,7,8,9,10,11], i)

The code above fails for me because of the multidimensional array and the list elements of the tuple.

At best, it seems only the first row of the array gets stored (I can't seem to fix this), and an error is thrown when trying to store the list element of the tuple.

Is there something I'm missing that would allow storing a list of tuples in this manner to work?

NOTE: I've come across these discussions:

1) https://github.com/h5py/h5py/issues/876

2) Inexplicable behavior when using vlen with h5py

and suspect that it's not possible to directly store the tuple as I would like (mainly due to the vlen potentially only working with 1-D arrays?).

Forgive any ignorance in this question...I'm a novice at HDF5.

Thanks!

John Cast
  • Why isn't the `list` field `vlen`? I haven't played with `vlen` much except for the linked answer (and maybe 1 or 2 newer), but I think the resulting dtype has to be compatible with both numpy and hdf5. – hpaulj Jan 12 '18 at 12:49
  • You're correct. I originally had that and I've updated the code. The code you caught was just a product of my desperate debugging :P – John Cast Jan 12 '18 at 17:17

1 Answer


With your dtype I can create an array (`_` here is a tuple like the one assigned in your loop):

In [37]: np.array([_],dtype=SPECIAL_TYPE)
Out[37]: 
array([ (array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]], dtype=uint8), 1, 'a', 1, 1, list([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]), 1)],
      dtype=[('arr', 'O'), ('int1', 'u1'), ('str', 'O'), ('int2', 'u1'), ('int3', 'u1'), ('list', 'O'), ('int4', 'u1')])

But trying to create a dataset with it, even a 1d one, dumps me out of the interpreter:

In [38]: f=h5py.File('vlentest.h5','w')
In [39]: db = f.create_dataset('db',(10,), dtype=SPECIAL_TYPE)
In [40]: db[:]
Segmentation fault (core dumped)

There are two issues: does vlen work in a 2d array, and does it work in a compound dtype? You are pushing the limits by combining multiple vlen fields in a compound dtype in a 2d dataset.

Have you seen documentation or examples using vlen in a compound dtype?

Notice how h5py implements vlen in numpy: it defines those fields as 'O' (object) dtype. That stores a pointer in the array, not the variable-length object itself. Normally object dtype arrays cannot be saved with h5py, but these fields must have some added annotation that h5py uses to translate the pointer into the kind of structure that HDF5 accepts.
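
A quick way to see that annotation (a sketch, assuming the 2018-era `special_dtype`/`check_dtype` API):

    import numpy as np
    import h5py

    # special_dtype returns a plain numpy object ('O') dtype...
    dt = h5py.special_dtype(vlen=np.uint8)
    print(dt)                          # object

    # ...but it carries h5py-specific metadata recording the base type,
    # which h5py consults when mapping the field to an HDF5 vlen type
    print(h5py.check_dtype(vlen=dt))   # should report np.uint8

    # the annotation should survive as a field of a compound dtype
    cdt = np.dtype([("arr", dt), ("int1", np.uint8)])
    print(h5py.check_dtype(vlen=cdt.fields["arr"][0]))   # np.uint8 again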

Storing string datasets in hdf5 with unicode explores how a vlen str is stored.

Storing multidimensional variable length array with h5py


Experimenting, starting with something small: `dt1` has a single vlen str field, and `dt2` adds a plain int field.

In [14]: f = h5py.File('temp.h5')

In [15]: db1 = f.create_dataset('db1',(5,), dtype=dt1)
In [16]: db2 = f.create_dataset('db2',(5,), dtype=dt2)
In [17]: db1[:]
Out[17]: 
array([('',), ('',), ('',), ('',), ('',)],
      dtype=[('str', 'O')])
In [18]: db2[:]
Out[18]: 
array([('', 0), ('', 0), ('', 0), ('', 0), ('', 0)],
      dtype=[('str', 'O'), ('int4', '<i4')])

Setting some db1 values:

In [24]: db1[0]=('a',)
In [25]: db1[1]=('ab',)
In [26]: db1[:]
Out[26]: 
array([('a',), ('ab',), ('',), ('',), ('',)],
      dtype=[('str', 'O')])

db2 works the same way:

In [30]: db2[0]=('abc',10)
In [31]: db2[1]=('abcde',6)
In [32]: db2[:]
Out[32]: 
array([('abc', 10), ('abcde',  6), ('',  0), ('',  0), ('',  0)],
      dtype=[('str', 'O'), ('int4', '<i4')])

Two vlen strings also work:

In [34]: dt3 = np.dtype([("str1", h5py.special_dtype(vlen=str)),("str2", h5py.special_dtype(vlen=str))])

In [35]: db3 = f.create_dataset('db3',(3,), dtype=dt3)
In [36]: db3[:]
Out[36]: 
array([('', ''), ('', ''), ('', '')],
      dtype=[('str1', 'O'), ('str2', 'O')])
In [37]: db3[0] = ('abc','defg')
In [38]: db3[1] = ('abcd','de')
In [39]: db3[:]
Out[39]: 
array([('abc', 'defg'), ('abcd', 'de'), ('', '')],
      dtype=[('str1', 'O'), ('str2', 'O')])

and with an array vlen:

In [41]: dt4 = np.dtype([("str1", h5py.special_dtype(vlen=str)),("list", h5py.special_dtype(vlen=np.int))])
In [42]: dt4
Out[42]: dtype([('str1', 'O'), ('list', 'O')])
In [43]: db4 = f.create_dataset('db4',(3,), dtype=dt4)

In [47]: db4[0]=('abcdef',np.arange(5))
In [48]: db4[1]=('abc',np.arange(3))
In [49]: db4[:]
Out[49]: 
array([('abcdef', array([0, 1, 2, 3, 4])), ('abc', array([0, 1, 2])),
       ('', array([], dtype=int32))],
      dtype=[('str1', 'O'), ('list', 'O')])

but I can't use a list:

In [50]: db4[2]=('abc',[1,2,3,4])
--------------------------------------------------------------------------
AttributeError: 'list' object has no attribute 'dtype'

h5py saves arrays, not lists, and apparently that applies to these nested values as well. http://docs.h5py.org/en/latest/special.html has examples of setting a vlen field with a list, but the list is converted to an array first.
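
So converting the list before assigning should work (continuing the `db4` example above):

    # wrapping the list in an array matches what already worked with np.arange
    db4[2] = ('abc', np.array([1, 2, 3, 4]))

The same conversion would apply to the `list` element of the question's tuple.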

If I try to save a 2d array, it only writes a 1d one:

In [59]: db4[2]=('abc',np.ones((2,2),int))
In [60]: db4[:]
Out[60]: 
array([('abcdef', array([0, 1, 2, 3, 4])), ('abc', array([0, 1, 2])),
       ('abc', array([1, 1]))],
      dtype=[('str1', 'O'), ('list', 'O')])
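
If the 2d shape matters (as with the (3,3) array in the question), one workaround, not from the original post, is to flatten before writing and keep the shape in separate fields so the array can be rebuilt on read. A sketch continuing this session (the names `dt5`/`db5` are hypothetical; vlen int is used for the data field because that combination is shown working above with `dt4`):

    # shape bookkeeping alongside a flattened vlen field
    dt5 = np.dtype([("rows", "i4"),
                    ("cols", "i4"),
                    ("data", h5py.special_dtype(vlen=np.int))])
    db5 = f.create_dataset('db5', (3,), dtype=dt5)

    a = np.arange(9).reshape(3, 3)
    db5[0] = (a.shape[0], a.shape[1], a.ravel())    # store flattened data + shape

    rec = db5[0]
    restored = rec['data'].reshape(rec['rows'], rec['cols'])   # recover the 2d array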

Back to the crash: this dtype works:

In [21]: dt1 = np.dtype([("str1", h5py.special_dtype(vlen=str)),('f1',int),("list", h5py.special_dtype(vlen=np.int))])

This one does the core dump:

In [30]: dt1 = np.dtype([("f0", h5py.special_dtype(vlen=np.uint8)),('f1',int),("f2", h5py.special_dtype(vlen=np.int))])

Is this a vlen uint8 problem, or a problem with a vlen field coming first?
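
A pair of dtypes that could separate the two possibilities (hypothetical, not tested in the original post), varying only the base type and the position of the vlen field:

    # same layout as the crashing dtype, but vlen int instead of vlen uint8
    dt_a = np.dtype([("f0", h5py.special_dtype(vlen=np.int)),
                     ("f1", int),
                     ("f2", h5py.special_dtype(vlen=np.int))])

    # vlen uint8 kept, but moved out of the first position
    dt_b = np.dtype([("f1", int),
                     ("f0", h5py.special_dtype(vlen=np.uint8)),
                     ("f2", h5py.special_dtype(vlen=np.int))])

    # creating and reading back a small dataset with each would show whether
    # the crash follows the uint8 base type or the field order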

hpaulj
  • Thanks for the hard work! I've performed similar tinkering and found the same results. Upvoting for your thorough efforts! – John Cast Jan 30 '18 at 05:20