
I am using h5py to build a dataset. Since I want to store arrays whose size differs along one dimension, I use the h5py special dtype vlen. However, I am seeing behavior I can't explain; maybe you can help me understand what is happening:

>>>> import h5py
>>>> import numpy as np
>>>> fp = h5py.File(datasource_fname, mode='w') 
>>>> dt = h5py.special_dtype(vlen=np.dtype('float32'))
>>>> train_targets = fp.create_dataset('target_sequence', shape=(9549, 5,), dtype=dt)
>>>> test
Out[130]: 
array([[ 0.,  1.,  1.,  1.,  0.,  1.,  1.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.]])
>>>> train_targets[0] = test
>>>> train_targets[0]
Out[138]: 
array([ array([ 0.,  1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.], dtype=float32),
        array([ 1.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.], dtype=float32),
        array([ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.], dtype=float32),
        array([ 0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.], dtype=float32),
        array([ 0.,  1.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.], dtype=float32)], dtype=object)

I do expect train_targets[0] to have this shape, but I can't recognize the rows of my array in it. They seem to be completely jumbled, although the result is consistent: every time I run the code above, train_targets[0] looks the same.

To clarify: the first element in my train_targets, in this case test, has shape (5,11), but the second element might have shape (5,38), which is why I use vlen.

Thank you for your help

Mat

Mathew

1 Answer


I think

train_targets[0] = test

has stored your (5,11) array, read in F order, in a row of train_targets. According to the (9549,5) shape, that's a row of 5 elements. And since it is vlen, each element is a 1d array of length 11.

That's what you get back in train_targets[0] - an array of 5 arrays, each shape (11,), with values taken from test (order F).

So I think there are 2 issues - what a 2d shape means, and what vlen allows.


My version of h5py is pre v2.3, so I only get string vlen. But I suspect your problem may be that vlen only works with 1d arrays, an extension, so to speak, of byte strings.

Does the 5 in shape=(9549, 5,) have anything to do with 5 in the test.shape? I don't think it does, at least not as numpy and h5py see it.

When I make a file following the string vlen example:

>>> f = h5py.File('foo.hdf5', 'w')  # explicit write mode; recent h5py versions default to read-only
>>> dt = h5py.special_dtype(vlen=str)
>>> ds = f.create_dataset('VLDS', (100,100), dtype=dt)

and then do:

ds[0]='this one string'

and look at ds[0], I get an object array with 100 elements, each being this string. That is, I've set a whole row of ds.

ds[0,0]='another'

is the correct way to set just one element.

vlen is 'variable length', not 'variable shape'. While the https://www.hdfgroup.org/HDF5/doc/TechNotes/VLTypes.html documentation is not entirely clear on this, I think you can store 1d arrays with shape (11,) and (38,) with vlen, but not 2d ones.
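As a minimal sketch of that idea, the following stores each 1d row separately in its own vlen element instead of assigning a whole 2d array at once. The file name, the small (2, 5) dataset shape, and the arange test data are placeholders chosen for illustration, not taken from the question; the in-memory core driver is used only so that nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store); the file name is a placeholder.
with h5py.File("vlen_sketch.h5", "w", driver="core", backing_store=False) as fp:
    dt = h5py.special_dtype(vlen=np.dtype("float32"))
    # Hypothetical small shape: 2 samples, each holding 5 variable-length vectors.
    ds = fp.create_dataset("target_sequence", shape=(2, 5), dtype=dt)

    # Illustrative data: one sample with rows of length 11, one with rows of length 38.
    first = np.arange(5 * 11, dtype="float32").reshape(5, 11)
    second = np.arange(5 * 38, dtype="float32").reshape(5, 38)

    # Assign one 1d vector per dataset element, never a whole 2d array.
    for sample_idx, sample in enumerate((first, second)):
        for row_idx, row in enumerate(sample):
            ds[sample_idx, row_idx] = row

    # Read the vectors back and reassemble each sample as a 2d array.
    read_first = np.stack([ds[0, i] for i in range(5)])
    read_second = np.stack([ds[1, i] for i in range(5)])
```

Under these assumptions, read_first and read_second should come back with shapes (5, 11) and (5, 38) and the same values that were written, since each element only ever holds a 1d array.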


Actually, train_targets output is reproduced with:

In [54]: test1=np.empty((5,),dtype=object)
In [55]: for i in range(5):
    test1[i]=test.T.flatten()[i:i+11]

Each sub-array is 11 values taken from the flattened transpose (F order), with the starting position shifted by one element for each successive sub-array.
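This can be checked with NumPy alone, without h5py. The sketch below rebuilds test from the question, takes the five shifted windows of its F-order flattening, and compares them with the train_targets[0] output quoted in the question. The names flat_f, windows, observed and matches are introduced here only for illustration.

```python
import numpy as np

# The array assigned in the question, shape (5, 11).
test = np.array([
    [0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype="float32")

# Column-major flattening; same values as test.T.flatten().
flat_f = test.flatten(order="F")

# Five windows of length 11, each starting one element later than the previous one.
windows = [flat_f[i:i + 11] for i in range(5)]

# The train_targets[0] output quoted in the question.
observed = np.array([
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
], dtype="float32")

# True when every window equals the corresponding observed sub-array.
matches = all(np.array_equal(w, o) for w, o in zip(windows, observed))
```

Here matches is True, so the jumbled rows are windows into the column-major data rather than random values.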

hpaulj
  • Thx for the explanation @hpaulj. The solution lies in that one needs to explicitly set each vector, not the whole matrix. 2d indeed does not work. – Mathew Jun 09 '15 at 13:28