
I'm trying to store a list of variable length arrays in an HDF file with the following procedure:

phn_mfccs = []

# Import wav files
for waveform in files:
    phn_mfcc = mfcc(waveform) # produces a variable length multidim array of the shape (x, 13, 1)              

    # Add MFCC and label to dataset
    # phn_mfccs has dimension (len(files),)
    # phn_mfccs[i] has variable dimension ([# of frames in ith segment] (variable), 13, 1)
    phn_mfccs.append(phn_mfcc) 

dt = h5py.special_dtype(vlen=np.dtype('float64'))
mfccs_out.create_dataset('phn_mfccs', data=phn_mfccs, dtype=dt)

My datatypes don't seem to be working out, though: instead of each element of the mfccs_out dataset containing a multidimensional array, it contains just a 1D array. For example, if the first phn_mfcc I append originally has dimension (59,13,1), mfccs_out['phn_mfccs'][0] has dimension (59,). I suspect this is because I'm using just a float64 datatype, and I need something else for an array of arrays. If I don't specify the dtype, or try to use dtype='O', though, it spits out an error like "Object dtype 'O' has no native HDF equivalent."

Ideally, what I'd like is for mfccs_out['phn_mfccs'][i] to contain the ith phn_mfcc that I appended to the list phn_mfccs.

Jess

1 Answer


The essence of your code is:

phn_mfccs = []
<loop several layers>
    phn_mfcc = <some sort of array expanded by one dimension>
    phn_mfccs.append(phn_mfcc) 

At the end of the loops, phn_mfccs is a list of arrays. I can't tell from the code what the dtype and shape of each array are, or whether they differ between elements of the list.

I'm not entirely sure what create_dataset does when given a list of arrays. It may wrap it in np.array.

mfccs_out.create_dataset('phn_mfccs', data=phn_mfccs, dtype=dt)

What does np.array(phn_mfccs) produce? Shape, dtype? If all the elements are arrays of the same shape and dtype it will produce a higher dimensional array. If they differ in shape, it will produce a 1d array with object dtype. Given the error message, I suspect the latter.
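To make the two cases concrete, here is a minimal sketch using zero-filled stand-in arrays (the shapes are made up, apart from the (59, 13, 1) mentioned in the question). For the ragged case it builds the object array explicitly, because recent NumPy versions refuse to create ragged arrays implicitly unless dtype=object is requested.

```python
import numpy as np

# Case 1: every element has the same shape, so np.array stacks them
# into one higher-dimensional float array.
same = [np.zeros((5, 13, 1)), np.zeros((5, 13, 1))]
stacked = np.array(same)
print(stacked.shape, stacked.dtype)

# Case 2: the elements differ in their first dimension. Build the 1d
# object array by hand so the result does not depend on how a given
# NumPy version handles ragged input.
ragged = [np.zeros((59, 13, 1)), np.zeros((42, 13, 1))]
obj = np.empty(len(ragged), dtype=object)
for i, a in enumerate(ragged):
    obj[i] = a
print(obj.shape, obj.dtype)
```

The second array is the kind of input that leads to the "Object dtype 'O' has no native HDF equivalent" error when it is passed to create_dataset without a vlen dtype.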

I've answered a few vlen questions but haven't worked with it a lot.

http://docs.h5py.org/en/latest/special.html

I vaguely recall that the 'ragged' dimension of a h5 array can only be 1d. So a phn_mfccs object array that contains 1d float arrays of varying dimensions might work.

I might come up with a simple example. I also suggest you construct a simpler problem that we can copy-n-paste and experiment with. We don't need to know how you read the data from your directory. We just need to understand the content of the array (list) that you are trying to write.

A 2015 post on vlen arrays

Inexplicable behavior when using vlen with h5py

H5PY - How to store many 2D arrays of different dimensions

1d ragged arrays example

In [24]: f = h5py.File('vlen.h5','w')
In [25]: dt = h5py.special_dtype(vlen=np.dtype('float64'))
In [26]: dataset = f.create_dataset('vlen',(4,), dtype=dt)
In [27]: dataset[:]   # dataset.value in older h5py; .value was removed in h5py 3.0
Out[27]: 
array([array([], dtype=float64), array([], dtype=float64),
       array([], dtype=float64), array([], dtype=float64)], dtype=object)
In [28]: for i in range(4):
    ...:     dataset[i]=np.arange(i+3)

In [29]: dataset[:]   # dataset.value in older h5py; .value was removed in h5py 3.0
Out[29]: 
array([array([ 0.,  1.,  2.]), array([ 0.,  1.,  2.,  3.]),
       array([ 0.,  1.,  2.,  3.,  4.]),
       array([ 0.,  1.,  2.,  3.,  4.,  5.])], dtype=object)

If I try to write 2d arrays to the dataset, I get an error:

OSError: Can't prepare for writing data (Src and dest data spaces have different sizes)

The dataset itself may be multidimensional, but the vlen object has to be a 1d array of floats.
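One possible workaround, given that restriction, is to flatten each array to 1d before writing it and to keep the original shapes in a companion dataset so the arrays can be reshaped on read. The sketch below is only an illustration under stated assumptions: the random arrays stand in for the real MFCC output, and the dataset names phn_mfccs_flat and phn_mfccs_shapes, along with the temporary file path, are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in data: random arrays shaped (frames, 13, 1), as in the question.
rng = np.random.default_rng(0)
arrays = [rng.random((n, 13, 1)) for n in (59, 42, 7)]

path = os.path.join(tempfile.mkdtemp(), "mfccs.h5")  # hypothetical file
dt = h5py.special_dtype(vlen=np.dtype("float64"))

with h5py.File(path, "w") as f:
    # One ragged 1d float array per element.
    flat = f.create_dataset("phn_mfccs_flat", (len(arrays),), dtype=dt)
    for i, a in enumerate(arrays):
        flat[i] = a.ravel()
    # Companion dataset holding each array's original shape.
    f.create_dataset("phn_mfccs_shapes",
                     data=np.array([a.shape for a in arrays]))

with h5py.File(path, "r") as f:
    flat = f["phn_mfccs_flat"]
    shapes = f["phn_mfccs_shapes"][:]
    # Reshape each flattened array back to its stored shape.
    restored = [flat[i].reshape(tuple(int(s) for s in shapes[i]))
                for i in range(flat.shape[0])]

print([r.shape for r in restored])
```

Because the trailing dimensions here are always (13, 1), the shapes dataset could be dropped in favour of reshape(-1, 13, 1) on read; storing the shapes just keeps the sketch general.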

hpaulj
  • Thanks for cleaning up the code (I'll edit the details since, as you mentioned, there's a lot of extraneous stuff there). `np.array(phn_mfccs)` does indeed produce an array of dtype 'O' with dimension ([# of times we append],). Each of the `phn_mfcc` elements has a different dimension. Is there any way to store something like this with HDF? If I try to specify explicitly in the arguments of `create_dataset` that I want a dataset with dtype='O', it throws the error I mentioned above. – Jess Mar 07 '17 at 21:58
  • The outer array, `phn_mfccs`, can be object dtype, but I think the objects themselves need to be 1d arrays. I'll experiment. – hpaulj Mar 07 '17 at 22:05
  • Is it conclusively hopeless, then? Would you recommend another library to deal with this particular dataset? – Jess Mar 07 '17 at 22:29
  • I could try flattening the array, and storing some sort of shapes information in an attribute or other dataset. – hpaulj Mar 07 '17 at 23:34