
I am trying to store about 3000 numpy arrays in the HDF5 data format. Each array holds np.float64 values, and the lengths vary from 5306 to 121999 elements.

I am getting the error `Object dtype dtype('O') has no native HDF5 equivalent`. Because the arrays have different lengths, numpy falls back to the generic object dtype.

My idea was to pad all the arrays to length 121999 and store the original sizes in another dataset.

However, this seems quite inefficient in space. Is there a better way?
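For reference, the padding idea described above could look roughly like the following. This is only a sketch on three tiny stand-in arrays; the file name `padded.hdf5`, the dataset names `padded` and `lengths`, and the choice of NaN as the fill value are assumptions for illustration, not part of the original setup.

```python
import os
import tempfile

import h5py
import numpy as np

# Three small stand-ins for the ~3000 real arrays.
arrays = [
    np.array([0.1, 0.2, 0.3], dtype=np.float64),
    np.array([0.1, 0.2, 0.3, 0.4, 0.5], dtype=np.float64),
    np.array([0.1, 0.2], dtype=np.float64),
]

# Record the true lengths, then pad every row to the longest length with NaN.
lengths = np.array([len(a) for a in arrays], dtype=np.int64)
padded = np.full((len(arrays), lengths.max()), np.nan, dtype=np.float64)
for row, a in zip(padded, arrays):
    row[: len(a)] = a

# Hypothetical file and dataset names, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "padded.hdf5")
with h5py.File(path, "w") as f:
    f.create_dataset("padded", data=padded)
    f.create_dataset("lengths", data=lengths)

# Read back and trim each row to its recorded length.
with h5py.File(path, "r") as f:
    stored = f["padded"][:]
    n = f["lengths"][:]
    restored = [stored[i, : n[i]] for i in range(len(n))]

# Fraction of the padded block that is filler rather than data.
wasted_fraction = 1.0 - lengths.sum() / padded.size
```

The `wasted_fraction` value makes the space concern concrete: every slot beyond an array's true length is stored as filler.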

EDIT: To clarify, I want to store 3126 arrays of dtype = np.float64. I have them stored in a list, and when h5py processes the list it converts it to an array of dtype = object because the arrays have different lengths. To illustrate:

import numpy as np

a = np.array([0.1,0.2,0.3],dtype=np.float64)
b = np.array([0.1,0.2,0.3,0.4,0.5],dtype=np.float64)
c = np.array([0.1,0.2],dtype=np.float64)

# This conversion is performed inside the h5py call.
# NumPy 1.24+ needs the explicit dtype=object for ragged input.
arrs = np.array([a,b,c],dtype=object)
print(arrs.dtype)
>>> object
print(arrs[0].dtype)
>>> float64
  • Are you trying to save one array with 3000 subarrays (with dtype object), or 3000 arrays, each with dtype float? Give a small example with 2 or 3 arrays. – hpaulj May 13 '16 at 16:46
  • I clarified it in the Edit – Jose Javier Gonzalez Ortiz May 13 '16 at 18:35
  • `arrs` is an object array, which `h5py` can't save. You have to save `a`, `b`, `c` as separate datasets. Those datasets will be members of a group, and you may be able to use a dictionary interface with groups. – hpaulj May 13 '16 at 19:05

2 Answers


Looks like you tried something like:

In [364]: f=h5py.File('test.hdf5','w')    
In [365]: grp=f.create_group('alist')

In [366]: grp.create_dataset('alist',data=[a,b,c])
...
TypeError: Object dtype dtype('O') has no native HDF5 equivalent

But if you instead save the arrays as separate datasets, it works:

In [367]: adict=dict(a=a,b=b,c=c)

In [368]: for k,v in adict.items():
   .....:     grp.create_dataset(k,data=v)
   .....:     

In [369]: grp
Out[369]: <HDF5 group "/alist" (3 members)>

In [370]: grp['a'][:]
Out[370]: array([ 0.1,  0.2,  0.3])

To access all the datasets in the group:

In [389]: [i[:] for i in grp.values()]
Out[389]: 
[array([ 0.1,  0.2,  0.3]),
 array([ 0.1,  0.2,  0.3,  0.4,  0.5]),
 array([ 0.1,  0.2])]
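For a few thousand arrays held in a list, the same one-dataset-per-array approach might be sketched as below. The zero-padded index names (`0000`, `0001`, ...) and the file name `test_many.hdf5` are assumptions chosen for illustration; any unique names would do.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-ins for the real list of float64 arrays.
arrays = [
    np.array([0.1, 0.2, 0.3], dtype=np.float64),
    np.array([0.1, 0.2, 0.3, 0.4, 0.5], dtype=np.float64),
    np.array([0.1, 0.2], dtype=np.float64),
]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "test_many.hdf5")

with h5py.File(path, "w") as f:
    grp = f.create_group("alist")
    for i, arr in enumerate(arrays):
        # Zero-padded names keep lexicographic order equal to list order.
        grp.create_dataset(f"{i:04d}", data=arr)

with h5py.File(path, "r") as f:
    grp = f["alist"]
    # Sort the names explicitly so the list comes back in its original order.
    loaded = [grp[name][:] for name in sorted(grp.keys())]
```

Each array keeps its own length and dtype, so no padding or separate length bookkeeping is needed.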
hpaulj

A clean method is h5py's variable-length (vlen) dtype for the inner arrays: http://docs.h5py.org/en/latest/special.html?highlight=dtype#arbitrary-vlen-data

import h5py
import numpy as np

hdf5_file = h5py.File('yourdataset.hdf5', mode='w')
dt = h5py.special_dtype(vlen=np.dtype('float64'))
hdf5_file.create_dataset('dataset', (3,), dtype=dt)
hdf5_file['dataset'][...] = arrs  # arrs is the object array from the question

print(hdf5_file['dataset'][...])
>>> array([array([0.1,0.2,0.3],dtype=np.float64),
>>>        array([0.1,0.2,0.3,0.4,0.5],dtype=np.float64),
>>>        array([0.1,0.2],dtype=np.float64)], dtype=object)

This only works for 1D arrays; see https://github.com/h5py/h5py/issues/876
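A self-contained round trip with the vlen dtype might look like the sketch below. It fills the dataset one element at a time instead of assigning an object array, which is a deliberate variation on the snippet above; the file name `vlen_demo.hdf5` is an assumption for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-ins for the real list of 1D float64 arrays.
arrays = [
    np.array([0.1, 0.2, 0.3], dtype=np.float64),
    np.array([0.1, 0.2, 0.3, 0.4, 0.5], dtype=np.float64),
    np.array([0.1, 0.2], dtype=np.float64),
]

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "vlen_demo.hdf5")

# Variable-length float64 element type, as in the snippet above.
dt = h5py.special_dtype(vlen=np.dtype("float64"))

with h5py.File(path, "w") as f:
    dset = f.create_dataset("dataset", (len(arrays),), dtype=dt)
    for i, arr in enumerate(arrays):
        dset[i] = arr  # each element stores one 1D array of its own length

with h5py.File(path, "r") as f:
    # Reading returns an object array whose entries are float64 arrays.
    loaded = list(f["dataset"][...])
```

Compared with one dataset per array, this keeps everything in a single dataset and preserves order by index.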

Joshua Lim