
I am using the h5py package to create an HDF5 file for my training set.

I want the first column to have variable length. For example, [1,2,3] as the 1st entry in the column, [1,2,3,4,5] as the 2nd entry, and so on, while the other 5 columns in the same dataset hold int data with a fixed length of 1.

I have tried the following code for this scenario:

dt = h5py.special_dtype(vlen=np.dtype('int32'))
dt1 = np.dtype('int32')  # fixed-length int columns
datatype = np.dtype([('FieldA', dt), ('FieldB', dt1), ('FieldC', dt1), ('FieldD', dt1), ('FieldE', dt1), ('FieldF', dt1)])

But in the output I got only an empty array for each of the columns in this dataset.

And when I tried the code below:

dt = h5py.special_dtype(vlen=np.dtype('int32'))
# db is an open h5py.File handle
data = db.create_dataset("data1", (5000,), dtype=dt)

This gives me only one column with variable-length entries. I want all 6 columns in the same dataset, with the 1st column holding variable-length entries as described above.

I am totally confused about how to solve this scenario. Any help would be highly appreciated.

kcw78
ANIKET SAXENA

2 Answers


Do you want variable length (ragged) columns, or just a column that can hold an array of data (up to the dtype limit)? The second is pretty straightforward. See the code below. (It's a simple example with 2 fields to demonstrate the method.)

import numpy as np
import h5py

my_dt = np.dtype([('FieldA', 'int32', (4,)), ('FieldB', 'int32')])


with h5py.File('SO_57260167.h5', 'w') as h5f:

    data = h5f.create_dataset("testdata", (10,), dtype=my_dt)

    for cnt in range(10):
        arr = np.random.randint(1, 1000, size=4)
        print(arr)
        data[cnt, 'FieldA'] = arr
        data[cnt, 'FieldB'] = arr[0]
        print(data[cnt]['FieldB'])
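
A quick read-back sketch (file and dataset names here are hypothetical) showing that field-name indexing on a compound dataset returns a whole column at once:

```python
import numpy as np
import h5py

# Same compound dtype as above: a fixed (4,) int column plus a scalar int column.
my_dt = np.dtype([('FieldA', 'int32', (4,)), ('FieldB', 'int32')])

with h5py.File('readback_demo.h5', 'w') as h5f:
    data = h5f.create_dataset("testdata", (10,), dtype=my_dt)
    data[0, 'FieldA'] = np.array([10, 20, 30, 40])
    data[0, 'FieldB'] = 10

with h5py.File('readback_demo.h5', 'r') as h5f:
    ds = h5f['testdata']
    a = ds['FieldA']   # shape (10, 4) int32 array -- the whole column
    b = ds['FieldB']   # shape (10,) int32 array
    print(a[0], b[0])
```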

If you want a variable length ("ragged") column, I'm 99% sure you are limited to a single column when using the special dtype in a dataset. Also, I don't think you can name the fields/columns. (I couldn't get it to work, and couldn't find any examples.)
The code below shows the example above modified to put the variable-length column data in dataset vl_data and the rest of the integer data in dataset fx_data.

vl_dt = h5py.special_dtype(vlen=np.dtype('int32'))
my_dt = np.dtype([('FieldB', 'int32'), ('FieldC', 'int32'), ('FieldD', 'int32'), 
                  ('FieldE', 'int32'), ('FieldF', 'int32')])

with h5py.File('SO_57260167_vl.h5', 'w') as h5f:

    vl_data = h5f.create_dataset("testdata_vl", (10,), dtype=vl_dt)
    fx_data = h5f.create_dataset("testdata", (10,), dtype=my_dt)

    for cnt in range(10):
        arr = np.random.randint(1, 1000, size=cnt + 2)
        # print(arr)
        vl_data[cnt] = arr
        print(vl_data[cnt])
        fx_data[cnt, 'FieldB'] = arr[0]
        fx_data[cnt, 'FieldF'] = arr[-1]
        print(fx_data[cnt])
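
Since the ragged column and the fixed columns live in two datasets, one logical "row" is recovered by reading both at the same index. A minimal sketch of the round trip (hypothetical file name, small sizes for brevity):

```python
import numpy as np
import h5py

vl_dt = h5py.special_dtype(vlen=np.dtype('int32'))
fx_dt = np.dtype([('FieldB', 'int32'), ('FieldF', 'int32')])

with h5py.File('vl_readback_demo.h5', 'w') as h5f:
    vl = h5f.create_dataset("testdata_vl", (3,), dtype=vl_dt)
    fx = h5f.create_dataset("testdata", (3,), dtype=fx_dt)
    for cnt in range(3):
        arr = np.arange(1, cnt + 3, dtype='int32')  # ragged lengths 2, 3, 4
        vl[cnt] = arr
        fx[cnt, 'FieldB'] = arr[0]
        fx[cnt, 'FieldF'] = arr[-1]

rows = []
with h5py.File('vl_readback_demo.h5', 'r') as h5f:
    vl, fx = h5f['testdata_vl'], h5f['testdata']
    for cnt in range(3):
        # One logical row = ragged entry + fixed fields at the same index.
        row = (vl[cnt], int(fx[cnt]['FieldB']), int(fx[cnt]['FieldF']))
        rows.append(row)
        print(row)
```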
kcw78
  • Thanks. Yes, you are right that I want only the first column of the dataset to have variable length ("ragged"). For example, 1st entry as [1,2,3], 2nd entry as [1,2,3,4,5] and so on, with another 5 columns. So, in total, I should have 6 columns in the dataset, e.g. I want [[1,2,3],45,22,2,2,1] as one of the entries in the dataset where [1,2,3] corresponds to the 1st column, '45' to the 2nd, '22' to the 3rd, '2' to the 4th, '2' to the 5th, and '1' to the 6th column. The first code you have provided will fail for this type of input ([1,2,3]) because it doesn't have size=4. So, can you help me with this? – ANIKET SAXENA Jul 30 '19 at 06:51
  • Aniket, I don't think you can do what you describe. ( "_I want all 6 columns to be included in the same dataset, 1st column has variable length entries like stated above._" ). I can't find any examples that show how to do it with `h5py` or `pytables`. From my tests, you have to put the ragged array in 1 dataset, and the other (fixed size) data in another dataset. Or, if you know the max size of the largest ragged array, you can dimension the first column to that size, then pad smaller arrays with zeros for missing values (and save the size as a field). – kcw78 Jul 30 '19 at 13:58
  • Thanks for your valuable response. I think you are right that this can't be done via HDF5. But I think this padding with zeros to my input, with a Masking layer in Keras, can be a great solution for this scenario. Anyway, can you please check out my question on padding and masking at: https://stackoverflow.com/questions/49670832/keras-lstm-with-masking-layer-for-variable-length-inputs?rq=1 I think you can provide a solution for this question. Please help. – ANIKET SAXENA Jul 31 '19 at 15:34
  • I think you meant this question: [how-to-deal-with-variable-length-sequences-in-keras-for-mlp](https://stackoverflow.com/questions/57284575/). I read and added some comments. However I am not familiar with masked array usage in Keras. The link in your comment above describes "how to use masks" with Keras. – kcw78 Jul 31 '19 at 16:30
  • @kcw78 I think you can name the field like this: `tmp = h5py.special_dtype(vlen=np.dtype('uint8'))` `dt = np.dtype([('elements', tmp)])` `dset = h5_file.create_dataset("Var Length", shape=(1,), maxshape=(None,), chunks=True, dtype=dt)` – ficus Aug 12 '20 at 07:20

I posted this answer because it took a little digging to get a single named variable-length column. You can have a named "ragged" column and set it like this:

import numpy as np
import h5py

dt = h5py.special_dtype(vlen=np.dtype('int32'))
# h5_file is an open h5py.File handle
dset = h5_file.create_dataset("some_data", shape=(2,), maxshape=(None,),
                              chunks=True, dtype=np.dtype([('name_var_lngth', dt)]))
array_test = np.array([1, 2, 3, 4, 5], np.dtype('int32'))
dset[0] = (array_test,)
array_test = np.array([1, 2, 3, 4, 5, 6], np.dtype('int32'))
dset[1] = (array_test,)

Trying to set a single field does not work; you can only set the whole record, as observed by others: Writing to compound dataset with variable length string via h5py (HDF5)
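
A quick read-back sketch (hypothetical file name) confirming the named ragged field round-trips; reading a single field works, even though writing one record's field does not:

```python
import numpy as np
import h5py

vlen_int = h5py.special_dtype(vlen=np.dtype('int32'))
dt = np.dtype([('name_var_lngth', vlen_int)])

with h5py.File('named_vlen_demo.h5', 'w') as h5f:
    dset = h5f.create_dataset("some_data", shape=(2,), maxshape=(None,),
                              chunks=True, dtype=dt)
    # Whole-record assignment: one-element tuple per row.
    dset[0] = (np.array([1, 2, 3, 4, 5], np.dtype('int32')),)
    dset[1] = (np.array([1, 2, 3, 4, 5, 6], np.dtype('int32')),)

with h5py.File('named_vlen_demo.h5', 'r') as h5f:
    dset = h5f['some_data']
    first = dset[0]['name_var_lngth']
    second = dset[1]['name_var_lngth']
    print(len(first), len(second))  # ragged lengths differ per row
```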

ficus