
I recently learned about HDF5 and its compression, and that it has some advantages over .npz/.npy when working with gigantic files. Since I sometimes work with lists of strings, I tried it out on a small list as follows:

import h5py

def write():
    test_array = ['a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2', 'a1','a2','a1','a2','a1','a2', 'a1','a2']

    with h5py.File('example_file.h5', 'w') as f:
        f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
        f.close()

However, I got this error:

f.create_dataset('test3', data=repr(test_array), dtype='S', compression='gzip', compression_opts=9)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/group.py", line 136, in create_dataset
    dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 118, in make_new_dset
    tid = h5t.py_create(dtype, logical=1)
  File "h5py/h5t.pyx", line 1634, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1656, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1689, in h5py.h5t.py_create
  File "h5py/h5t.pyx", line 1508, in h5py.h5t._c_string
ValueError: Size must be positive (size must be positive)

After searching the net for hours for a better way to do this, I couldn't find one. Is there a better way to compress lists with HDF5?

lobjc

2 Answers


This is a more general answer for nested lists where each nested list is a different length. It also works for the simpler case where the nested lists are all the same length. There are two solutions: one with h5py and one with PyTables.

h5py example
h5py does not support ragged arrays, so you have to create a dataset sized for the longest sublist and pad the shorter sublists. You will get 'None' (or a truncated form of it) at each array position that doesn't have a corresponding value in the nested list. Take care with the dtype= entry. The example below finds the longest string in the list (as slen=##) and uses it to create dtype='S##'.

import h5py
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'], 
             ['a11','a12','a13','a14','a15','a16','a17'], 
             ['a21','a22','a23','a24','a25','a26','a27','a28']]

# arrlen and test_array from answer to SO #10346336 - Option 3:
# Ref: https://stackoverflow.com/a/26224619/10462884    
slen = max(len(item) for sublist in test_list for item in sublist)
arrlen = max(map(len, test_list))
test_array = np.array([tl+[None]*(arrlen-len(tl)) for tl in test_list], dtype='S'+str(slen))
  
with h5py.File('example_nested.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip')
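As a quick check (a minimal sketch, assuming the file and dataset names above), the padded data reads back as a rectangular 2-d array; note the values come back as byte strings:

with h5py.File('example_nested.h5', 'r') as f:
    arr = f['test3'][:]    # 2-d array of byte strings
    print(arr.shape)       # (3, 8) -- every row padded to the longest sublist
    print(arr[0])          # shorter rows end with the pad value described above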

PyTables example
PyTables supports ragged 2-d arrays as VLArrays (variable length arrays). This avoids the complication of padding the shorter sublists with 'None' values. Also, you don't have to determine the array length in advance, as the number of rows is not fixed when the VLArray is created (rows are appended after creation). Again, take care with the dtype= entry. This uses the same method as above to find the longest string.

import tables as tb
import numpy as np

test_list = [['a01','a02','a03','a04','a05','a06'], 
             ['a11','a12','a13','a14','a15','a16','a17'], 
             ['a21','a22','a23','a24','a25','a26','a27','a28']]
   
slen = max(len(item) for sublist in test_list for item in sublist)

with tb.open_file('example_nested_tb.h5', 'w') as h5f:
    vlarray = h5f.create_vlarray('/', 'vla_test', tb.StringAtom(slen))
    for slist in test_list:
        arr = np.array(slist, dtype='S'+str(slen))
        vlarray.append(arr)

    print('-->', vlarray.name)
    for row in vlarray:
        print('%s[%d]--> %s' % (vlarray.name, vlarray.nrow, row))
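For completeness, a minimal sketch of reading the VLArray back (assuming the same file and node names, and the imports above); each row comes back as a separate NumPy array of byte strings, so the ragged structure is preserved:

with tb.open_file('example_nested_tb.h5', 'r') as h5f:
    vlarray = h5f.root.vla_test
    for row in vlarray:            # each row is a NumPy array of byte strings
        print(row.astype('U'))     # decode to Unicode for display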
kcw78

You are close. The data= argument is designed to work with an existing NumPy array. When you use a List, behind the scenes it is converted to an Array. It works for a List of numbers. (Note that Lists and Arrays are different Python object classes.)

You ran into an issue converting a list of strings. By default, the dtype is set to NumPy's Unicode type ('<U2' in your case). That is a problem for h5py (and HDF5). Per the h5py documentation: "HDF5 has no support for wide characters. Rather than trying to hack around this and “pretend” to support it, h5py will raise an error if you try to store data of this type." Complete details about NumPy and strings at this link: h5py doc: Strings in HDF5
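A quick way to see the problem (this snippet is just an illustration, not part of the original code): NumPy infers a Unicode dtype for a list of Python strings, which h5py rejects, while an explicit byte-string dtype is accepted:

import numpy as np

print(np.array(['a1', 'a2']).dtype)               # <U2  (Unicode -- rejected by h5py)
print(np.array(['a1', 'a2'], dtype='S2').dtype)   # |S2  (byte strings -- accepted)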

I modified your example slightly to show how you can get it to work. Note that I explicitly created the NumPy array of strings, and declared dtype='S2' to get the desired string dtype. I added an example using a list of integers to show how a list works for numbers. However, NumPy arrays are the preferred data object.

I removed the f.close() statement, as it is not required when using a context manager (the with/as structure).

Also, be careful with the compression level. You will get (slightly) more compression with compression_opts=9 compared to compression_opts=1, but you will pay in I/O processing time each time you access the dataset. I suggest starting with 1.

import h5py
import numpy as np

test_array = np.array(['a1','a2','a1','a2','a1','a2', 'a1','a2', 
                       'a1','a2','a1','a2','a1','a2', 'a1','a2', 
                       'a1','a2','a1','a2','a1','a2', 'a1','a2'], dtype='S2')

data_list = [ 1, 2, 3, 4, 5, 6, 7, 8, 9 ]

with h5py.File('example_file.h5', 'w') as f:
    f.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)

    f.create_dataset('test4', data=data_list, compression='gzip', compression_opts=1)
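To illustrate the compression-level note above, here is a small sketch (hypothetical file names; with a tiny array like this the difference is negligible, and results will vary with your data) that writes the same array at both levels and compares file sizes:

import os

# reuses h5py and test_array from the example above
with h5py.File('cmp_opts1.h5', 'w') as f1, h5py.File('cmp_opts9.h5', 'w') as f9:
    f1.create_dataset('test3', data=test_array, compression='gzip', compression_opts=1)
    f9.create_dataset('test3', data=test_array, compression='gzip', compression_opts=9)

print('opts=1:', os.path.getsize('cmp_opts1.h5'), 'bytes')
print('opts=9:', os.path.getsize('cmp_opts9.h5'), 'bytes')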
kcw78
  • Wow. I love this. Did you try it with nested lists too? I'll accept this, though I'm also interested in nested lists. Thanks for the effort!!!! – lobjc Mar 05 '21 at 15:36
  • HDF5 supports nested arrays. It's more complicated, but can be done. Can you provide an example? Warning: the biggest complication is checking the sizes of the nested Lists. Generally, HDF5 datasets use fixed shapes (sizes), so you have to know the shape of each before creating them. – kcw78 Mar 05 '21 at 15:41
  • I made the lists so that they are all equal len(): pp = my_list; chunks = [pp[x:x+700] for x in range(0, len(pp), 700)]; then data_list will be my 'chunks'. – lobjc Mar 05 '21 at 15:50
  • I noticed that when reading the h5 file (hf = h5py.File('drugrx_nested.h5', 'r'); data = hf['test3'][:]), the lists come out as byte strings: [b'a1', b'a2', b'a1', b'a2', b'a1', b...']. Any way to produce strings instead? – lobjc Mar 06 '21 at 04:36
  • You have an array of bytestrings. You have to decode them. There are several ways to do this. The simplest is adding `.astype('U')` to your data statement (this converts them to Unicode) -- like this `data = hf['test3'][:].astype('U')`. See this SO answer for other methods: [How to decode a numpy array of encoded strings](https://stackoverflow.com/a/40391991/10462884) – kcw78 Mar 06 '21 at 14:41
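Expanding on that last comment, a minimal sketch (assuming the example_file.h5 / 'test3' dataset from the answer above) of reading the data back and decoding the byte strings in one step:

import h5py

with h5py.File('example_file.h5', 'r') as hf:
    data = hf['test3'][:].astype('U')   # read the byte strings and decode to Unicode
    print(data)                         # ['a1' 'a2' 'a1' ...] as NumPy Unicode strings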