Given a list of lists of strings, such as:

test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]

I'd like to store it using h5py such that:

f['test_dataset'][0] = ['a1','a2']
f['test_dataset'][0][0] = 'a1'
etc.

Following the advice in the thread H5py store list of list of strings, I tried the following:

import h5py
test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
with h5py.File('test.h5','w') as f:
    string_dt = h5py.special_dtype(vlen=str)
    f.create_dataset('test_dataset',data=test_array,dtype=string_dt)

However, this results in each nested list being stored as a single string, i.e.:

f['test_dataset'][0] = "['a1', 'a2']"
f['test_dataset'][0][0] = '['

If this isn't possible using h5py, or any other hdf5-based library, I'd be happy to hear any suggestions of other possible formats/libraries that I could use to store my data.

My data consists of multidimensional numpy integer arrays and nested lists of strings as in the example above, with more than 100M rows and about 8 columns.

Thanks!

Daniel Crane
  • 1
    Have you considered storing a flattened version of the list of lists: `x = [ 'a1','a2', 'b1', 'c1','c2','c3','c4' ]`. If you then also store the indices `idx = [2,3]` (where `x` is to be split), then you can regenerate a list of arrays using `np.split(x, idx)`. – unutbu Jul 24 '17 at 02:15
  • 1
    Alternatively, perhaps you could store both `x` and "group numbers" `y = [0,0,1,2,2,2,2]`, which indicate which list the values in `x` are supposed to belong to. – unutbu Jul 24 '17 at 02:16
  • I was also considering this, however that makes adding new items to existing rows a bit more fiddly. I was also considering just keeping it like this, with the rows stored as strings, and just using eval() on them to convert back to lists when needed, but that comes with some other problems of its own. Thanks for your comment! – Daniel Crane Jul 24 '17 at 02:34
  • The [answer](https://stackoverflow.com/a/38465587) to the [question you linked](https://stackoverflow.com/q/37873311) has a link to [another question](https://stackoverflow.com/q/14639496), the [answers](https://stackoverflow.com/questions/14639496/python-numpy-array-of-arbitrary-length-strings#answers-header) of which seem like they might help you. (basically they say to use `dtype=object`, and point out that it'll make things slower.) – 3D1T0R Jul 24 '17 at 02:53
  • Thanks for the comment @3D1T0R, however when they say 'variable length strings' in that context, I believe they mean that it's just one list of strings each with a different length, i.e. ['one','twenty','one hundred']. I would like to do this, too, but in my case the problem is storing a multidimensional version of such an array. – Daniel Crane Jul 24 '17 at 02:56
  • @DanielCrane: It should allow storing any object, including strings, lists or anything else. – 3D1T0R Jul 24 '17 at 03:19
  • `h5py` does not store general `numpy` object dtype arrays. HDF5 variable length strings map on to `numpy` object arrays. As shown, flattened arrays of strings or string objects can be saved, but not nested lists. That is, `h5py` does not save Python lists. – hpaulj Jul 24 '17 at 03:52
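
The flattened layout suggested in the comments above could look roughly like the sketch below. It assumes h5py and NumPy are installed; the dataset names `values` and `split_at` and the temporary file location are made up for illustration and do not come from the thread.

```python
import os
import tempfile

import h5py
import numpy as np

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# Flatten the rows and record the positions where one row ends and the next begins.
flat = [s for row in test_array for s in row]
split_at = np.cumsum([len(row) for row in test_array])[:-1]  # array([2, 3])

string_dt = h5py.special_dtype(vlen=str)
path = os.path.join(tempfile.mkdtemp(), 'flat.h5')  # hypothetical location

with h5py.File(path, 'w') as f:
    # 'values' and 'split_at' are illustrative dataset names.
    f.create_dataset('values', data=np.array(flat, dtype=object), dtype=string_dt)
    f.create_dataset('split_at', data=split_at)

with h5py.File(path, 'r') as f:
    raw = f['values'][:]
    idx = f['split_at'][:]

# Depending on the h5py version, strings come back as str or bytes; handle both.
decoded = [s.decode('utf-8') if isinstance(s, bytes) else s for s in raw]
restored = [list(chunk) for chunk in np.split(np.array(decoded, dtype=object), idx)]
```

Both datasets are plain one-dimensional arrays, so this stays within what HDF5 handles well. The trade-off raised in the comments still applies: appending to an existing row means shifting everything after it.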

2 Answers

As in Saving with h5py arrays of different sizes, I suggest saving a list of variable-length arrays as multiple datasets.

In [18]: import h5py; import numpy as np
In [19]: f = h5py.File('test.h5','w')
In [20]: g = f.create_group('test_array')
In [21]: test_array = [ ['a1','a2'], ['b1'], ['c1','c2','c3','c4'] ]
In [22]: string_dt = h5py.special_dtype(vlen=str)
In [23]: for i,v in enumerate(test_array):
    ...:     g.create_dataset(str(i), data=np.array(v,'S4'), dtype=string_dt)
    ...:     
In [24]: for k in g.keys():
    ...:     print(k,g[k][:])
    ...:     
0 ['a1' 'a2']
1 ['b1']
2 ['c1' 'c2' 'c3' 'c4']

For many small sublists this could be messy, though I'm not sure it's inefficient.

'Flattening' with a list join might work:

In [27]: list1 =[', '.join(x) for x in test_array]
In [28]: list1
Out[28]: ['a1, a2', 'b1', 'c1, c2, c3, c4']
In [30]: '\n'.join(list1)
Out[30]: 'a1, a2\nb1\nc1, c2, c3, c4'

The nested list can be recreated with a couple of splits.
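
A minimal sketch of that round trip, using only built-in string methods. It assumes that none of the strings contain the separators `', '` or `'\n'`, and that no row is empty (an empty row would come back as `['']` rather than `[]`).

```python
test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# Join each row with ', ' and then join the rows with newlines.
joined = '\n'.join(', '.join(row) for row in test_array)

# Split in the reverse order to rebuild the nested list.
restored = [line.split(', ') for line in joined.split('\n')]
```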

Another thought - pickle to a string and save that.


From the h5py intro

An HDF5 file is a container for two kinds of objects: datasets, which
are array-like collections of data, and groups, which are folder-like
containers that hold datasets and other groups. The most fundamental
thing to remember when using h5py is:

Groups work like dictionaries, and datasets work like NumPy arrays

pickle doesn't work, because the pickled bytes contain NULLs:

In [32]: import pickle
In [33]: pickle.dumps(test_array)
Out[33]: b'\x80\x03]q\x00(]q\x01(X\x02\x00\x00\x00a1q\x02X\x02\x00\x00\x00a2q\x03e]q\x04X\x02\x00\x00\x00b1q\x05a]q\x06(X\x02\x00\x00\x00c1q\x07X\x02\x00\x00\x00c2q\x08X\x02\x00\x00\x00c3q\tX\x02\x00\x00\x00c4q\nee.'
In [34]: f.create_dataset('pickled', data=pickle.dumps(test_array), dtype=string_dt)
....
ValueError: VLEN strings do not support embedded NULLs
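
One possible way around the embedded NULLs, not tried in the session above, is to base64-encode the pickled bytes so that the result is plain ASCII text. The sketch below only shows the encoding round trip with the standard library; the resulting `encoded` string is the kind of value that could then be passed as `data` for a VLEN string dataset.

```python
import base64
import pickle

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# base64 output uses only ASCII letters, digits, '+', '/' and '=', so it has no NULLs.
encoded = base64.b64encode(pickle.dumps(test_array)).decode('ascii')

# Reverse the two steps to get the original object back.
restored = pickle.loads(base64.b64decode(encoded))
```

Base64 inflates the payload by about a third, and unpickling data from an untrusted file is unsafe, so this is only reasonable for files you wrote yourself.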

json does work:

In [35]: import json
In [36]: json.dumps(test_array)
Out[36]: '[["a1", "a2"], ["b1"], ["c1", "c2", "c3", "c4"]]'
In [37]: f.create_dataset('pickled', data=json.dumps(test_array), dtype=string_dt)
Out[37]: <HDF5 dataset "pickled": shape (), type "|O">
In [43]: json.loads(f['pickled'][()])
Out[43]: [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]
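
To keep rows individually addressable, as the question asks, the same idea could be applied per row: one JSON string per nested list, stored in a single one-dimensional VLEN string dataset. This is a sketch under the assumption that h5py and NumPy are installed; the file location is made up, and the dataset name is borrowed from the question.

```python
import json
import os
import tempfile

import h5py
import numpy as np

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

string_dt = h5py.special_dtype(vlen=str)
# One JSON document per row, e.g. '["a1", "a2"]'.
encoded_rows = np.array([json.dumps(row) for row in test_array], dtype=object)

path = os.path.join(tempfile.mkdtemp(), 'rows.h5')  # hypothetical location

with h5py.File(path, 'w') as f:
    f.create_dataset('test_dataset', data=encoded_rows, dtype=string_dt)

with h5py.File(path, 'r') as f:
    # json.loads accepts both str and bytes, whichever this h5py version returns.
    first_row = json.loads(f['test_dataset'][0])
    all_rows = [json.loads(s) for s in f['test_dataset'][:]]
```

Each access still needs a `json.loads` call, but the file holds one dataset instead of one per row.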
hpaulj
  • Ah yes, I saw this thread too when I was searching last week. The problem is that this would result in potentially tens or hundreds of millions of datasets, and I'm not sure if that'd affect performance or file size (maybe some overheads?) or not. – Daniel Crane Jul 24 '17 at 04:24
  • How about `join` to collect the strings into larger string units? The basic problem is that `HDF5` was developed primarily for multidimensional numeric data, not general lists of lists of various items. – hpaulj Jul 24 '17 at 04:30
  • The method I'm currently using does a similar kind of flattening, with each of the nested lists becoming a string, i.e. "['a1','a2']", which can then be converted into an actual list by just using list1 = eval("['a1','a2']"). Storing it like this isn't the end of the world, I just worry about the code readability in the future, and thought it'd be more intuitive if it was possible to store as a real array rather than flattening or stringifying everything. :) Pickling might indeed end up being the best option, I'll look into that a bit more now, thanks for the comment. – Daniel Crane Jul 24 '17 at 04:36
  • `json` looks more promising than `pickle`. – hpaulj Jul 24 '17 at 04:40

An ugly workaround: store the `repr` of the nested list as a single string.

hf.create_dataset('test', data=repr(test_array))
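
Reading such a string back does not require `eval()`. The sketch below, using only the standard library, parses the `repr` with `ast.literal_eval`, which accepts Python literals only and does not execute arbitrary code. It starts from the string itself; a value read from the file may first need decoding from bytes to str, depending on the h5py version.

```python
import ast

test_array = [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3', 'c4']]

# The string that would be written to the file.
stored = repr(test_array)

# Parse it back into a nested list without evaluating arbitrary code.
restored = ast.literal_eval(stored)
```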
周志华