
This is what my dataframe looks like. The first column is a single int. The second column is a single list of 512 ints.

IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131    [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716    [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098    [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...

I saved it to hdf and tried opening it using

import h5py

df.to_hdf('test.h5', key='df', data_columns=True)
h3 = h5py.File('test.h5', 'r')

I see 4 keys when I list the keys

h3['df'].keys()

KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']

axis1 seems to contain the values for the first column

h3['df']['axis1'][0:5]

array([ 1899317, 22861131, 2163410, 15760716, 12244098,

However, there doesn't seem to be any data from the second column. There is another dataset with other data

h3['df']['block0_values'][0][0:5]

But that doesn't seem to correspond to any of the data in the second column

array([128, 4, 149, 1, 0], dtype=uint8)

Purpose

I am eventually trying to create a datastore that's memory mapped, that retrieves data using particular indices.

So something like

h3['df']['workingIndex'][22861131, 15760716] 

would retrieve

[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...],
[0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
SantoshGupta7

1 Answer

The problem is you're trying to serialize a Pandas Series of Python lists and it is not rectangular (it is jagged).

Pandas and HDF5 are largely used for rectangular (cube, hypercube, etc) data, not for jagged lists-of-lists.

Did you see this warning when you called to_hdf()?

PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Ids']]

What it's trying to tell you is that lists-of-lists are not supported in an intuitive, high-performance way. And if you run an HDF5 visualization tool like h5dump on your output file, you'll see what's wrong. The index (which is well-behaved) looks like this:

  DATASET "axis1" {
     DATATYPE  H5T_STD_I64LE
     DATASPACE  SIMPLE { ( 5 ) / ( 5 ) }
     DATA {
     (0): 1899317, 22861131, 2163410, 15760716, 12244098
     }
     ATTRIBUTE "CLASS" {
        DATA {
        (0): "ARRAY"
        }
     }

But the values (lists of lists) look like this:

  DATASET "block0_values" {
     DATATYPE  H5T_VLEN { H5T_STD_U8LE}
     DATASPACE  SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
     DATA {
     (0): (128, 5, 149, 164, ...)
     }
     ATTRIBUTE "CLASS" {
        DATA {
        (0): "VLARRAY"
        }
     }
     ATTRIBUTE "PSEUDOATOM" {
        DATA {
        (0): "object"
        }
     }

What's happening is exactly what the PerformanceWarning warned you about:

> PyTables will pickle object types that it cannot map directly to c-types

Your list-of-lists is being pickled and stored as H5T_VLEN which is just a blob of bytes.
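
To see why the bytes look unrelated to your data, here is a minimal sketch using only the standard `pickle` module. The short `column` list is a made-up stand-in for your `Ids` column, and protocol 4 is an assumption chosen to match the `128, 4, 149` prefix in your output; the exact protocol PyTables picks may differ between versions.

```python
import pickle

# A shortened, made-up stand-in for the Ids column: a list of Python lists.
column = [[0, 47715, 1757], [0, 48156, 154]]

# Pickle it the way an object column ends up serialized (protocol 4 assumed).
blob = pickle.dumps(column, protocol=4)

# First three bytes: 0x80 (PROTO opcode), protocol number, 0x95 (FRAME opcode).
header = list(blob[:3])
print(header)

# The original data only comes back after unpickling the whole blob.
restored = pickle.loads(blob)
print(restored == column)
```

The second byte is the pickle protocol number, which would be consistent with seeing 4 in one dump and 5 in another. Either way, the ints are buried inside an opaque byte stream, so they cannot be sliced or memory mapped row by row.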

Here are some ways you could fix this:

  1. Store each row under a separate key in HDF5. That is, each list will be stored as an array, and they can all have different lengths. This is no problem with HDF5, because it supports any number of keys in one file.
  2. Change your data to be rectangular, e.g. by padding the shorter lists with zeros. See: "Pandas split column of lists into multiple columns"
  3. Use h5py to write the data in whatever format you like. It's much more flexible and creates simpler (and yet more powerful) HDF5 files than Pandas/PyTables. Here's one example (which shows h5py can actually store jagged arrays, though it's not pretty): "Storing multidimensional variable length array with h5py"
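
Options 2 and 3 can be combined. The following is a rough sketch, assuming every list has the same length (512 in your case), so the column fits in a plain 2D array. The dataset names `index_id` and `ids`, the file name, and the generated values are all hypothetical placeholders.

```python
import os
import tempfile

import h5py
import numpy as np

# Made-up stand-in data: 5 rows of 512 ints each.
index_ids = np.array(
    [1899317, 22861131, 2163410, 15760716, 12244098], dtype=np.int64
)
ids = np.arange(5 * 512, dtype=np.int64).reshape(5, 512)
# With a real DataFrame, np.array(df["Ids"].tolist()) should build the same
# kind of 2D array, provided all lists have equal length.

path = os.path.join(tempfile.mkdtemp(), "ids.h5")  # hypothetical file name

# Write both arrays as ordinary rectangular datasets (no pickling involved).
with h5py.File(path, "w") as f:
    f.create_dataset("index_id", data=index_ids)
    f.create_dataset("ids", data=ids)

# Read back: slicing a dataset reads only the requested part from disk.
with h5py.File(path, "r") as f:
    stored_shape = f["ids"].shape
    stored_dtype = f["ids"].dtype
    second_row = f["ids"][1, :5]

print(stored_shape, stored_dtype, second_row)
```

Stored this way, h5dump should show `ids` as a simple integer dataset rather than an `H5T_VLEN` blob.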
John Zwinck
  • "rectangular (it is jagged)." All the data in the first column is only one number, and all the data in the second column are 512 ints. There are no lists-of-lists, the 2nd column only contains one list. I'll update the post to specify this information. – SantoshGupta7 May 10 '20 at 02:43
  • When I say "list of lists" I refer to your column of lists. In Pandas it is a Series of lists, but the point is the same: you should convert it to a proper DataFrame with one number per column, or a 2D NumPy array. Point (2) at the end of my answer is then applicable; you just need to convert your Series of lists into something more explicitly 2D (because even if your lists happen to all have the same length, Pandas doesn't realize or take advantage of that automatically). – John Zwinck May 10 '20 at 02:47
  • Given this new information, I suggest simply converting `Ids` to a 2D NumPy array and saving it plus the 1D index using h5py directly. Don't bother with `to_hdf()`; it is too complicated for no benefit to you. – John Zwinck May 10 '20 at 02:49
  • Eventually, I need to be able to take a list of indices, which correspond to IndexID, and retrieve the corresponding Ids. I'm trying to think how to do this with your suggested setup. I can save the 2D Ids to a numpy memory map, but then I can only index by row number, since there is no `IndexID`. But the h5py file has the `IndexID`. So I could look up the indices in the h5py datastore, somehow get the corresponding row numbers, and use those row numbers with the numpy memmap? Or were you thinking of something else? – SantoshGupta7 May 10 '20 at 03:01
  • @SantoshGupta7: If you need fast lookup, store `IndexID : row` in a hash table in memory, i.e. a `dict`. Even if the full table of `Ids` doesn't fit in memory, presumably the index does. – John Zwinck May 10 '20 at 03:21
  • I tried "Pandas split column of lists into multiple columns" and saved it using `to_hdf()`. Opening it using any keys (eg `h3['df']['axis1']`) results in KeyError: 'Unable to open object (bad object header version number)'. I was wondering if pandas was saving as an object somehow, but I checked df.dtypes and all were int64 – SantoshGupta7 May 10 '20 at 03:22
  • As I said, `to_hdf()` is overly complicated for your use case, and is only holding you back. Just use `h5py` for more flexible yet simpler storage. – John Zwinck May 10 '20 at 03:36
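
The `IndexID : row` dictionary suggested in the comments could look roughly like the sketch below. It uses plain Python lists as a stand-in for the on-disk 2D array, with rows shortened to three values for illustration; `row_of` and `lookup` are hypothetical names, not part of any library.

```python
# Stand-in for the 1D index and the 2D Ids array (rows shortened, made-up layout).
index_ids = [1899317, 22861131, 2163410, 15760716, 12244098]
ids = [
    [0, 47715, 1757],
    [0, 48156, 154],
    [0, 26039, 41156],
    [0, 40883, 4086],
    [0, 45651, 4128],
]

# In-memory hash table mapping each IndexID to its row number.
row_of = {index_id: row for row, index_id in enumerate(index_ids)}


def lookup(wanted):
    """Return the rows of `ids` for the given IndexID values, in request order."""
    return [ids[row_of[index_id]] for index_id in wanted]


result = lookup([22861131, 15760716])
print(result)
```

With a real h5py dataset in place of `ids`, the same row numbers would be used to index the dataset. As far as I know, h5py expects a list of row numbers in a single selection to be in increasing order, so the rows may need sorting before the read and reordering afterwards.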