I am experiencing some really weird interactions between h5py, PyTables (via Pandas), and HDF5 files generated from C++. It seems that h5check and h5py cope with type names containing '/', but Pandas/PyTables cannot. Clearly, there is a gap in my understanding, so:
What have I not understood here?
The gory details
I have the following data in a HDF5 file:
[...]
DATASET "log" {
DATATYPE H5T_COMPOUND {
H5T_COMPOUND {
H5T_STD_U32LE "sec";
H5T_STD_U32LE "usec";
} "time";
H5T_IEEE_F32LE "CIF/align/aft_port_end/extend_pressure";
[...]
This was created via the C++ API. The h5check utility says the file is valid.
Note that CIF/align/aft_port_end/extend_pressure is not meant as a path to a group/node/leaf. It is a label that we use internally, which happens to have some internal structure with '/' as the delimiter. We do not want the HDF5 file to know anything about that: it should not care. Clearly, if '/' is illegal in any HDF5 name, then we have to change that delimiter to something else.
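If it comes to that, the renaming itself would be simple. Here is a minimal sketch, assuming '|' never occurs in our labels; the helper names are hypothetical and not part of HDF5, h5py or PyTables:

```python
# Hypothetical helpers: neither name comes from HDF5, h5py or PyTables.
SAFE_DELIMITER = "|"  # assumption: this character never occurs in our labels


def to_safe_name(label, delimiter=SAFE_DELIMITER):
    """Replace '/' in an internal label so it can be used as a field name."""
    if delimiter in label:
        raise ValueError("label already contains %r: %r" % (delimiter, label))
    return label.replace("/", delimiter)


def from_safe_name(name, delimiter=SAFE_DELIMITER):
    """Recover the original '/'-delimited label from a stored field name."""
    return name.replace(delimiter, "/")


label = "CIF/align/aft_port_end/extend_pressure"
stored = to_safe_name(label)
print(stored)                  # CIF|align|aft_port_end|extend_pressure
print(from_safe_name(stored))  # CIF/align/aft_port_end/extend_pressure
```

The C++ writer would need the same mapping, so I would rather avoid this unless '/' really is illegal.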
Using PyTables (okay, Pandas, but it uses PyTables internally) to read the file, I get a warning:
>>> import pandas as pd
>>> store = pd.HDFStore('data/XXX-20150423-071618.h5')
>>> store
/home/XXX/virt/env/develop/lib/python2.7/site-packages/tables/group.py:1156: UserWarning: problems loading leaf ``/log``::
the ``/`` character is not allowed in object names: 'XXX/align/aft_port_end/extend_pressure'
The leaf will become an ``UnImplemented`` node.
I asked about this in this question and was told that '/' is illegal according to the specification. However, things get stranger with h5py...
Using h5py to read the file, I get what I want:
>>> import h5py
>>> f = h5py.File('data/XXX-20150423-071618.h5', 'r')
>>> f['/log'].dtype
dtype([('time', [('sec', '<u4'), ('usec', '<u4')]), ('CIF/align/aft_port_end/extend_pressure', '<f4')[...]
Which is more or less what I set out with.
Needless to say, I am confused. Have I managed to create an illegal HDF5 file that somehow passes h5check? Or does PyTables simply not support this edge case?
Clearly, I could write a simple wrapper, something like this:
>>> import matplotlib.pyplot as plt
>>> silly = pd.DataFrame(f['/log']['CIF/align/aft_port_end/extend_pressure'])
>>> silly.plot()
>>> plt.show()
to get all the data from the HDF5 file into Pandas. However, I am not sure this is a good idea given the confusion above. My biggest worry is that the conversion might not scale if the data is very large.
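One thing I have considered for the scaling worry is reading the dataset in row slices rather than all at once. This is only a rough sketch: row_slices is a hypothetical helper, the chunk size is an arbitrary guess, and the h5py part is shown as comments because I have not measured it on the real file:

```python
def row_slices(n_rows, chunk_rows=100000):
    """Yield slice objects that cover range(n_rows) in order."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for start in range(0, n_rows, chunk_rows):
        yield slice(start, min(start + chunk_rows, n_rows))


# Intended use with the dataset from above (not run here):
#   dset = f['/log']
#   for rows in row_slices(dset.shape[0]):
#       block = dset[rows]  # reads only these rows from disk
#       values = block['CIF/align/aft_port_end/extend_pressure']

print(list(row_slices(10, 4)))
# [slice(0, 4, None), slice(4, 8, None), slice(8, 10, None)]
```

Each block could then be turned into a DataFrame and processed separately, but I do not know whether that is the sensible route or whether I am just working around a broken file.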