
I am experiencing some really weird interactions between h5py, PyTables (via Pandas), and C++ generated HDF5 files. It seems that h5check and h5py cope with type names containing '/' but pandas/PyTables cannot. Clearly, there is a gap in my understanding, so:

What have I not understood here?


The gory details

I have the following data in an HDF5 file:

   [...]
   DATASET "log" {
      DATATYPE  H5T_COMPOUND {
         H5T_COMPOUND {
            H5T_STD_U32LE "sec";
            H5T_STD_U32LE "usec";
         } "time";
         H5T_IEEE_F32LE "CIF/align/aft_port_end/extend_pressure";
         [...]

This was created via the C++ API. The h5check utility says the file is valid.

Note that CIF/align/aft_port_end/extend_pressure is not meant as a path to a group/node/leaf. It is a label that we use internally, which happens to have some internal structure with '/' as the delimiter. We do not want the HDF5 file to know anything about that: it should not care. Clearly, if '/' is illegal in any HDF5 name, then we have to change that delimiter to something else.

Using PyTables (okay, Pandas, but it uses PyTables internally) to read the file, I get a warning:

 >>> import pandas as pd
 >>> store = pd.HDFStore('data/XXX-20150423-071618.h5')
 >>> store
/home/XXX/virt/env/develop/lib/python2.7/site-packages/tables/group.py:1156: UserWarning: problems loading leaf ``/log``::

  the ``/`` character is not allowed in object names: 'XXX/align/aft_port_end/extend_pressure'

The leaf will become an ``UnImplemented`` node. 

I asked about this in this question and got told that '/' are illegal in the specification. However, things get stranger with h5py...

Using h5py to read the file, I get what I want:

>>> import h5py
>>> f = h5py.File('data/XXX-20150423-071618.h5', 'r')
>>> f['/log'].dtype
dtype([('time', [('sec', '<u4'), ('usec', '<u4')]), ('CIF/align/aft_port_end/extend_pressure', '<f4')[...]

Which is more or less what I set out with.

Needless to say, I am confused. Have I managed to create an illegal HDF5 file that somehow passes h5check? Or does PyTables simply not support this edge case?


Clearly, I could write a simple wrapper, something like this:

>>> import matplotlib.pyplot as plt
>>> silly = pd.DataFrame(f['/log']['CIF/align/aft_port_end/extend_pressure'])
>>> silly.plot()
>>> plt.show()

to get all the data from the HDF5 file into Pandas. However, I am not sure this is a good idea given the confusion above. My biggest worry is that the conversion might not scale if the data is very large...

Sardathrion - against SE abuse

3 Answers

7

I've browsed a bit through the h5check source and I can't find any place where it tests if a name contains a slash. You can examine the error messages it can produce with:

grep error_push h5checker.c -A1

The links you provided clearly state that slashes are not allowed in object names. So yes, I think you've made a file that is illegal but passes h5check. The tool seems to focus more on the binary data layout. The closest related check I can find is a guard against duplicate names.

In my opinion that's all there is to it. The fact that h5py and other libraries somehow are able to create or read this illegal file is irrelevant. The spec says "don't put slashes in object names", so you don't. End of story.

If you're not convinced, think of it like this: if you somehow managed to create a regular file with a slash in its file name, what would happen? Most programs assume that file names contain no slashes and thus that they can partition a directory path by splitting it at the slash characters. Your file would break this behavior and so introduce many subtle (and not so subtle) bugs. Users would complain, programmers would hate you, system administrators would curse you.

Likewise, it's safe to assume that, besides PyTables, many other libraries and programs will not be able to handle slashes in variable names. The nice thing about HDF5 is that so many tools exist for it, and by using slashes you throw away that advantage. You may think that this is not important; perhaps your HDF5 files are for internal use only. However, the situation may change in 5 years, as situations tend to do.

Just bite the bullet and replace '/' with '|' before writing your variables to HDF5. Replace them back when you read them. The time you lose by implementing this, you'll win back x-fold (for x>1) by avoiding future bugs and user complaints.
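A minimal sketch of that replace-and-restore step, in Python for illustration (the writer in the question is C++, where the same idea applies). The helper names `to_hdf5_name` and `from_hdf5_name` are hypothetical, and the sketch assumes that '|' never occurs in your internal labels; it raises an error if that assumption is violated, since the round trip would otherwise be ambiguous.

```python
LABEL_SEP = "/"   # delimiter used in the internal labels
STORED_SEP = "|"  # delimiter written to the HDF5 file (assumed unused in labels)


def to_hdf5_name(label):
    """Convert an internal label to the name stored in the file."""
    if STORED_SEP in label:
        # Refuse labels that would not survive the round trip.
        raise ValueError("label already contains %r: %r" % (STORED_SEP, label))
    return label.replace(LABEL_SEP, STORED_SEP)


def from_hdf5_name(name):
    """Convert a stored name back to the internal label."""
    return name.replace(STORED_SEP, LABEL_SEP)


label = "CIF/align/aft_port_end/extend_pressure"
stored = to_hdf5_name(label)
print(stored)                  # CIF|align|aft_port_end|extend_pressure
print(from_hdf5_name(stored))  # CIF/align/aft_port_end/extend_pressure
```

Keeping both directions in one small module means the choice of replacement character lives in a single place if it ever has to change.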

Sorry about the rant but I hope to have convinced you.

titusjan
  • Up voted, accepted, and bounty awarded as this was the best answer. – Sardathrion - against SE abuse May 14 '15 at 10:52
  • 2
    @Sardathrion, your file is perfectly fine. There are no slash-related restrictions I am aware of on the labels of compound type members. The document you linked refers to names in the "group" namespace; i.e. the POSIX-style paths to objects in the file. – Andrew Collette May 15 '15 at 18:27
  • @andrew-collette: do you think it is a bug in PyTables then? Sardathrion, even though I apparently misinterpreted the specs and the file is correct, I stand by my point: by using slashes you run the risk of issues with other libraries and programs. This can easily be avoided by replacing them. – titusjan May 16 '15 at 12:47
  • Almost certainly a bug in PyTables. Slash replacement is fine as a workaround. I am not aware of any other environment with this behavior. – Andrew Collette May 18 '15 at 00:04
  • What if I'm dealing with a program that creates the hdf5 file and I have to use it as is. However, I know that the structure is set to not have 'groups' and only 'datasets', similar to a dictionary or key value pair, wherein the key is a string containing forward slashes. Is there a way to read the hdf5 this way and ignore the slashes? – chase May 31 '18 at 17:59
  • Altering the keys of the hdf5 in that case would be fine cosmetically, by changing the slashes to something else afterwards, it's mainly keeping the data value that matters. The problem is how to do that without reading it in first – chase May 31 '18 at 18:03
  • If you have to read the file 'as-is' I guess you have to deal with it. Apparently `h5py` can handle slashes, so I suggest you try to read your file with `h5py` and open a new Stack Overflow question if you run into problems. – titusjan Jun 02 '18 at 13:11
2

Could you use h5py to read through all your files and rewrite them without the offending characters, so that PyTables can read them?

If it is outside the spec, I assume what you are experiencing is just that some implementations handle it and others do not...
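A rough sketch of such a rewrite, assuming h5py and NumPy are available. The helper names `sanitized_dtype` and `rewrite_dataset` are hypothetical, '|' is an arbitrary replacement character, and the dataset is read fully into memory, so a very large file would need a chunked copy instead. Whether PyTables accepts the result has not been verified here.

```python
import numpy as np


def sanitized_dtype(dtype, old="/", new="|"):
    """Return a packed copy of a compound dtype with `old` replaced in field names."""
    if dtype.names is None:
        return dtype  # not a compound type: nothing to rename
    return np.dtype([
        # Recurse so nested compounds such as "time" are handled too.
        (name.replace(old, new), sanitized_dtype(dtype.fields[name][0], old, new))
        for name in dtype.names
    ])


def rewrite_dataset(src_path, dst_path, dataset="/log"):
    """Copy one compound dataset into a new file under sanitized field names."""
    import h5py  # imported here so the dtype helper works without h5py

    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        data = src[dataset][...]  # loads the whole dataset into memory
        # Structured-to-structured astype matches fields by position,
        # so only the names change, not the values.
        dst.create_dataset(dataset, data=data.astype(sanitized_dtype(data.dtype)))
```

Only the member labels of the compound type change; attributes and other objects in the source file are not copied by this sketch.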

tmthydvnprt
  • This would be a one-time fix as long as you can update the C++ API that the bad characters originate from. Otherwise you will have to put this preprocessing step into your workflow. – tmthydvnprt May 06 '15 at 20:08
  • The first part is what my last edit was about: yes, I can do that. However, it seems to be a little contrived. As for the latter part, I am not sure whether it is outside the specifications or not. All of h5check, the C++ API, and h5py seem to think it's fine. Only PyTables complains. – Sardathrion - against SE abuse May 07 '15 at 06:27
0

Make sure you are creating groups rather than just the path name outright - this is probably where the fault creeps in. If you create the groups leading to your objects and then name the objects with the leaf names (extend_pressure in the above) you won't have any problems anywhere.

H5py is a pretty thin wrapper around the C HDF5 library; pandas/PyTables are a lot more heavyweight in approach - or at least they have a lot more of their own semantics going on - and so they are checking to make sure you don't have '/' in your object names. But keep in mind everybody is using the HDF5 library at the end of the day because, while HDF5 is great, it would be a huge effort to make an alternative implementation - beyond the resources of Pandas/PyTables.

Minor disclaimer: I've hacked on internals of HDF5 and H5py before.

Jason Newton
  • `"CIF/align/aft_port_end/extend_pressure"` is not a path to a group node/leaf. It is a name in and of itself, just a label with internal structure that HDF5 should not care about. At least, that's the theory. – Sardathrion - against SE abuse May 08 '15 at 13:44
  • @Sardathrion yes, that was what I gathered. Think of it like a filesystem though - a filename can't contain path slashes (or it's bad practice at least). Anyway, while the HDF5 library works with it, the library doesn't adhere 100% to the specification even though the same group wrote the specification - that's just a reality. It's unfortunate that creating a dataset like that doesn't automatically create the path too as a convenience, but such is the library - in order to avoid that error/issue, use groups properly. – Jason Newton May 08 '15 at 13:59