I have a bunch of HDF5 files, and I want to turn some of the data in them into Parquet files. I'm struggling to read them into pandas/pyarrow, though, which I think is related to the way the files were originally created.
If I open the file using h5py, the data looks exactly how I would expect:
import h5py
file_path = "/data/some_file.hdf5"
hdf = h5py.File(file_path, "r")
print(list(hdf.keys()))
gives me
>>> ['foo', 'bar', 'baz']
In this case I'm interested in the group "bar", which has 3 items in it.
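For what it's worth, I can inspect those items with h5py without any trouble. A minimal version of what I'm doing, assuming for illustration that all three items in "bar" are datasets:

import h5py

file_path = "/data/some_file.hdf5"
with h5py.File(file_path, "r") as hdf:
    # Print each member of the "bar" group with its shape and dtype
    for name, dataset in hdf["bar"].items():
        print(name, dataset.shape, dataset.dtype)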
If I try to read the data using HDFStore, I am unable to access any of the groups.
import pandas as pd
file_path = "/data/some_file.hdf5"
store = pd.HDFStore(file_path, "r")
Then the HDFStore object has no keys or groups:
assert not store.groups()
assert not store.keys()
And if I try to access the data, I get the following error:
bar = store.get("/bar")
TypeError: cannot create a storer if the object is not existing nor a value are passed
Similarly, if I try to use pd.read_hdf, it looks like the file is empty:
import pandas as pd
file_path = "/data/some_file.hdf5"
df = pd.read_hdf(file_path, mode="r")
ValueError: Dataset(s) incompatible with Pandas data types, not table, or no datasets found in HDF5 file.
and
import pandas as pd
file_path = "/data/some_file.hdf5"
pd.read_hdf(file_path, key="/bar", mode="r")
TypeError: cannot create a storer if the object is not existing nor a value are passed
Based on this answer, I'm assuming the problem is that pandas expects a very particular hierarchical structure, which is different from the one the actual HDF5 file has.
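As a sanity check, I wrote a small DataFrame with pandas itself and inspected the result with h5py, which shows the extra pytables metadata that my files presumably lack (the file path and key below are made up):

import pandas as pd
import h5py

# Write a tiny frame via pandas/pytables (requires the "tables" package),
# then inspect the resulting file with h5py.
pd.DataFrame({"a": [1, 2, 3]}).to_hdf("/tmp/pandas_written.h5", key="bar", mode="w")

with h5py.File("/tmp/pandas_written.h5", "r") as hdf:
    print(list(hdf.keys()))        # ['bar']
    print(dict(hdf["bar"].attrs))  # includes pandas-specific attrs like 'pandas_type'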
Is there a straightforward way to read an arbitrary HDF5 file into pandas or PyTables? I can load the data using h5py if I need to, but the files are large enough that I'd like to avoid loading them into memory. So ideally I'd like to work in pandas and pyarrow as much as I can.
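For completeness, this is the kind of h5py-to-pyarrow fallback I'm trying to avoid hand-rolling. It's only a sketch, assuming the three items in "bar" are equal-length 1-D numeric datasets (the output path and chunk size are made up):

import h5py
import pyarrow as pa
import pyarrow.parquet as pq

file_path = "/data/some_file.hdf5"
chunk_size = 1_000_000  # made up; tune to available memory

with h5py.File(file_path, "r") as hdf:
    group = hdf["bar"]
    names = list(group.keys())
    n_rows = group[names[0]].shape[0]
    schema = pa.schema(
        [(name, pa.from_numpy_dtype(group[name].dtype)) for name in names]
    )
    with pq.ParquetWriter("/data/some_file.parquet", schema) as writer:
        for start in range(0, n_rows, chunk_size):
            stop = min(start + chunk_size, n_rows)
            # h5py slicing reads only the requested rows from disk
            table = pa.table(
                {name: group[name][start:stop] for name in names},
                schema=schema,
            )
            writer.write_table(table)

Even this feels like it's re-implementing something pandas/pyarrow should already handle, hence the question.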