I have a bunch of hdf5 files, and I want to turn some of the data in them into parquet files. I'm struggling to read them into pandas/pyarrow, though, which I think is related to the way the files were originally created.

If I open the file using h5py the data looks exactly how I would expect.

import h5py

file_path = "/data/some_file.hdf5"
hdf = h5py.File(file_path, "r")
print(list(hdf.keys()))

gives me

>>> ['foo', 'bar', 'baz']

In this case I'm interested in the group "bar", which has 3 items in it.

If I try to read the data in using HDFStore I am unable to access any of the groups.

import pandas as pd

file_path = "/data/some_file.hdf5"
store = pd.HDFStore(file_path, "r")

Then the HDFStore object has no keys or groups.

assert not store.groups()
assert not store.keys()

And if I try to access the data I get the following error

bar = store.get("/bar")
TypeError: cannot create a storer if the object is not existing nor a value are passed

Similarly, if I try to use pd.read_hdf it looks like the file is empty.

import pandas as pd

file_path = "/data/some_file.hdf5"
df = pd.read_hdf(file_path, mode="r")
ValueError: Dataset(s) incompatible with Pandas data types, not table, or no datasets found in HDF5 file.

and

import pandas as pd

file_path = "/data/some_file.hdf5"
pd.read_hdf(file_path, key="/bar", mode="r")
TypeError: cannot create a storer if the object is not existing nor a value are passed

Based on this answer I'm assuming that the problem is related to the fact that Pandas is expecting a very particular hierarchical structure, which is different to the one that the actual hdf5 file has.

Is there a straightforward way to read an arbitrary hdf5 file into pandas or pytables? I can load the data using h5py if I need to, but the files are large enough that I'd like to avoid loading them fully into memory. So ideally I'd like to work in pandas and pyarrow as much as I can.

  • If the data is loaded into a DataFrame it is in memory. Looks like you need to read the datasets as numpy arrays, and make the dataframe from those. Often pandas uses arrays without further copying. – hpaulj Mar 08 '22 at 01:19
  • You are correct -- Pandas uses a very specific schema (hierarchical structure) to create and read HDF5 files. The Pandas layout is shown in the referenced answer (as `axis0, axis1, block1_items`, etc.). It is a valid HDF5 schema, just not one the average user would create from NumPy arrays with h5py or PyTables. What do you want to do with the data in `'bar'`? As @hpaulj said, you can read the data with h5py and load it into a dataframe. h5py dataset objects "behave like" numpy arrays, but have a small memory footprint. – kcw78 Mar 08 '22 at 01:51
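A minimal sketch of what these comments suggest: read each dataset under a group as a numpy array with h5py and build the dataframe from those arrays. The group name ("bar") and dataset names ("x", "y") are invented for the demo, and the sketch writes its own small file first so it runs standalone.

```python
import h5py
import numpy as np
import pandas as pd

# Demo setup: create a small HDF5 file with a "bar" group holding two
# column-like datasets. The names here are assumptions for illustration.
path = "demo.hdf5"
with h5py.File(path, "w") as f:
    grp = f.create_group("bar")
    grp.create_dataset("x", data=np.arange(5))
    grp.create_dataset("y", data=np.linspace(0.0, 1.0, 5))

# Read the datasets back as numpy arrays and assemble a DataFrame.
with h5py.File(path, "r") as f:
    group = f["bar"]
    # dataset[()] reads the whole dataset into memory as an ndarray;
    # slice instead (e.g. group[name][:1000]) to read only part of a
    # large dataset.
    data = {name: group[name][()] for name in group.keys()}

df = pd.DataFrame(data)
print(df.shape)  # (5, 2)
```

Pandas can usually wrap these arrays without an extra copy, so the peak memory cost is close to the size of the datasets you actually read.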

1 Answer

I had a similar problem with not being able to read an hdf5 file into a pandas df. Based on this post I made a script that turns the hdf5 into a dictionary and then the dictionary into a pandas df, like this:

import h5py
import pandas as pd


dictionary = {}
with h5py.File(filename, "r") as f:
    for key in f.keys():
        print(key)

        ds_arr = f[key][()]   # returns as a numpy array
        dictionary[key] = ds_arr # stores the array in the dict under the key

df = pd.DataFrame.from_dict(dictionary)

This works as long as each of the hdf5 keys (f.keys()) is simply the name of a column you want in the pandas df, and not a group name. Groups give hdf5 a hierarchical structure that has no direct equivalent in a flat pandas df. If a group appears in the hierarchy above the keys, e.g. one named data_group, what worked for me as an alternative was to substitute f.keys() with f['data_group'].keys() and f[key] with f['data_group'][key].
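A sketch of that group variant; the group name "data_group" and the column names "a" and "b" are made up for the demo, and the file is created inline so the example is self-contained:

```python
import h5py
import pandas as pd

# Demo setup: the columns live under a group rather than at the top
# level of the file. All names here are assumptions for illustration.
path = "grouped_demo.hdf5"
with h5py.File(path, "w") as f:
    g = f.create_group("data_group")
    g.create_dataset("a", data=[1, 2, 3])
    g.create_dataset("b", data=[4.0, 5.0, 6.0])

dictionary = {}
with h5py.File(path, "r") as f:
    for key in f["data_group"].keys():              # instead of f.keys()
        dictionary[key] = f["data_group"][key][()]  # instead of f[key][()]

df = pd.DataFrame.from_dict(dictionary)
```

From here df.to_parquet(...) (with pyarrow installed) would finish the hdf5-to-parquet conversion the question asks about.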

NeStack