I am working with 4-50 GB HDF5 files. One file has an HDF5 group ("outer_group") with several subgroups. The innermost subgroup ("sub_sub_group") has a dataset ("dataset1"), which I want to read and convert to a Pandas DataFrame. "dataset1" contains four columns "A", "B", "C", and "D" with mixed data types (string, numeric, byte, etc.).
- test_file.h5
  - outer_group
    - sub_group
      - sub_sub_group
        - dataset1
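For reference, the layout above is what h5py reports when I walk the file; a quick sketch of how to print it (the lambda simply prints the path and type of every object it visits):

import h5py

with h5py.File("test_file.h5", "r") as f:
    # print the full path and type of every group/dataset in the file
    f.visititems(lambda name, obj: print(name, type(obj)))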
This is how I am currently loading the data into a Pandas DataFrame:

import h5py
import pandas as pd

f = h5py.File("test_file.h5", "r")

# read the entire compound dataset into memory, then build the DataFrame
values_df = pd.DataFrame.from_records(
    f["outer_group"]["sub_group"]["sub_sub_group"]["dataset1"][()],
    columns=["A", "B", "C", "D"],
)
However, this takes several minutes to load. Does anyone know of a faster way to load the "dataset1" HDF5 dataset into a Pandas DataFrame when working with large files? I have looked into read_hdf and HDFStore, but I am not sure how to access a dataset nested this deeply. I also read here that my current approach (combining h5py with Pandas) might lead to issues.
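For what it is worth, a direct read_hdf call would look roughly like the sketch below; the key path is my guess, and I do not know whether read_hdf can handle a file that was not written by Pandas/PyTables in the first place:

import pandas as pd

# attempt to read the nested dataset directly; the key string is an assumption on my part
values_df = pd.read_hdf(
    "test_file.h5",
    key="/outer_group/sub_group/sub_sub_group/dataset1",
)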
Any insight is greatly appreciated.