
I am working with 4-50 GB HDF5 files. One file has an HDF5 Group ("outer_group") with several nested subgroups. The innermost subgroup ("sub_sub_group") has a dataset ("dataset1"), which I want to read and convert to a Pandas data frame. "dataset1" contains four columns "A", "B", "C", and "D" with mixed data types (string, numeric, byte, etc.).

  • test_file.h5
    • outer_group
      • sub_group
        • sub_sub_group
          • dataset1

This is how I am currently loading the data into a Pandas dataframe:

import h5py
import pandas as pd

# Open the file read-only, pull the full dataset into memory as a
# structured numpy array, then build the DataFrame from it.
with h5py.File("test_file.h5", "r") as f:
    values_df = pd.DataFrame.from_records(
        f["outer_group"]["sub_group"]["sub_sub_group"]["dataset1"][()],
        columns=["A", "B", "C", "D"],
    )

However, this takes several minutes to load. Does anyone know of a faster way to load the "dataset1" HDF5 dataset into a Pandas data frame when working with large files? I have looked into read_hdf and HDFStore, but I am not sure how to access this specific dataset. I also read here that my current approach (combining h5py with Pandas) might lead to issues.
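For what it's worth, pd.read_hdf does accept the full group path as its key argument. A minimal, self-contained sketch, using a small stand-in file written in pandas/PyTables format (the column values here are illustrative, not from the real data):

```python
import pandas as pd

# Write a small stand-in file nested under the same group path as the
# original file (illustrative values only).
demo = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0],
                     "C": ["x", "y"], "D": [True, False]})
key = "/outer_group/sub_group/sub_sub_group/dataset1"
demo.to_hdf("read_hdf_demo.h5", key=key)

# read_hdf can address the nested dataset directly via its full path.
# Note that this goes through PyTables, so it is suited to files written
# by pandas/PyTables rather than arbitrary h5py layouts.
values_df = pd.read_hdf("read_hdf_demo.h5", key=key)
```

Whether this is faster than h5py for a generic HDF5 file is not guaranteed; it mainly avoids the manual group traversal.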

Any insight is greatly appreciated.

M. P.
  • Have you tried pd.read_hdf("./test_file.h5", key='/outer_group/sub_group/sub_sub_group/dataset1') – Rushikesh Talokar May 14 '21 at 00:31
  • `f["outer_group"]["sub_group"]["sub_sub_group"]["dataset1"][()]` should be producing a structured `numpy` array. What's array/dataframe size? I haven't seen speed comparisons, but as far as I know, `h5py` does a fairly direct load of the `HDF5` source, so should be competitive. – hpaulj May 14 '21 at 01:10
  • @Rushikesh, thank you; that answers how to access the specific dataset. It turned out that pd.read_hdf() took 100 seconds, compared to 83 seconds with h5py and pd.DataFrame.from_records. – M. P. May 14 '21 at 18:46
  • @hpaulj Thanks for your input. The array shape is (10852708, 4). The h5py.File method + pd.DataFrame.from_records completed after 83 seconds. I wonder if this is the standard expectation for timing for an array this large. – M. P. May 14 '21 at 18:46
  • If you read it into a numpy array (`f...['dataset1'][()]`) and then convert that to pandas afterwards, how long does each step take? That might show which bit is slow. And how big is one dataset in bytes? – Thomas K May 28 '21 at 17:43
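The two-step timing that Thomas K suggests in the last comment can be sketched as follows, using a small stand-in file with the same group layout (the dtype and values are assumptions for illustration):

```python
import time

import h5py
import numpy as np
import pandas as pd

# Create a small stand-in file with the same nesting as the real one.
arr = np.zeros(1000, dtype=[("A", "i8"), ("B", "f8"), ("C", "S4"), ("D", "i4")])
with h5py.File("timing_demo.h5", "w") as f:
    f.create_dataset("outer_group/sub_group/sub_sub_group/dataset1", data=arr)

with h5py.File("timing_demo.h5", "r") as f:
    t0 = time.perf_counter()
    # Step 1: HDF5 -> structured numpy array
    raw = f["outer_group/sub_group/sub_sub_group/dataset1"][()]
    t1 = time.perf_counter()
    # Step 2: numpy array -> Pandas DataFrame
    df = pd.DataFrame.from_records(raw, columns=["A", "B", "C", "D"])
    t2 = time.perf_counter()

print(f"h5py read: {t1 - t0:.4f}s, DataFrame conversion: {t2 - t1:.4f}s")
```

Comparing the two numbers on the real file would show whether the disk read or the DataFrame construction dominates the 83 seconds.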

0 Answers