
I am working with 4-50 GB HDF5 files. One file has an HDF5 Group ("outer_group") with several nested subgroups. The innermost subgroup ("sub_sub_group") has a dataset ("dataset1"), which I want to read and convert to a Pandas data frame. "dataset1" contains four columns "A", "B", "C", and "D" with mixed data types (string, numeric, byte, etc.).

  • test_file.h5
    • outer_group
      • sub_group
        • sub_sub_group
          • dataset1

This is how I am currently loading the data into a Pandas dataframe:

import h5py
import pandas as pd

# Open the file read-only, pull the full dataset into memory as a
# structured numpy array, then build the DataFrame from it.
with h5py.File("test_file.h5", "r") as f:
    values_df = pd.DataFrame.from_records(
        f["outer_group"]["sub_group"]["sub_sub_group"]["dataset1"][()],
        columns=["A", "B", "C", "D"],
    )

However, this takes several minutes to load. Does anyone know of a faster way to load the "dataset1" HDF5 dataset into a Pandas data frame when working with large files? I have looked into read_hdf and HDFStore, but I am not sure how to access this specific dataset. I also read here that my current approach (combining h5py with Pandas) might lead to issues.
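For what it's worth, pd.read_hdf does accept the full group path as its key argument. A minimal, self-contained sketch, using a small stand-in file written in pandas/PyTables format (the column values here are illustrative, not from the real data):

```python
import pandas as pd

# Write a small stand-in file nested under the same group path as the
# original file (illustrative values only).
demo = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0],
                     "C": ["x", "y"], "D": [True, False]})
key = "/outer_group/sub_group/sub_sub_group/dataset1"
demo.to_hdf("read_hdf_demo.h5", key=key)

# read_hdf can address the nested dataset directly via its full path.
# Note that this goes through PyTables, so it is suited to files written
# by pandas/PyTables rather than arbitrary h5py layouts.
values_df = pd.read_hdf("read_hdf_demo.h5", key=key)
```

Whether this is faster than h5py for a generic HDF5 file is not guaranteed; it mainly avoids the manual group traversal.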

Any insight is greatly appreciated.

M. P.
  • Have you tried pd.read_hdf("./test_file.h5", key='/outer_group/sub_group/sub_sub_group/dataset1') – Rushikesh Talokar May 14 '21 at 00:31
  • `f["outer_group"]["sub_group"]["sub_sub_group"]["dataset1"][()]` should be producing a structured `numpy` array. What's array/dataframe size? I haven't seen speed comparisons, but as far as I know, `h5py` does a fairly direct load of the `HDF5` source, so should be competitive. – hpaulj May 14 '21 at 01:10
  • @Rushikesh, thank you; that answers how to access the specific dataset. It turned out that pd.read_hdf() took 100 seconds, compared to 83 seconds with h5py and pd.DataFrame.from_records. – M. P. May 14 '21 at 18:46
  • @hpaulj Thanks for your input. The array shape is (10852708, 4). The h5py.File method + pd.DataFrame.from_records completed after 83 seconds. I wonder if this is the standard expectation for timing for an array this large. – M. P. May 14 '21 at 18:46
  • If you read it into a numpy array (`f...['dataset1'][()]`) and then convert that to pandas afterwards, how long does each step take? That might show which bit is slow. And how big is one dataset in bytes? – Thomas K May 28 '21 at 17:43
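The two-step timing that Thomas K suggests in the last comment can be sketched as follows, using a small stand-in file with the same group layout (the dtype and values are assumptions for illustration):

```python
import time

import h5py
import numpy as np
import pandas as pd

# Create a small stand-in file with the same nesting as the real one.
arr = np.zeros(1000, dtype=[("A", "i8"), ("B", "f8"), ("C", "S4"), ("D", "i4")])
with h5py.File("timing_demo.h5", "w") as f:
    f.create_dataset("outer_group/sub_group/sub_sub_group/dataset1", data=arr)

with h5py.File("timing_demo.h5", "r") as f:
    t0 = time.perf_counter()
    # Step 1: HDF5 -> structured numpy array
    raw = f["outer_group/sub_group/sub_sub_group/dataset1"][()]
    t1 = time.perf_counter()
    # Step 2: numpy array -> Pandas DataFrame
    df = pd.DataFrame.from_records(raw, columns=["A", "B", "C", "D"])
    t2 = time.perf_counter()

print(f"h5py read: {t1 - t0:.4f}s, DataFrame conversion: {t2 - t1:.4f}s")
```

Comparing the two numbers on the real file would show whether the disk read or the DataFrame construction dominates the 83 seconds.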

0 Answers