
Goal

Read the data component of an HDF5 file in R.

Problem

I am using rhdf5 to read HDF5 files in R. Out of 75 files, 61 were read successfully, but the remaining 14 fail with a memory error, even though some of them are shorter than files that were read without problems.
Running the failing files individually in a fresh R session gives the same error.
An example follows:

# Exploring the contents of the file:
library(rhdf5)

h5ls("music_0_math_0_simple_12_2022_08_08.hdf5")
    group                                   name       otype  dclass         dim
0       /                                   data   H5I_GROUP                    
1   /data                              ACC_State H5I_DATASET INTEGER       1 x 1
2   /data                       ACC_State_Frames H5I_DATASET INTEGER           1
3   /data                            ACC_Voltage H5I_DATASET   FLOAT   24792 x 1
4   /data                    AUX_CACC_Adjust_Gap H5I_DATASET INTEGER   24792 x 1
... CONTINUES ----

# Reading the file:
rhdf5::h5read("music_0_math_0_simple_12_2022_08_08.hdf5", name = "data")
Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  : 
  Not enough memory to read data! Try to read a subset of data by specifying the index or count parameter.
In addition: Warning message:
In h5checktypeOrOpenLoc(file, readonly = TRUE, fapl = NULL, native = native) :
  An open HDF5 file handle exists. If the file has changed on disk meanwhile, the function may not work properly. Run 'h5closeAll()' to close all open HDF5 object handles.
Error: Error in h5checktype(). H5Identifier not valid.

I can read the same file in Python:

import h5py
filename = "music_0_math_0_simple_12_2022_08_08.hdf5"

hf = h5py.File(filename, "r")
hf.keys()
data = hf.get('data')
data['SCC_Follow_Info']
#<HDF5 dataset "SCC_Follow_Info": shape (9, 24792), type "<f4">

How can I successfully read the file in R?

umair durrani

1 Answer


When you ask for the data group, rhdf5 reads every dataset beneath it into R's memory. It's not clear from your example how much data that is, but for some of your files it may really be more than the memory available on your computer. I don't know how Python works under the hood, but perhaps it doesn't read any dataset until you run data['SCC_Follow_Info']?
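To get a rough idea of how much data that is, the dimensions reported by h5ls() can be turned into an approximate in-memory size per dataset. This is only a sketch: estimate_dataset_sizes is a hypothetical helper, not part of rhdf5, and it assumes 8 bytes per element, which overestimates integers and says little about strings or compound types.

```r
library(rhdf5)

# Hypothetical helper (not part of rhdf5): rough in-memory size, in MB,
# of every dataset in a file, based on the dimensions shown by h5ls().
estimate_dataset_sizes <- function(file) {
  contents <- h5ls(file)
  datasets <- contents[contents$otype == "H5I_DATASET", ]

  # h5ls() reports dimensions as text such as "24792 x 1"; convert that
  # into a number of elements. Unusual dim strings end up as NA.
  n_elements <- vapply(
    strsplit(as.character(datasets$dim), " x ", fixed = TRUE),
    function(d) suppressWarnings(prod(as.numeric(d))),
    numeric(1)
  )

  # Assumption: 8 bytes per element (an R double).
  data.frame(
    path = sub("^//", "/", paste(datasets$group, datasets$name, sep = "/")),
    dclass = datasets$dclass,
    approx_mb = n_elements * 8 / 1024^2
  )
}
```

Calling estimate_dataset_sizes("music_0_math_0_simple_12_2022_08_08.hdf5") and summing the approx_mb column would give a ballpark figure to compare against the memory available to R.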

One option is to be more selective: rather than reading the entire data group, read only the dataset you need at that moment. In the Python example that seems to be /data/SCC_Follow_Info.

You can do that with something like:

follow_info <- h5read(file = "music_0_math_0_simple_12_2022_08_08.hdf5", 
                      name = "/data/SCC_Follow_Info")

Once you've finished working with that dataset, remove it from your R session, e.g. with rm(follow_info), and then read the next dataset or file you need.
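The same read-then-remove pattern can be applied to every dataset in the group in turn. The following is a minimal sketch, assuming all datasets of interest sit directly under /data as in the h5ls() output from the question; process_datasets and its fun argument are hypothetical names introduced here for illustration.

```r
library(rhdf5)

# Hypothetical helper (not part of rhdf5): apply `fun` to each dataset
# directly under `group`, holding only one dataset in memory at a time.
process_datasets <- function(file, group = "/data", fun = length) {
  contents <- h5ls(file)
  keep <- contents$group == group & contents$otype == "H5I_DATASET"
  dataset_names <- contents$name[keep]

  results <- list()
  for (nm in dataset_names) {
    # Read a single dataset by its full path, e.g. "/data/ACC_Voltage"
    x <- h5read(file, name = paste(group, nm, sep = "/"))
    results[nm] <- list(fun(x))
    rm(x)  # drop the dataset before reading the next one
  }
  results
}
```

For example, process_datasets("music_0_math_0_simple_12_2022_08_08.hdf5", fun = dim) would return the dimensions of each dataset without ever holding the whole group in memory. If an earlier failed read left handles open, running h5closeAll() first, as the warning in the question suggests, may avoid the "H5Identifier not valid" error.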

Grimbough