I've been using the 'rhdf5' package in R recently and have found it very useful, until I attempted to read a file of 190 MB or larger. In particular, I'm grabbing the data from a database, writing it to HDF5 format (successfully, regardless of size) and then reading it back into R at a later time. When my file size exceeds 190 MB, I get the following error:
Error: segfault from C stack overflow
In my case, this corresponds to a data frame with roughly 1,950,000 rows.
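For context, the round trip itself is just the default h5write()/h5read() calls; a rough sketch of what I'm doing is below (the file and object names are placeholders, and the real data frame comes from a database query):

library(rhdf5)
# writing succeeds regardless of size
h5createFile("big.h5")
h5write(big_df, file="big.h5", name="dat")
# later, in a new session: this read is what segfaults once the file passes ~190 MB
big_df2 = h5read("big.h5", "dat")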
While reading the package documentation, I got the idea that chunking the data might solve the problem. However, chunking doesn't seem to work for data frames, which are written as compound datasets. Here's some example code:
library(rhdf5)

# save a matrix with chunking: works
mat = cbind(1:10, 11:20)
h5createFile("test.h5")
h5createDataset(file="test.h5", dataset="dat", dim=c(10,2), chunk=c(5,2), level=7)
h5write(mat, file="test.h5", name="dat")
# convert to data frame: won't work now
df = as.data.frame(mat)
df[,2] = as.character(mat[,2])
h5createFile("test2.h5")
h5createDataset(file="test2.h5", dataset="dat", dim=c(10,2), chunk=c(5,2), level=7)
h5write(df, file="test2.h5", name="dat")
#h5write(df, file="test2.h5", name="dat", index=list(1:10, 1:2))
# try to use alternate function
fid = H5Fcreate("test3.h5")
h5createDataset(file="test3.h5", dataset="dat", dim=c(10,2), chunk=c(5,2), level=7)
h5writeDataset.data.frame(df, fid, name="dat", level=7, DataFrameAsCompound=FALSE)
#h5writeDataset.data.frame(df, fid, name="dat", level=7, DataFrameAsCompound=FALSE, index=list(1:10,1:2))
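One workaround I'm considering, sketched below and not yet tested at full scale (file, group and size values are placeholders), is to sidestep the compound type entirely: write each column of the data frame as its own chunked dataset inside a group, then reassemble the data frame after reading the columns back individually.

library(rhdf5)
df = data.frame(a=1:10, b=as.character(11:20), stringsAsFactors=FALSE)
h5createFile("test4.h5")
h5createGroup("test4.h5", "dat")
# one chunked dataset per column, so no compound type is involved
h5createDataset(file="test4.h5", dataset="dat/a", dims=10, chunk=5, level=7, storage.mode="integer")
h5createDataset(file="test4.h5", dataset="dat/b", dims=10, chunk=5, level=7, storage.mode="character", size=20)
h5write(df$a, file="test4.h5", name="dat/a")
h5write(df$b, file="test4.h5", name="dat/b")
# read the columns back individually and reassemble the data frame
df2 = data.frame(a=h5read("test4.h5", "dat/a"),
                 b=h5read("test4.h5", "dat/b"),
                 stringsAsFactors=FALSE)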
It's possible that chunking won't help at all. Either way, I'd appreciate any advice on reading large HDF5 files into R.
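In case it's relevant for an answer, the kind of block-wise read I'd be happy with looks roughly like this (assuming the data end up stored as a plain, non-compound two-column dataset named "dat"; the block size is arbitrary):

library(rhdf5)
n_rows = 1950000
block = 100000
starts = seq(1, n_rows, by=block)
pieces = lapply(starts, function(s) {
  e = min(s + block - 1, n_rows)
  # index = list(rows, columns); NULL selects all columns
  h5read("big.h5", "dat", index=list(s:e, NULL))
})
dat = do.call(rbind, pieces)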