
I am dealing with a large dataset (on the order of terabytes) that comes in .h5 format. To reduce the size of the dataset on my disk, I have extracted the variables of my choice from these .h5 files and saved the resulting data frame in R's compressed binary format, as shown below (just an example):

library(rhdf5)   # provides h5read()

# Keep only the .h5 files; list.files() takes a regular expression,
# so "\\.h5$" (not the glob "*.h5") matches the extension
files <- list.files(path = "path/to/data", recursive = FALSE,
                    pattern = "\\.h5$", full.names = TRUE)

for (file in files) {
  print(file)
  # Extract only the variables of interest
  time <- h5read(file, "/ABBY/dp0p/data/irgaTurb/000_050/time")
  CO2  <- h5read(file, "/ABBY/dp0p/data/irgaTurb/000_050/densMoleCo2")
  df   <- data.frame(DateTime = time, densMoleCo2 = CO2)
  # Save as a compressed R binary file, replacing the .h5 extension
  filename <- sub("\\.h5$", ".data", file)
  save(df, file = filename, compress = TRUE)
}

I have the following questions:

  • Is my way of saving the data optimal, or is there a better way to store the data frame df on disk in a reduced-size format? (A saveRDS() variant is sketched after this list.)
  • The saved file can be opened in R with the load() function, but I have to open it using Python for some reason. How can I open the compressed binary file in Python?
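
For the first question, one variant worth benchmarking is saveRDS(), which serializes a single object and exposes the compression method directly. A minimal sketch (the "xz" setting is just one option to compare, not a recommendation):

library(rhdf5)

files <- list.files(path = "path/to/data", pattern = "\\.h5$", full.names = TRUE)
for (file in files) {
  time <- h5read(file, "/ABBY/dp0p/data/irgaTurb/000_050/time")
  CO2  <- h5read(file, "/ABBY/dp0p/data/irgaTurb/000_050/densMoleCo2")
  df   <- data.frame(DateTime = time, densMoleCo2 = CO2)
  # saveRDS() writes one object per file; compress may be "gzip", "bzip2", or "xz"
  saveRDS(df, file = sub("\\.h5$", ".rds", file), compress = "xz")
}

An .rds file restores in R with readRDS(), and, relevant to the second question, the pyreadr package mentioned in the comments below can read both .rds and .RData files from Python.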
raghav
  • Not sure about your first question, but this might help with the second: https://stackoverflow.com/questions/21288133/loading-rdata-files-into-python Looks like there's a `pyreadr` package – Hobo Aug 07 '21 at 22:56
    Perhaps the feather format would be more appropriate? You would need to run some benchmarks though, e.g. https://gist.github.com/gansanay/4514ec731da1a40d8811a2b3c313f836 – jared_mamrot Aug 07 '21 at 23:23
  • @jared_mamrot I tested multiple formats and found that my way of saving the data was optimal. Saving the data in the feather format resulted in a file size of ~48 MB, compared to ~28 MB for my compressed binary format, and processing time was three times faster. Also, R wins over Python in processing .h5 files. – raghav Aug 08 '21 at 05:21
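
For reference, the feather comparison described in the comments above can be reproduced along these lines (a sketch using the arrow package; df and the file names are placeholders):

library(arrow)

# Write the same data frame in feather format and compare on-disk sizes
write_feather(df, "sample.feather")
file.size("sample.feather")   # ~48 MB reported above
file.size("sample.data")      # ~28 MB reported above for save(..., compress = TRUE)

# Rough write-time comparison of the two formats
system.time(write_feather(df, "sample.feather"))
system.time(save(df, file = "sample.data", compress = TRUE))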

0 Answers