I am dealing with a large dataset (~TBs) that comes in .h5 format. To reduce the size of the dataset on my disk, I extract the variables of interest from these .h5 files and save the resulting data frame in R's compressed binary format, as shown below (just an example):
library(rhdf5)  # h5read() comes from the rhdf5 Bioconductor package

files <- list.files(path = "path/to/data", recursive = FALSE,
                    pattern = "\\.h5$", full.names = TRUE)
files <- files[grep("\\.h5$", files)]  # only need the H5 files for data extraction

for (file in files) {
  print(file)
  time <- h5read(file, "/ABBY/dp0p/data/irgaTurb/000_050/time")
  CO2  <- h5read(file, "/ABBY/dp0p/data/irgaTurb/000_050/densMoleCo2")
  df <- data.frame(DateTime = time, densMoleCo2 = CO2)
  filename <- paste0(substr(file, 1, nchar(file) - 3), ".data")
  save(df, file = filename, compress = TRUE)
}
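For context on what the save() call produces: save() stores the object together with its R name (so load() later restores it as df), whereas saveRDS() serializes a single object that the reader names on load. A minimal sketch of the saveRDS() variant, reusing the same rhdf5 paths as above and assuming the files vector from the listing step:

```r
# Same extraction as above, but with saveRDS(); readRDS() then returns
# the object directly instead of restoring it under the name "df".
library(rhdf5)

for (file in files) {
  time <- h5read(file, "/ABBY/dp0p/data/irgaTurb/000_050/time")
  CO2  <- h5read(file, "/ABBY/dp0p/data/irgaTurb/000_050/densMoleCo2")
  df <- data.frame(DateTime = time, densMoleCo2 = CO2)
  # swap the .h5 extension for .rds; "xz" trades speed for a smaller file
  saveRDS(df, file = sub("\\.h5$", ".rds", file), compress = "xz")
}
```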
I have the following questions:
- Is my way of saving the data optimal, or is there a better way to store the data frame "df" on disk in a reduced-size format?
- The saved file can be opened in R using the load() function, but I have to open the file using Python for some reason. How can I open this compressed binary file in Python?