14

I would like to save a whole bunch of relatively large data frames while minimizing the space that the files take up. When opening the files, I need to be able to control what names they are given in the workspace.

Basically I'm looking for the semantics of dput and dget but with binary files.

Example:

n<-10000

for(i in 1:100){
    dat<-data.frame(a=rep(c("Item 1","Item 2"),n/2),b=rnorm(n),
        c=rnorm(n),d=rnorm(n),e=rnorm(n))
    dput(dat,paste("data",i,sep=""))
}


##much later


##extract 3 random data sets and bind them
for(i in 1:10){
    nums<-sample(1:100,3)
    comb<-rbind(dget(paste("data",nums[1],sep="")),
            dget(paste("data",nums[2],sep="")),
            dget(paste("data",nums[3],sep="")))
    ##do stuff here
}
Ian Fellows

2 Answers

23

Your best bet is to use rda files. You can use the save() and load() commands to write and read:

set.seed(101)
a = data.frame(x1=runif(10), x2=runif(10), x3=runif(10))

save(a, file="test.rda")
load("test.rda")

Edit: For completeness, just to cover what Harlan's suggestion might look like (i.e. wrapping the load command to return the data frame):

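# Load into a private environment, then return the object by name.
# Passing the name as a string avoids picking up a same-named object
# from the caller's workspace.
loadx <- function(name, file) {
  env <- new.env()
  load(file, envir = env)
  get(name, envir = env)
}

b <- loadx("a", "test.rda")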
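If the name stored in the file is not known in advance, one variant of this wrapper loads the file into a scratch environment and returns whatever single object it finds. This is only a sketch: `load_single` is a hypothetical helper name, and it assumes each .rda file holds exactly one object.

```r
# Hypothetical helper: load an .rda file holding exactly one object and
# return that object, so the caller chooses the variable name.
load_single <- function(file) {
  env <- new.env()
  nms <- load(file, envir = env)   # load() returns the names it restored
  stopifnot(length(nms) == 1)      # assumption: one object per file
  get(nms, envir = env)
}

# Usage: save under one name, read back under another.
tmp <- tempfile(fileext = ".rda")
a <- data.frame(x1 = runif(10), x2 = runif(10))
save(a, file = tmp)
b <- load_single(tmp)
```

Because nothing is loaded into the calling workspace, an existing variable called `a` is never overwritten.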

Alternatively, have a look at the hdf5, RNetCDF and ncdf packages. I've experimented with the hdf5 package in the past; this uses the NCSA HDF5 library. It's very simple:

hdf5save(fileout, ...)
hdf5load(file, load = TRUE, verbosity = 0, tidy = FALSE)

A last option is to use binary file connections, but that won't work well in your case because readBin and writeBin only support vectors.

Here's a trivial example. First open the connection in write mode ("w") with "b" appended for binary, and write some data:

zz <- file("testbin", "wb")
writeBin(1:10, zz)
close(zz)

Then open the connection in read mode ("r"), again with "b" appended, and read the data back:

zz <- file("testbin", "rb")
readBin(zz, integer(), 4)
close(zz)
Shane
  • Nice answer Shane. I'd like to use 'save', but don't like the fact that I can't control the name of the data on loading – Ian Fellows Oct 28 '09 at 15:00
  • You could wrap the load() function in a new function that knows the name of the data in the file and renames it for a return value. The load function will insert the variables into the environment/namespace of the function. – Harlan Oct 28 '09 at 15:22
  • You can do what Harlan suggested, or you can just save one dataframe per file, and give both the file and dataframe the same name. Then you will have the same behavior as what you described above with dput and dget, right? – Shane Oct 28 '09 at 15:38
  • 1
    You have basically reinvented `readRDS`
  • You can pass a `compress` argument with a value of `bzip2` or `xz` to `save` to use a more efficient compression algorithm. The default is `gzip`. The new command would be `save(a, file="test.rda", compress="xz")` – Dan Gerlanc Oct 02 '12 at 15:52
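To see what the `compress` argument mentioned in the comments buys on a particular data set, the file sizes can simply be measured. The sketch below is an illustration under assumptions: the data frame is made up, the files go to temporary paths, and the ranking of the three methods depends heavily on the data (random numbers compress poorly).

```r
# Compare the on-disk size of the same data frame under each compress setting.
set.seed(101)
a <- data.frame(x1 = runif(1000), x2 = runif(1000), x3 = runif(1000))

sizes <- sapply(c("gzip", "bzip2", "xz"), function(method) {
  f <- tempfile(fileext = ".rda")
  save(a, file = f, compress = method)
  file.info(f)$size   # size in bytes
})
print(sizes)
```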
12

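Have a look at saveRDS and readRDS. They serialize a single R object to a file, and readRDS returns the object rather than restoring it under its saved name, so you can assign it to whatever name you like.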

x = data.frame(x1=runif(10), x2=runif(10), x3=runif(10))

saveRDS(x, file="myDataFile.rds")
x <- readRDS(file="myDataFile.rds")
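Applied to the loop in the question, this might look like the sketch below. It is scaled down (10 files of 1,000 rows) to run quickly, and the temporary directory, the `data<i>.rds` file names, and the choice of `compress = "xz"` are assumptions for illustration. Since `readRDS` returns the object, the result can be bound to any name, much like `dget`.

```r
n <- 1000
n_files <- 10
out_dir <- tempfile("rds_demo")   # assumed location; any writable directory works
dir.create(out_dir)

# Write the data frames, one per file, with xz compression.
for (i in seq_len(n_files)) {
  dat <- data.frame(a = rep(c("Item 1", "Item 2"), n / 2),
                    b = rnorm(n), c = rnorm(n), d = rnorm(n), e = rnorm(n))
  saveRDS(dat, file.path(out_dir, paste0("data", i, ".rds")), compress = "xz")
}

# Much later: read 3 random data sets and bind them.
nums <- sample(seq_len(n_files), 3)
comb <- do.call(rbind, lapply(nums, function(i) {
  readRDS(file.path(out_dir, paste0("data", i, ".rds")))
}))
```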
hadley
wind
  • 4
    Out of curiosity: why would someone use these over save/load? Is there some particular benefit? – Shane Oct 29 '09 at 12:41
  • 1
    In R 2.13 they are no longer internal. You use them when you want to save a single object, not multiple objects like `save()` – hadley Apr 20 '11 at 14:01
  • I get: Error: could not find function "readRDS", same for saveRDS. What library needs to be loaded? – Translunar Sep 20 '11 at 19:54
  • mohawkjohn - they are part of base R, no need to load anything in order to use them. – Tal Galili Apr 12 '13 at 10:31