
Let's imagine we have this situation:

  1. I have a lot of .RData files, each larger than 100 MB (the exact size doesn't matter, but they are big).
  2. Every .RData file contains a dataset called "Dataset_of_interest", and all of them are parts of one big data set that I want to create.

Is it possible to load into memory only the datasets I am interested in, without loading the entire .RData files?

I would like to load each 'Dataset_of_interest' in a loop, merge them into one big data set, and then save it to a single file.

EDIT: I work on Windows 7.

Maciej
    There are several questions on SO about whether it is possible to only load specific contents from .RData files without first loading them. It seems that [it is not possible](http://stackoverflow.com/questions/4831050/listing-contents-of-an-r-data-file-without-loading) to do so. – A5C1D2H2I1M1N2O1R2T1 Aug 07 '12 at 06:35
    [this question](http://stackoverflow.com/questions/8700619/get-specific-object-from-rdata-file) shows how to convert to lazy-loadable data sets and files, which (I think) is what you want to do. – mnel Aug 07 '12 at 06:47

1 Answer


I would argue that this is possible, but it would require some parallel processing capabilities. Each worker would load one .RData file and return only the desired object. Merging the results would probably be pretty straightforward.

I can't provide code for your data because I don't know the structure, but I would do something along the lines of the below chunk'o'code. Note that I'm on Windows and your workflow may differ. You should not be short on computer memory. Also, snowfall is not the only interface to use multiple cores.

# load library snowfall and set up working directory
# to where the RData files are
library(snowfall)
working.dir <- "/path/to/dir/with/files"
setwd(working.dir)

# initiate (redneck jargon: and then she ate) workers and export
# working directory. Working directory could be hard coded into
# the function, rendering this step moot
sfInit(parallel = TRUE, cpus = 4, type = "SOCK")
sfExport(list = c("working.dir")) # you need to export all variables but x

# read filenames and step through each, returning only the
# desired object
lofs <- list.files(pattern = "\\.RData$") # escape the dot, match the extension only
inres <- sfSapply(x = lofs, fun = function(x, wd = working.dir) {
    setwd(wd)
    load(x)
    return(Dataset_of_interest)
  }, simplify = FALSE)
sfStop()

# you could post-process the data by rbinding, cbinding, cing...
result <- do.call("rbind", inres)
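
If parallel workers are not available, the same load-and-extract idea can be run sequentially. The following is only a minimal sketch, not part of the original answer: the helper names `load_one` and `combine_datasets` are hypothetical, and it assumes every file contains an object named `Dataset_of_interest` with the same columns, so that `rbind` works. Note that `load()` still reads the whole file from disk; loading into a throwaway environment only ensures that everything except the requested object is discarded afterwards.

```r
# Load one .RData file into a throwaway environment and return a single object.
# The whole file is still read; only the requested object is kept.
load_one <- function(path, name = "Dataset_of_interest") {
  env <- new.env()
  load(path, envir = env)
  if (!exists(name, envir = env, inherits = FALSE)) {
    stop("Object '", name, "' not found in ", path)
  }
  get(name, envir = env, inherits = FALSE)
}

# Apply load_one() to every .RData file in a directory and row-bind the pieces.
# Assumes all pieces have compatible columns.
combine_datasets <- function(dir, name = "Dataset_of_interest") {
  files <- list.files(dir, pattern = "\\.RData$", full.names = TRUE)
  pieces <- lapply(files, load_one, name = name)
  do.call(rbind, pieces)
}
```

Usage might look like `result <- combine_datasets("/path/to/dir/with/files")`, followed by `saveRDS(result, "combined.rds")` to write everything to one file (the output file name is just a placeholder).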
Roman Luštrik
  • After sfExport I get this error: 'Error in sfExport(c("working.dir")) : Unknown/unfound variable ..1 in export. (local=TRUE)' what should I do? – Maciej Aug 07 '12 at 14:34
  • @Maciej you should probably write `sfExport(list = "working.dir")`. An alternative would probably be `sfExport(working.dir)`. I usually use the first alternative. – Roman Luštrik Aug 08 '12 at 08:17