
I am interested in listing objects in an RDATA file and loading only selected objects, rather than the whole set (in case some may be big or may already exist in the environment). I'm not quite clear on how to do this when there are conflicts in names, as attach() doesn't work as nicely.

1: For examining the contents of an R data file without loading it: This question is similar, but different from, the one asked at listing contents of an R data file without loading

In that case, the solution offered was:

attach(filename)
ls(pos = 2)
detach()

If there are naming conflicts between objects in the file and those in the global environment, this warning appears: The following object(s) are masked _by_ '.GlobalEnv':

I tried creating a new environment, but I cannot seem to attach into that. For instance, this produces the same warning:

lsfile   <- function(filename){
  tmpEnv <- new.env()
  evalq(attach(filename), envir = tmpEnv)
  tmpls <- ls(pos = 2)
  detach()
  return(tmpls)
}
lsfile(filename)

Maybe I've made a mess of things with evalq (or eval). Is there some other way to avoid the naming conflict?

2: If I want to access an object - if there are no naming conflicts, I can just work with the one from the .rdat file, or copy it to a new one. If there are conflicts, how does one access the object in the file's namespace?

For instance, if my file is "sample.rdat", and the object is surveyData, and a surveyData object already exists in the global environment, then how can I access the one from the file:sample.rdat namespace?

I currently solve this problem by loading everything into a temporary environment, and then copy out what's needed, but this is inefficient.

Iterator
    I've seen the request come up on r-devel and there did not seem to be an alternative to loading the full .RData file. – IRTFM Jul 01 '11 at 19:23

4 Answers


Since this question has just been referenced let's clarify two things:

  1. attach() simply calls load() so there is really no point in using it instead of load

  2. if you want selective access to prevent masking it's much easier to simply load the file into a new environment:

    e = local({load("foo.RData"); environment()})
    

    You can then use ls(e) and access contents like e$x. You can still use attach on the environment if you really want it on the search path.
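As a minimal sketch of the "load into a new environment, then take only what you need" idea above: the helper name `load_selected` and the demo objects are hypothetical, introduced only for illustration. Note that the whole file is still read; only the copying out is selective.

```r
# Hypothetical helper: load an .RData file into a private environment and
# return only the requested objects as a named list.
load_selected <- function(file, names) {
  e <- new.env(parent = emptyenv())
  load(file, envir = e)                      # the whole file is still read
  absent <- setdiff(names, ls(e, all.names = TRUE))
  if (length(absent) > 0) {
    stop("not found in ", file, ": ", paste(absent, collapse = ", "))
  }
  mget(names, envir = e)
}

# Demo with a throwaway file; the global surveyData is left untouched.
demo_file <- tempfile(fileext = ".RData")
local({
  surveyData <- data.frame(id = 1:3)
  other <- "not needed"
  save(surveyData, other, file = demo_file)
})
surveyData <- "already in the global environment"
from_file <- load_selected(demo_file, "surveyData")
str(from_file$surveyData)
```

Because the private environment is discarded when the function returns, nothing from the file can mask or be masked by objects in the global environment.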

FWIW, .RData files have no index (the objects are stored in one big pairlist), so you can't list the contained objects without loading. If you want convenient access, convert the file to the lazy-load format instead, which simply adds an index so that each object can be loaded separately (see Get specific object from Rdata file).
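A rough sketch of the one-time conversion to lazy-load format might look like the following. It assumes the unexported helper `tools:::makeLazyLoadDB`, which is internal to R and may change between versions, together with base R's `lazyLoad()`; the file names and objects are made up for the demo.

```r
# Build a throwaway .RData file to convert.
rdata_file <- tempfile(fileext = ".RData")
local({
  x <- 1:10
  y <- letters
  save(x, y, file = rdata_file)
})

# One-time conversion: load everything once, then write an indexed database.
e <- local({load(rdata_file); environment()})
db <- tempfile()                      # filebase; produces db.rdb and db.rdx
tools:::makeLazyLoadDB(e, db)         # assumption: internal, unexported API

# Later sessions: create promises only for the objects of interest.
lazy <- new.env()
lazyLoad(db, envir = lazy, filter = function(nm) nm == "x")
ls(lazy)
lazy$x
```

The `filter` function receives the object names stored in the index and decides which ones get a promise, so objects that are never touched are never read from disk.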

Simon Urbanek
    First, let me join the chorus welcoming you to SO! Second, the fact that `attach()` calls `load()` is surprising to me. Although one goal was to use `attach()`, when I accepted the other answer, I had a test case (which I don't have on hand at the moment) where it seemed that `attach()` was substantially faster than `load()`. OTOH, I realize now that I may have been misled by disk caching. Curses. I will need to revisit that test case to see what happened. – Iterator Jan 02 '12 at 17:04
    Heh, thanks - I saw links to misinformation on SO (I don't mean this post) I thought I'll keep an eye on it ;). Re `attach`, I didn't check the speed, I just looked at the current R code for it :P – Simon Urbanek Jan 03 '12 at 02:08

I just use an env= argument to load():

> x <- 1; y <- 2; z <- "foo"
> save(x, y, z, file="/tmp/foo.RData")
> ne <- new.env()
> load(file="/tmp/foo.RData", env=ne)
> ls(env=ne)
[1] "x" "y" "z"
> ne$z
[1] "foo"
> 

The cost of this approach is that you do read the whole RData file---but on the other hand that seems to be unavoidable anyway as no other method seems to offer a list of the 'content' of such a file.
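For the listing part of the question, the same idea can be wrapped in a small function. This is only a sketch: the name `ls_rdata` is hypothetical, and it relies on `load()` returning (invisibly) the names of the objects it created. The file is still read in full, but nothing touches the search path or the global environment.

```r
# Hypothetical helper: list the objects stored in an .RData file by loading
# it into a scratch environment that is thrown away afterwards.
ls_rdata <- function(file) {
  load(file, envir = new.env(parent = emptyenv()))
}

f <- tempfile(fileext = ".RData")
local({
  x <- 1; y <- 2; z <- "foo"
  save(x, y, z, file = f)
})

x <- "global x"            # a conflicting name in the global environment
contents <- ls_rdata(f)    # no masking warning, global x is untouched
print(contents)
```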

Dirk Eddelbuettel
  • And if you hate environment clutter, `rm(ne)` can be used to destroy the new environment once you have copied the stuff that you're interested in into the global env. – Paul Lemmens May 25 '16 at 08:32

You can suppress the warning by setting warn.conflicts=FALSE on the call to attach. If an object is masked by one in the global environment, you can use get to retrieve it from your attached data.

x <- 1:10
save(x, file="x.rData")
#attach("x.rData", pos=2, warn.conflicts=FALSE)
attach("x.rData", pos=2)
(x <- 1)
# [1] 1
(x <- get("x", pos=2))
# [1]  1  2  3  4  5  6  7  8  9 10
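A variation on the same approach, sketched under the assumption that you would rather refer to the attached file by name than by position: `attach()` accepts a `name` argument, and `get()` and `detach()` can then use that name. The label "file:demo" and the temporary file are made up for this example.

```r
f <- tempfile(fileext = ".RData")
local({
  x <- 1:10
  save(x, file = f)
})

x <- 1                                     # conflicting global object
attach(f, name = "file:demo", warn.conflicts = FALSE)
from_file <- get("x", pos = "file:demo")   # the copy from the file, not the global x
detach("file:demo", character.only = TRUE) # clean up the search path
from_file
```

Addressing the entry by name avoids relying on it staying at position 2 if other things get attached in the meantime.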
Joshua Ulrich
  • Following up on a comment left below the answer by @SimonUrbanek, it seems I erred in thinking that `attach()` doesn't call `load()` (unless R has changed :)). That error was due to lack of inspection (`Rprof()` makes that clear...), and super fast loading times (probably due to disk caching). Still, this answers the second part of my question. It seems that the only way to avoid loading on a per use basis is to make a one-time conversion to lazy load format, as Simon points out. – Iterator Jan 02 '12 at 17:17

Thanks to @Dirk and @Joshua.

I had an epiphany. The foreach package, used with the SMP or MC backend, seems to produce worker environments that inherit from the global environment but do not seem to conflict with it.

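library(foreach)  # a parallel backend (e.g. doMC) must be registered for %dopar%

lsfile <- function(list_files) {
  # attach each file on a worker and list what landed at position 2
  foreach(ix = seq_along(list_files)) %dopar% {
    attach(list_files[ix])
    ls(pos = 2)
  }
}

lsfile("f1.rdat")
lsfile(dir(pattern = "\\.rdat$"))  # pattern is a regular expression, not a glob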

This is useful to me because I can now parallelize this. This is a bare-bones version, and I will modify it to give more detailed information, but so far it seems to be the only way to avoid conflicts without having to suppress the warnings.

So, question #1 can be resolved by either ignoring the warnings (as @Joshua suggested) or by using whatever magic foreach summons.

For part 2, loading an object, I think @Joshua has the right idea - "get" will do.

The foreach magic can also work, by using the .noexport option. However, this has risks: whatever isn't specifically excluded will be inherited/exported from the global environment (I could do ls(), but there's always the possibility of attached datasets). For safety, this means that get() must still be used to avoid the risk of a naming conflict. Loading into a subenvironment avoids the naming conflict, but doesn't avoid the loading of unnecessary objects.

@Joshua's answer is far simpler than my foreach detour.

Iterator