
SAS has a way of creating a library (using LIBNAME). This is helpful for long data-processing jobs, because we don't have to keep changing dataset names. If we want to reuse a dataset name, we can put the dataset in a different library. Even if the dataset names are the same, they sit in different libraries, so we can work on them together.

My question: is there any such option in R to create a library (or a separate folder within R) where we can save our data?

Here's the example:

Suppose I have a dataset "dat1". I summarize variables var1 and var2 in dat1 by var3.

proc summary data=dat1 nway missing;
  var var1 var2;
  class var3;
  output out=tmp.dat1 (drop = _freq_ _type_) sum = ;
  run;

Then I merged dat1 with dat2, which is another dataset. Both dat1 and dat2 have a common variable, var3, on which I merged. I named the new dataset dat1 again.

proc sql;
   create table dat1 as
   select a.*,b.*
   from dat1 a left join tmp.dat2 b
   on a.var3=b.var3;
  quit;

Now I summarize dat1 again after merging, to check whether the values of var1 and var2 remain the same before and after the merge.

proc summary data=dat1 nway missing;
  var var1 var2;
  class var3;
  output out=tmp1.dat1 (drop = _freq_ _type_) sum = ;
  run;

The equivalent code in R would be:

library(plyr)   # provides ddply() and summarise()
library(sqldf)  # provides sqldf()

dat3 <- ddply(dat1,
              .(var3),
              summarise,
              var1 = sum(var1,na.rm=TRUE),
              var2 = sum(var2,na.rm=TRUE))

dat1 <- sqldf("select a.*,b.* 
                 from dat1 a 
                      left join dat2 b 
                             on a.var3=b.var3")

dat4 <- ddply(dat1,
              .(var3),
              summarise,
              var1 = sum(var1,na.rm=TRUE),
              var2 = sum(var2,na.rm=TRUE))

In SAS I used just 2 dataset names, but in R I'm using 4. If I'm writing 4000 lines of data-processing code, having too many dataset names sometimes becomes overwhelming. In SAS it is easy to keep the same dataset name, because I'm using 2 libraries, tmp and tmp1, in addition to the default work library.

In SAS, library is defined as:

LIBNAME tmp "directory_path\folder_name";

In this folder, dat1 will be stored.

Beta
  • This question may make sense to a SAS user, but it makes no sense to the rest of us. Why don't you explain what you want out of R and how the current way you do things is lacking? Perhaps with a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? – Ari B. Friedman Oct 14 '12 at 10:57
  • To save datasets please see `?save` (`?load` to load them). – sgibb Oct 14 '12 at 11:10
  • Your problem is you are writing 4000 line scripts. This may not be a problem in SAS where anything over five lines is confusing already, but in R you should never write anything more than about ten lines without thinking "hey, this should be wrapped up in a function." – Spacedman Oct 14 '12 at 13:04
  • Thanks, Spacedman, for your comment. My problem is not with writing a 4K-line script. If you look at the example, I used around 20 lines in SAS, but did the same thing in R in just 3 lines. So R is more efficient for writing code. But in R I have to define more dataset names, which I don't have to do in SAS because it has the library option. I just want something equivalent to a library. – Beta Oct 14 '12 at 14:03
  • I think that the kind of grouping you want is provided by a `list`. – Paul Hiemstra Oct 14 '12 at 14:37
  • Not exactly, Paul. When you define a library (using libname), the datasets associated with that library are stored in that specific folder rather than in SAS. In R, a dataset you create is first stored in the R environment. Then, when you save it using the save statement you mentioned, it is stored in a particular folder. I can find only one question about libname, from back in 2009, but it's not exactly my problem. Putting the link here: http://r.789695.n4.nabble.com/libname-version-in-R-td899405.html – Beta Oct 14 '12 at 14:49
  • Look, don't take this the wrong way, but frequently when switching languages, the answer to "how do I do X in language Y" is "You don't". Aside from the good options already provided, if you wrap calculations like these in functions, you can return only the data sets you want, and any temp data sets you create won't clutter your workspace. – joran Oct 14 '12 at 17:05
  • Thanks, Joran, for your answer. And I didn't take your comment the wrong way. Actually I'm very indebted to a lot of you guys, who not only help me solve my problems from time to time but also help me learn R. Unless the language or comment is too offensive, you guys are always awesome to me. The purpose of posting this problem is that the "Library" option in SAS is actually very helpful. I gave one reason for having this option; there are many more. So I was wondering if R also has this option. If it doesn't, that in no way impairs my R programs. – Beta Oct 14 '12 at 17:11
  • @user697363 what you want exactly is in a state of flux. If you want an exact answer please provide an exact question. What _exactly_, e.g. in a list, do you want in terms of behavior in R. That way we can provide accurate advice. – Paul Hiemstra Oct 14 '12 at 21:37
  • It sounds like you might want to work with different, named environments. – Glen_b Oct 15 '12 at 01:07
  • @PaulHiemstra: I actually put the example of what I'm looking for. Both "Save" & "list" option does not help this problem. – Beta Oct 16 '12 at 07:33
  • @Glen_b: Something like that. – Beta Oct 16 '12 at 07:33
  • @user697363 Just in case I was unclear, that's an explicit way to tackle that sort of issue in R -- using named environments. – Glen_b Oct 16 '12 at 08:53
  • @Glen_b: Could you please explain it, or put it in an answer? – Beta Oct 16 '12 at 15:01
  • To play Devil's advocate with the number of datasets you have to create between R and SAS, you aren't storing equivalent things with SAS and R. In SAS, you are not storing the results (via, say, `ODS OUTPUT`) of the `proc summary` steps where in R you are saving the results of the `ddply` calls. So of course R is using more variable names. – Brian Diggs Oct 16 '12 at 22:25

4 Answers


From what I can gather from the SAS online help, a SAS library is a set of datasets that is stored in a folder and can be referenced as a unit. The equivalent in R would be to store the R objects you want to save using save:

save(obj1, obj2, etc, file = "stored_objects.rda")

Loading the objects can be done using load.

Edit: I don't really see why having an additional object or two is so much of a problem. However, if you want to reduce the number of objects, just put your results in a list.
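
A minimal sketch of the list idea, applied to the before/after check from the question. The data and the names `results`, `before`, `after` and `same` are made up for illustration, and base R's `aggregate()` and `merge()` stand in for the `ddply` and `sqldf` calls so that no extra packages are needed.

```r
set.seed(1)
dat1 <- data.frame(var1 = rnorm(20),
                   var2 = sample(20, replace = TRUE),
                   var3 = sample(letters[1:3], 20, replace = TRUE))
# hypothetical lookup table: one row per var3 value, so the left join
# cannot duplicate rows of dat1
dat2 <- data.frame(var3 = letters[1:3], var4 = 1:3)

# one list holds both summaries; only its elements need names
results <- list()
results$before <- aggregate(cbind(var1, var2) ~ var3, data = dat1, FUN = sum)

# overwrite dat1 with the merged version, as in the question
dat1 <- merge(dat1, dat2, by = "var3", all.x = TRUE)

results$after <- aggregate(cbind(var1, var2) ~ var3, data = dat1, FUN = sum)

# compare the sums before and after the merge (with numeric tolerance)
same <- isTRUE(all.equal(results$before, results$after))
```

The workspace then contains `dat1`, `dat2` and a single `results` object instead of one top-level name per intermediate summary.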

Paul Hiemstra
  • Thanks for your answer, Paul. But sorry to say that it's not what I'm looking for. In my case obj1 and obj2 have the same name, but the folder where each is saved is different. So even though the dataset name is the same, I can use them as different datasets because of the "library" option. – Beta Oct 14 '12 at 14:05
  • Looking at this from a SAS programmers perspective, Paul's answer seems correct. Even though SAS can access objects across different libraries at the same time, and R only works across a current workspace, you should be able to accomplish what you need to do by saving and loading the appropriate objects from the different workspaces when needed. Note: This only applies to load/save objects, not images, which would of course wipe out your current workspace. – Ralph Winters Oct 14 '12 at 21:22
  • @RalphWinters does a SAS library also provide automatic out-of-memory calculations if the datasize exceeds RAM? – Paul Hiemstra Oct 14 '12 at 21:38
  • @PaulHiemstra - Yes. No need to keep everything in core. Using multiple SAS library is roughly equivalent to working with multiple workspaces at the same time. – Ralph Winters Oct 15 '12 at 00:39
  • @RalphWinters, you could probably emulate this behavior using environment combined with `save` and `load`. – Paul Hiemstra Oct 15 '12 at 07:07
  • @RalphWinters: Paul's answer is not exactly what I was looking for. If I need to save a file, I can always save it using "write". I guess you understand what a library in SAS is used for. I'm looking for exactly the same functionality as a library. If that's not there in R, then it's fine. – Beta Oct 16 '12 at 07:29
  • @user697363 - There is no direct equivalent, just a workaround – Ralph Winters Nov 10 '12 at 17:39

There are two separate aspects of SAS's libraries which (it seems) you are interested in.

  • Specification of the directory in which data files are stored
  • Ability to easily point an analysis to a different set of identically named datasets by just specifying the different location

Taking these in that order.

The problem with answering the first is that R and SAS have different models for how data is stored. R stores data in memory, organized in environments arranged in a particular search order. SAS stores data on disk and the names of datasets correspond to file names within a specified directory (there likely is caching in memory for optimization, but conceptually this is how data is stored). R can store (sets of) objects in a file on disk using save() and bring them back into memory using load(). The filename and directory can be specified in those function calls (hence Paul's answer). You could have several .RData files, each containing objects named dat1, dat2, etc. which can be loaded prior to running an analysis and the results can be written out to (other) .RData files.
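
As a rough illustration of the "several .RData files" idea: the sketch below saves two different objects that are both called `dat1` into two files and loads each into its own environment. The file names `tmp.RData` and `tmp1.RData`, the use of `tempdir()` and the toy data are assumptions made for the example, not anything from the question.

```r
# two files playing the role of the SAS libraries tmp and tmp1
# (hypothetical locations; any directory would do)
lib_tmp  <- file.path(tempdir(), "tmp.RData")
lib_tmp1 <- file.path(tempdir(), "tmp1.RData")

# first version of dat1, written to the first file
dat1 <- data.frame(var3 = c("a", "b"), var1 = c(1, 2))
save(dat1, file = lib_tmp)

# second version, same object name, written to the second file
dat1 <- data.frame(var3 = c("a", "b"), var1 = c(10, 20))
save(dat1, file = lib_tmp1)

rm(dat1)

# load each file into its own environment so the names do not collide
tmp  <- new.env()
tmp1 <- new.env()
load(lib_tmp,  envir = tmp)
load(lib_tmp1, envir = tmp1)

tmp$dat1$var1
tmp1$dat1$var1
```

Loading with `envir =` is what keeps the two `dat1` objects apart; a plain `load()` into the global environment would let the second file overwrite the first.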

An alternative to this would be using one of the extensions which give data types which are backed by disk storage instead of memory. I've not had experience with any of them to talk about how well they would work in this situation, but that is an option. [Edit: mnel's answer has a detailed example of just this idea.]

Your second part can be approached in different ways. Since R uses in-memory data, the answers focus on arranging different environments (each of which can contain different but identically named data sets) and controlling which one gets accessed by attach()ing and detach()ing the environments from the search path (which is what Glen_b's answer is getting at). You still don't have the disk backing of the data, but that is the previous problem.
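
A small sketch of the attach()/detach() approach, with made-up data. The search-path labels `lib_tmp` and `lib_tmp1` are arbitrary names chosen for the example, and it assumes there is no `dat1` in the global environment, since that one would mask the attached copy.

```r
# two environments, each holding its own dat1
tmp  <- new.env()
tmp1 <- new.env()
tmp$dat1  <- data.frame(var3 = c("a", "b"), var1 = c(1, 2))
tmp1$dat1 <- data.frame(var3 = c("a", "b"), var1 = c(10, 20))

# attach() puts a copy of the environment's objects on the search path,
# so the bare name dat1 now resolves to the version from tmp
attach(tmp, name = "lib_tmp")
first <- sum(dat1$var1)
detach("lib_tmp")

# same analysis code, different "library"
attach(tmp1, name = "lib_tmp1")
second <- sum(dat1$var1)
detach("lib_tmp1")
```

Because attach() copies, later changes to `tmp` or `tmp1` are not visible through the attached copy, and assignments to `dat1` while attached go to the global environment rather than back into the "library".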

Finally, @joran's admonition is relevant. The solution to the problem of performing a set of tasks on potentially different (but related) sets of data in R is to write a function to do the work. The function has parameters. Within the function, the parameters are referred to by the names given in the argument list. When the function is called, the particular set of data sent to it is specified by the function call; the names inside and outside the function need not have anything to do with each other. The suggestions about storing the multiple sets of data in a list are implicitly approaching the problem this way; the function is called for each set of data in the list in turn. Names don't matter, then.
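
To make the function-plus-list point concrete, here is a short sketch. The function name `summarise_by_var3`, the list `datasets` and the toy data are invented for illustration; `aggregate()` from base R is used in place of `ddply`.

```r
# the function only knows its argument as `dat`, whatever the caller's name is
summarise_by_var3 <- function(dat) {
  aggregate(cbind(var1, var2) ~ var3, data = dat, FUN = sum)
}

# two related data sets kept in one named list
datasets <- list(
  tmp  = data.frame(var1 = c(1, 2, 3),
                    var2 = c(4, 5, 6),
                    var3 = c("a", "a", "b")),
  tmp1 = data.frame(var1 = c(10, 20, 30),
                    var2 = c(40, 50, 60),
                    var3 = c("a", "b", "b"))
)

# apply the same summary to each data set in turn
summaries <- lapply(datasets, summarise_by_var3)

summaries$tmp
summaries$tmp1
```

The result is again a list with the same element names, so `summaries$tmp` and `summaries$tmp1` play roughly the role of `tmp.dat1` and `tmp1.dat1` in the SAS code.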

Brian Diggs

Here is an example using the SOAR package and named environments

To quote from the vignette

Objects need not be always held in memory. The function save may be used to save objects on the disc in a file, typically with an .RData extension. The objects may then be removed from memory and later recalled explicitly with the load function.

The SOAR package provides simple way to store objects on the disc, but in such a way that they remain visible on the search path as promises, that is, if and when an object is needed again it is automatically loaded into memory. It uses the same lazy loading mechanism as packages, but the functionality provided here is more dynamic and flexible

It will be useful to read the whole vignette

library(SOAR)
library(plyr)
library(sqldf)
set.seed(1)

# create a named environment, then some dummy data
tmp <- new.env(parent = .GlobalEnv)
dat1 <- data.frame(var1 = rnorm(50),
                   var2 = sample(50, replace = TRUE),
                   var3 = sample(letters[1:5], 50, replace = TRUE))

tmp$dat1 <- ddply(dat1, .(var3), summarise,
                  var1 = sum(var1, na.rm = TRUE), 
                  var2 = sum(var2, na.rm = TRUE))

tmp$dat2 <- data.frame(Var3 = sample(letters[1:5], 20, replace = TRUE), 
                       Var4 = 1:20)

# store as a SOAR cached object (on disc)
Store(tmp, lib = "tmp")

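# build a separate environment for sqldf holding the right versions of dat1
# and dat2, then replace dat1 in the global environment with the merged result.
# Environments are not copied on assignment, so copy tmp's contents explicitly;
# otherwise the next line would also overwrite tmp$dat1.
sqlenv <- as.environment(as.list(tmp))
sqlenv$dat1 <- dat1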

dat1 <- sqldf("select a.*,b.* from dat1 a left join dat2 b on a.var3=b.var3", 
              envir = sqlenv)

# create a new named environment tmp1
tmp1 <- new.env(parent = .GlobalEnv)

tmp1$dat1 <- ddply(dat1, .(var3), summarise, 
                   var1 = sum(var1, na.rm = TRUE), 
                   var2 = sum(var2, na.rm = TRUE))

# store using a SOAR cache
Store(tmp1, lib = "tmp")


tmp1$dat1

##   var3   var1 var2
## 1    a  1.336  378
## 2    b  8.514 1974
## 3    c  5.795  624
## 4    d -8.828  936
## 5    e 20.846 1490

tmp$dat1

##   var3    var1 var2
## 1    a  0.4454  126
## 2    b  1.4190  329
## 3    c  1.9316  208
## 4    d -2.9427  312
## 5    e  4.1691  298

I'm not sure you should expect tmp1$dat1 and tmp$dat1 to be identical (given my example, anyway).

mnel
  • This one really helped. The library option in SAS is one aspect I could not replicate in R until now, but your answer removed that barrier as well. Thanks a lot! – Beta Oct 17 '12 at 14:00

Named environments are one of a number of ways of achieving what it sounds like you want.

Personally, if there aren't a lot of different data frames or lists, I'd lean toward organizing it other ways, such as inside either data frames or lists, depending on how your data is structured. But if each thing consists of many different kinds of data and functions, environments may be significantly better. They're described in the help, and a number of posts to r-blogs discuss them.
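
For reference, a bare-bones sketch of working with named environments directly, without attaching them. The container `libs` and the toy data are hypothetical; `new.env()`, `assign()`, `ls()`, `with()` and `evalq()` are all base R.

```r
# a list of environments, one per "library"
libs <- list(tmp = new.env(), tmp1 = new.env())

# the same object name can live in each environment independently
assign("dat1", data.frame(x = 1:3), envir = libs$tmp)
assign("dat1", data.frame(x = 4:6), envir = libs$tmp1)

# list what a given environment contains
ls(libs$tmp)

# evaluate the same expression against either environment
total_tmp  <- with(libs$tmp, sum(dat1$x))
total_tmp1 <- evalq(sum(dat1$x), libs$tmp1)
```

Unlike a list of data frames, an environment is modified in place when it is passed to a function, which is convenient for this use but easy to trip over.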

But on reflection, RStudio projects may be closer to the way you're thinking about the problem (and if you're not using RStudio already, I highly recommend it). Have a look at how projects work.

Glen_b
  • Thanks a lot. Sorry for the delayed reply; I got occupied. I'm actually using RStudio, but for some reason I never explored the project option. I'll do that. It might give me an additional benefit on top of mnel's answer. Nice to see that your point about named environments finally gave me an answer. :) – Beta Oct 17 '12 at 14:05