What are the advantages (speed, etc.) of placing data in a new environment (new.env()) in R?

For data such as time series, is a new environment analogous to a database?

My question stems from downloading asset prices in R, where it was suggested that I place them into a new environment. Why is this so? Thank you:

library(TTR)
library(quantmod)             # getSymbols()
library(FinancialInstrument)  # currency(), stock()
# join() and extract.table.from.webpage() are not part of these packages; they
# appear to come from the Systematic Investor Toolbox (SIT), loaded beforehand.

url <- 'http://www.nasdaq.com/markets/indices/nasdaq-100.aspx'
txt <- join(readLines(url))

# extract tables from this page
temp <- extract.table.from.webpage(txt, 'Symbol', hasHeader = TRUE)
temp[, 2]

# Symbols
symbols <- c(temp[, 2])[2:101]

currency("USD")
stock(symbols, currency = "USD", multiplier = 1)

# create new environment to store symbols
symEnv <- new.env()

# getSymbols and assign the symbols to the symEnv environment
getSymbols(symbols, from = '2002-09-01', to = '2013-10-17', env = symEnv)
Rhodo
  • http://stackoverflow.com/q/3470447/1412059 – Roland Nov 04 '13 at 16:26
  • Can you edit your question to add some detail or explanations of what circumstances led you to ask this? It would be best if you could phrase your question in a way that encouraged a single definitive answer, rather than a list of possibilities... – joran Nov 04 '13 at 16:32

2 Answers

There are advantages to this if your data is large and you have to modify it by passing it through functions. When you send data.frames or vectors to functions that modify them, R will make a copy of the data before making changes to it. You'd then return the modified data from the function and overwrite the old data to complete the modification step.

If your data is large, copying it on every function call may add an undesirable amount of overhead. Environments provide a way around this, because functions handle them differently: an environment is passed by reference. If you pass an environment to a function and modify its contents, R operates directly on that environment without making a copy of it. So by putting your data in an environment and passing the environment to the function, instead of passing the data directly, you can avoid copying the large dataset.

# here I create a data.frame inside an environment and pass the environment
# to a function that modifies the data.
e <- new.env()
e$k <- data.frame(a=1:3)
f <- function(e) {e$k[1,1] <- 10}
f(e)
# you can see that the original data was changed.
e$k
   a
1 10
2  2
3  3

# alternatively, if I pass just the data.frame, the manipulations do not affect the 
# original data.
k <- data.frame(a=1:3)
f2 <- function(k) {k[1,1] <- 10}
f2(k)
k
  a
1 1
2 2
3 3
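
The same pattern extends to an environment that holds many objects, which is the situation in the question where getSymbols() stores one series per symbol. Below is a minimal sketch, not taken from the original post: the environment `prices`, the symbols "AAA" and "BBB", their values, and the helper `to_returns()` are all hypothetical stand-ins. It shows a function updating every series in place without the caller reassigning anything.

```r
# Hypothetical stand-in for the environment that getSymbols() would fill
prices <- new.env()
assign("AAA", c(10, 11, 12), envir = prices)
assign("BBB", c(20, 19, 21), envir = prices)

# Replace every series stored in `env` with its simple returns.
# The environment is passed by reference, so the caller sees the changes.
to_returns <- function(env) {
  for (name in ls(env)) {
    x <- get(name, envir = env)
    assign(name, diff(x) / head(x, -1), envir = env)
  }
  invisible(env)
}

to_returns(prices)
ls(prices)    # the same two names, now holding returns
prices$AAA
```

Looking objects up by name with ls() and get() is what makes an environment convenient as a container for a variable number of series.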
Matthew Plourde
  • If I add tracemem(e$k), and then call f(e)... seems that df is being copied, is it not? – ndr Nov 04 '13 at 16:32
  • Good answer. Note that @MatthewPlourde is being careful in his wording of "to functions that modify them"--functions that don't modify their arguments generally don't wind up copying the objects, thanks to delayed evaluation. Also, if you've got a `data.frame` and want this behavior, you're likely better off using the `data.table` package. Finally, if you have an arbitrary object and want this behavior (pass by reference), consider [Reference Classes](http://stackoverflow.com/questions/5137199/what-is-the-significance-of-the-new-reference-classes). – Ari B. Friedman Nov 04 '13 at 16:38
  • @Matthew great, thanks for confirming that. where would savings come from then? i'll post a comparison in an answer... – ndr Nov 04 '13 at 17:08
  • @andrei it shouldn't be copying anything, so I'm not sure why `tracemem` is showing something to the contrary. Unfortunately, I can't put my attention to investigating this at the moment. – Matthew Plourde Nov 04 '13 at 17:12

Let's compare two cases. With a new environment:

e <- new.env()
e$k <- data.frame(a=1:1000000)
f <- function(e) {e$k[1,1] <- 10}
system.time({
    for(i in 1:1000) f(e)
})
head(e$k) 

  user  system elapsed 
  5.32    6.35   11.67 

Without a new environment:

k <- data.frame(a=1:1000000)
f <- function(e) {e[1,1] <- 10;return(e);}
system.time({
    for(i in 1:1000) k <- f(k)
}) 
  user  system elapsed 
  5.07    6.82   11.89

Not much of a difference...
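
One way to investigate why the timings are so close is the tracemem() check mentioned in the comments. The following is only a sketch: the small five-row data frame is an assumption made for brevity, tracemem() is available only when R was built with memory profiling (hence the capabilities("profmem") guard), and whether a copy is reported may depend on the R version.

```r
e <- new.env()
e$k <- data.frame(a = as.numeric(1:5))   # small stand-in for the large data frame

if (capabilities("profmem")) {
  tracemem(e$k)       # ask R to report duplications of this data frame
  e$k[1, 1] <- 10     # a "tracemem[...]" message here means a copy was made
  untracemem(e$k)
} else {
  e$k[1, 1] <- 10     # memory profiling unavailable: just do the assignment
}

e$k$a[1]
```

If a tracemem message appears during the assignment, the data frame stored in the environment was still duplicated, which would be consistent with the two timings above being nearly identical.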

ndr