
This question seems to be partially answered here, but that answer is not specific enough for me. I would like to better understand when an object is updated by reference and when it is copied.

The simplest example is growing a vector. The following code is very inefficient in R because the memory is not allocated before the loop, so a copy is made at each iteration.

  x = runif(10)
  y = c() 

  for(i in 2:length(x))
    y = c(y, x[i] - x[i-1])

Preallocating the vector reserves the memory once instead of reallocating it at each iteration. This code is therefore drastically faster, especially with long vectors.

  x = runif(10)
  y = numeric(length(x))

  for(i in 2:length(x))
    y[i] = x[i] - x[i-1]
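
A minimal sketch for measuring the gap between the two loops above with base R's `system.time()`. The helper names `grow_vector` and `prealloc_vector` and the vector length are hypothetical, chosen only for illustration; actual timings depend on the machine and the R version.

```r
# Hypothetical helpers wrapping the two loops from the question.
grow_vector <- function(x) {
  y <- c()
  for (i in 2:length(x))
    y <- c(y, x[i] - x[i - 1])   # c() builds a new, longer vector each time
  y
}

prealloc_vector <- function(x) {
  y <- numeric(length(x))        # allocate once, before the loop
  for (i in 2:length(x))
    y[i] <- x[i] - x[i - 1]      # assign into the existing vector
  y
}

x <- runif(1e4)                  # assumed size, large enough to see a gap

system.time(y1 <- grow_vector(x))
system.time(y2 <- prealloc_vector(x))

# Both compute the same differences; the preallocated version keeps a
# leading 0 in position 1 because the loop starts at i = 2.
all.equal(y1, y2[-1])
```

The final check only confirms that the two versions agree on the computed differences; it says nothing about which one copies.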

And here comes my question. When a vector is updated, it actually does move: a copy is made, as shown below.

a = 1:10
tracemem(a)
[1] "<0xf34a268>"
a[1] <- 0L
tracemem[0xf34a268 -> 0x4ab0c3f8]:
a[3] <- 0L
tracemem[0x4ab0c3f8 -> 0xf2b0a48]:  

But in a loop this copy does not occur:

library(pryr)  # provides address()

y = numeric(length(x))
for(i in 2:length(x))
{
   y[i] = x[i] - x[i-1]
   print(address(y))  # memory address of y after each update
}

This gives:

[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0" 

I understand why code is slow or fast depending on its memory allocations, but I don't understand the logic R follows. Why and how, for the same statement, is the update made by reference in one case and by copy in the other? In the general case, how can we know what will happen?

JRR

2 Answers


This is covered in Hadley's Advanced R book. In it he says (paraphrasing here) that whenever 2 or more variables point to the same object, R will make a copy and then modify that copy. Before going into examples, one important note, also mentioned in Hadley's book, is that when you're using RStudio:

the environment browser makes a reference to every object you create on the command line.

Given your observed behavior, I'm assuming you're using RStudio. As we will see, this explains why there are actually 2 variables pointing to a instead of the 1 you might expect.

The function we'll use to check how many variables are pointing to an object is refs(). In the first example you posted you can see:

library(pryr)
a = 1:10
refs(a)
#[1] 2

This suggests (consistent with what you found) that 2 variables are pointing to a, so any modification to a will result in R copying it and then modifying that copy.
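
The rule paraphrased above can also be reproduced without RStudio by creating the second reference explicitly. This is a sketch using base R's `tracemem()`; the variable `b` is introduced here only for illustration, the `capabilities("profmem")` guard is there because `tracemem()` is only available in builds with memory profiling, and the printed addresses will differ on every run.

```r
a <- 1:10
b <- a                       # a second name now points to the same vector

# tracemem() prints a line whenever the traced object is copied.
if (capabilities("profmem")) tracemem(a)

a[1] <- 0L                   # two references exist, so R copies before modifying

if (capabilities("profmem")) untracemem(a)

a                            # the modified copy
b                            # still holds the original values
```

Because `b` keeps the original values, the modification of `a` cannot have happened in place.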

Checking the for loop, we can see that y always has the same address and that refs(y) = 1 inside the loop. y is not copied because there are no other references pointing to y in your expression y[i] = x[i] - x[i-1]:

for(i in 2:length(x))
{
  y[i] = x[i] - x[i-1]
  print(c(address(y), refs(y)))
}

#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1" 

On the other hand, if you introduce a non-primitive function of y in your for loop, you will see that the address of y changes each time, which is more in line with what we would expect:

is.primitive(lag)
#[1] FALSE

for(i in 2:length(x))
{
  y[i] = lag(y)[i]
  print(c(address(y), refs(y)))
}

#[1] "0x19b31600" "1"         
#[1] "0x19b31948" "1"         
#[1] "0x19b2f4a8" "1"         
#[1] "0x19b2d2f8" "1"         
#[1] "0x19b299d0" "1"         
#[1] "0x19b1bf58" "1"         
#[1] "0x19ae2370" "1"         
#[1] "0x19a649e8" "1"         
#[1] "0x198cccf0" "1"  

Note the emphasis on non-primitive. If your function of y is a primitive such as -, as in y[i] = y[i] - y[i-1], R can optimize this to avoid copying.

Credit to @duckmayr for helping explain the behavior inside the for loop.

Mike H.
  • This might have something to do with the value of the `named` field of the underlying object being set to 1, as in the situation described in [Section 1.1.2 of the R Internals Manual](https://cran.r-project.org/doc/manuals/r-patched/R-ints.html#Rest-of-header), "where in principle two copies of [a variable] exist for the duration of the computation... *but for no longer*" (emphasis added). – duckmayr Jan 12 '18 at 17:24
  • @duckmayr, I think you're right. If I use a non-primitive function it copies then modifies – Mike H. Jan 12 '18 at 17:32
  • OK, I understand the point about RStudio creating two variables pointing to the same object. However, I don't understand the loop. If `=` is a primitive that can be optimized, then shouldn't `y[i] = lag(y)[i]` also be optimized? – JRR Jan 12 '18 at 17:58
  • @JRR good catch, I'm assuming it's because - is a primitive. Just updated my answer – Mike H. Jan 12 '18 at 18:04
  • @MikeH. I think I got it. You wrote `lag(y)`, not `lag(x)`. `lag(y)` creates a second ref inside `lag`. At update time there are 2 refs, but at print time only 1 ref remains, since the `y` inside `lag` no longer exists. If you change your code to `lag(x)[i]`, there is no longer any copy. The loop is optimized and RStudio does not create extra refs in the loop. Is that possible? – JRR Jan 12 '18 at 18:22
  • @MikeH. Also, I ran the code in a terminal without RStudio and `a[1] <- 2` created a copy... even with `refs(a)` equal to 1 – JRR Jan 12 '18 at 18:25
  • @JRR, apologies again for the update. I think the way it works is that your original for loop did not create an extra reference to `y` so it was not copied. If you use a non-primitive like `lag(y)` this creates an extra reference to `y` so it is copied. On the other hand if you use a primitive like `y[i] - y[i-1]`, this is optimized and does not copy `y`. – Mike H. Jan 12 '18 at 19:01
  • @JRR, try `a[1] <- 2L` – Mike H. Jan 12 '18 at 19:16

I will complement @MikeH.'s answer with this code:

library(pryr)

x = runif(10)
y = numeric(length(x))
print(c(address(y), refs(y)))

for(i in 2:length(x))
{
  y[i] = x[i] - x[i-1]
  print(c(address(y), refs(y)))
}

print(c(address(y), refs(y)))

The output clearly shows what happened:

[1] "0x7872180" "2"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1" 
[1] "0x765b860" "2"  

There is a copy at the first iteration because, with RStudio, there are 2 refs. After this first copy, y belongs to the loop and is not available in the global environment, so RStudio does not create any additional refs and no copy is made during the subsequent updates: y is updated by reference. On loop exit, y becomes available in the global environment again. RStudio creates an extra ref, but this obviously does not change the address.
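
One way to probe this explanation is to run the same loop inside a function, where `y` is a local variable. This is only a sketch, under the assumption that the environment browser references only objects in the global environment. The function name `diff_loop` is hypothetical, and whether `tracemem()` reports any copy depends on the R version and the front end used.

```r
# Hypothetical wrapper: y lives in the function's own environment.
diff_loop <- function(x) {
  y <- numeric(length(x))
  # Trace copies of y, if this R build supports memory profiling.
  if (capabilities("profmem")) tracemem(y)
  for (i in 2:length(x))
    y[i] <- x[i] - x[i - 1]
  if (capabilities("profmem")) untracemem(y)
  y
}

x <- runif(10)
y <- diff_loop(x)
```

Any `tracemem[... -> ...]` line printed while the loop runs would indicate a copy of `y`; no such line means the updates happened in place.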

JRR