Copy-on-modify semantics on a vector do not happen in a loop. Why?

Question

This question seems to be partially answered here, but the answer is not specific enough for me. I would like to understand better when an object is updated by reference and when it is copied.

The simplest example is growing a vector. The following code is very inefficient in R because the memory is not allocated before the loop and a copy is made at each iteration.

  x = runif(10)
  y = c() 

  for(i in 2:length(x))
    y = c(y, x[i] - x[i-1])

Allocating the vector before the loop reserves the memory once, so it is not reallocated at each iteration. This code is therefore drastically faster, especially with long vectors.

  x = runif(10)
  y = numeric(length(x))

  for(i in 2:length(x))
    y[i] = x[i] - x[i-1]
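
To make the difference between the two loops above measurable, here is a minimal sketch that wraps each one in a function and times it with system.time(). The function names grow_diff and prealloc_diff and the vector length are assumptions made for this illustration; the timings depend on the machine and the R version, so none are quoted here.

```r
# Hypothetical helper names, used only for this sketch.
grow_diff <- function(x) {
  y <- c()
  for (i in 2:length(x)) {
    y <- c(y, x[i] - x[i - 1])  # c() builds a new, longer vector each time
  }
  y
}

prealloc_diff <- function(x) {
  y <- numeric(length(x))       # allocate once; y[1] stays 0
  for (i in 2:length(x)) {
    y[i] <- x[i] - x[i - 1]
  }
  y
}

set.seed(1)
x <- runif(1e4)

t_grow <- system.time(res_grow <- grow_diff(x))
t_pre  <- system.time(res_pre  <- prealloc_diff(x))

# Both loops compute the same differences; the preallocated
# result only has an extra leading 0.
stopifnot(isTRUE(all.equal(res_grow, res_pre[-1])))

print(c(grow = t_grow[["elapsed"]], prealloc = t_pre[["elapsed"]]))
```

Increasing the length of x should widen the gap, since the growing version copies an ever longer vector at each step.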

And here comes my question. When a vector is updated, it actually does move: a copy is made, as shown below.

a = 1:10
pryr::tracemem(a)
[1] "<0xf34a268>"
a[1] <- 0L
tracemem[0xf34a268 -> 0x4ab0c3f8]:
a[3] <-0L
tracemem[0x4ab0c3f8 -> 0xf2b0a48]:

But in a loop this copy does not occur:

y = numeric(length(x))
for(i in 2:length(x))
{
   y[i] = x[i] - x[i-1]
   print(address(y))
}

Gives

[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"
[1] "0xe849dc0"

I understand why code is slow or fast depending on its memory allocations, but I don't understand R's logic. Why and how, for the same statement, is the update made by reference in one case and by copy in the other? In the general case, how can we know what will happen?

"a" is integer and you are assigning a "double" (0); hence the copy made (an "integer" to "double" vector coercion) — alexis_laz, Jan 12 '18 at 19:18

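The coercion mentioned in the comment above can be checked with typeof(), independently of any address tracking. This is a small sketch, not part of the original thread: assigning an integer value (0L) keeps the vector integer, while assigning a double (0) converts the whole vector to double, which necessarily produces a new vector.

```r
a <- 1:10
typeof(a)                     # "integer"

a[1] <- 0L                    # integer value: the vector stays integer
type_after_int <- typeof(a)

a[3] <- 0                     # double value: the whole vector becomes double
type_after_dbl <- typeof(a)

print(c(type_after_int, type_after_dbl))
# [1] "integer" "double"
```
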
Mike H. · Answer 1 · 2018-01-12T18:58:37.463

This is covered in Hadley's Advanced R book. In it he says (paraphrasing here) that whenever 2 or more variables point to the same object, R will make a copy and then modify that copy. Before going into examples, one important note which is also mentioned in Hadley's book is that when you're using RStudio

the environment browser makes a reference to every object you create on the command line.

Given the behavior you observed, I'm assuming you're using RStudio. As we will see, this explains why there are actually 2 variables pointing to a instead of 1 as you might expect.

The function we'll use to check how many variables are pointing to an object is refs(). In the first example you posted you can see:

library(pryr)
a = 1:10
refs(a)
#[1] 2

This suggests (which is what you found) that 2 variables are pointing to a and thus any modification to a will result in R copying it, then modifying that copy.
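
The same copy-on-modify rule can be reproduced without RStudio by creating the second reference explicitly. The sketch below uses only base R; the variable b is introduced here for illustration. tracemem() needs an R build with memory profiling support, so the calls are guarded with capabilities("profmem"); when it is available, a tracemem line is expected at the modification, but the exact output varies between R versions.

```r
a <- c(1, 2, 3)
b <- a                        # a and b now refer to the same vector

# tracemem() reports copies of the traced object, if supported.
if (capabilities("profmem")) tracemem(a)

a[1] <- 10                    # the vector is shared, so R modifies a copy

if (capabilities("profmem")) untracemem(a)

print(a)                      # modified copy
print(b)                      # b still holds the original values
```

Because b is untouched after the assignment, the modification must have been applied to a copy rather than to the shared vector.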

Checking the for loop we can see that y always has the same address and that refs(y) = 1 in the for loop. y is not copied because there are no other references pointing to y in your function y[i] = x[i] - x[i-1]:

for(i in 2:length(x))
{
  y[i] = x[i] - x[i-1]
  print(c(address(y), refs(y)))
}

#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"         
#[1] "0x19c3a230" "1"

On the other hand, if you introduce a non-primitive function of y in your for loop, you will see that the address of y changes each time, which is more in line with what we would expect:

is.primitive(lag)
#[1] FALSE

for(i in 2:length(x))
{
  y[i] = lag(y)[i]
  print(c(address(y), refs(y)))
}

#[1] "0x19b31600" "1"         
#[1] "0x19b31948" "1"         
#[1] "0x19b2f4a8" "1"         
#[1] "0x19b2d2f8" "1"         
#[1] "0x19b299d0" "1"         
#[1] "0x19b1bf58" "1"         
#[1] "0x19ae2370" "1"         
#[1] "0x19a649e8" "1"         
#[1] "0x198cccf0" "1"

Note the emphasis on non-primitive. If your function of y is primitive, such as the subtraction operator in y[i] = y[i] - y[i-1], R can optimize this to avoid copying.

Credit to @duckmayr for helping explain the behavior inside the for loop.

This might have something to do with the value of the `named` field of the underlying object being set to 1, as in the situation described in [Section 1.1.2 of the R Internals Manual](https://cran.r-project.org/doc/manuals/r-patched/R-ints.html#Rest-of-header), "where in principle two copies of [a variable] exist for the duration of the computation... *but for no longer*" (emphasis added). — duckmayr, Jan 12 '18 at 17:24
@duckmayr, I think you're right. If I use a non-primitive function it copies then modifies — Mike H., Jan 12 '18 at 17:32
Ok I understand the point related to Rstudio that create two variables pointing to the same object. However I don't understand the loop. If `=` is a primitive that can be optimized then `y[i] = lag(y)[i]` should also be optimized? — JRR, Jan 12 '18 at 17:58
@JRR good catch, I'm assuming it's because - is a primitive. Just updated my answer — Mike H., Jan 12 '18 at 18:04
@MikeH. I think I got it. You wrote `lag(y)` not `lag(x)`. `lag(y)` create a second ref into `lag`. At the update time there are 2 refs. But at the print time there is only 1 ref remaining since the `y` into `lag` does not exist anymore. If you change your code for `lag(x)[i]` there are no longer any copy. The loop is optimized and Rstudio does not create extra refs in the loop. Is that possible? — JRR, Jan 12 '18 at 18:22
@MikeH. also I ran the code in a terminal without Rstudio and `a[1] <- 2`created a copy... even with `refs(a)` equal to 1 — JRR, Jan 12 '18 at 18:25
@JRR, apologies again for the update. I think the way it works is that your original for loop did not create an extra reference to `y` so it was not copied. If you use a non-primitive like `lag(y)` this creates an extra reference to `y` so it is copied. On the other hand if you use a primitive like `y[i] - y[i-1]`, this is optimized and does not copy `y`. — Mike H., Jan 12 '18 at 19:01

score 1 · Accepted Answer · answered Jan 12 '18 at 20:42

To complete @MikeH.'s answer, here is some code:

library(pryr)

x = runif(10)
y = numeric(length(x))
print(c(address(y), refs(y)))

for(i in 2:length(x))
{
  y[i] = x[i] - x[i-1]
  print(c(address(y), refs(y)))
}

print(c(address(y), refs(y)))

The output clearly shows what happened:

[1] "0x7872180" "2"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1"        
[1] "0x765b860" "1" 
[1] "0x765b860" "2"

There is a copy at the first iteration because, with RStudio, there are 2 refs. After this first copy, y lives inside the loop and is not exposed to the global environment, so RStudio does not create any additional reference and no copy is made during the following updates: y is updated by reference. When the loop exits, y becomes available in the global environment again and RStudio creates an extra reference (refs is back to 2), but this does not change the address.
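
One way to test this explanation is to run the loop inside a function, where y is a local variable. The sketch below assumes, following the passage quoted from Advanced R, that the environment browser only references objects created on the command line. The function name diff_in_function is hypothetical, and tracemem() is guarded because it requires memory profiling support. Under that assumption no tracemem line should appear while the loop runs, but this has not been verified across R versions or front ends.

```r
# Hypothetical helper, named only for this sketch.
diff_in_function <- function(x) {
  y <- numeric(length(x))         # y is local to this call
  traced <- capabilities("profmem")
  if (traced) tracemem(y)
  for (i in 2:length(x)) {
    y[i] <- x[i] - x[i - 1]       # a copy of y here would print a tracemem line
  }
  if (traced) untracemem(y)
  y
}

x <- runif(10)
res <- diff_in_function(x)
print(res)
```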
