First bracketed assignment is as time-consuming as full assignment?

Question

Regarding this answer in: What exactly is copy-on-modify semantics in R, and where is the canonical source?

We can see that, at the first time a vector is altered with '[<-', R copies the entire vector even if only a single entry is to be modifed. At the second time, however, the vector is altered "in place". This is noticeable without inspecting the address of the objects if we measure the time to create and modify a large vector:

> system.time(a <- rep(1L, 10^8))
   user  system elapsed 
   0.15    0.17    0.31 
> system.time(a[222L] <- 111L)
   user  system elapsed 
   0.26    0.08    0.34 
> system.time(a[333L] <- 111L)
   user  system elapsed 
      0       0       0

Note that there is no change of type/storage.mode.

So the question is: why is it not possible to optimize the first bracket assignment as well? In what situation this kind of behaviour (full copy at first modification) is actually needed?

EDIT: (spoiler!) As explained in the accepted answer below, this is nothing but an artifact of enclosing the first assignment in a system.time function call. This causes R to mark the memory space bound to a as possibly referring to more than one symbol, thus requiring duplication when changed. If we remove the enclosing calls, the vector is modified in place from the very first bracket assignment.

Thanks Martin for in-depth solution!

Not answering your question per-se, but to add something to it: if you do `head(a)` before the last assignment, then you'd find that a copy is being made. Try this: `system.time(a <- rep(1L, 10^8))`; `tracemem(a)`; `system.time(a[222L] <- 111L)`; `head(a)`; `system.time(a[333L] <- 111L)` — Arun, May 07 '13 at 16:55

Martin Morgan · Accepted Answer · 2013-05-07T17:17:46.547

Compare the "NAM()" part of

> a <- rep(1L, 10)
> .Internal(inspect(a))
@457b840 13 INTSXP g0c4 [NAM(1)] (len=10, tl=0) 1,1,1,1,1,...

versus

> system.time(a <- rep(1L, 10))
[...]
> .Internal(inspect(a))
@4626f88 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,1,1,1,1,...

The "1" in the first example means that R thinks there is one reference to a, hence can be updated in place. The "2" means that R thinks there have been at least two references to a, hence duplication required if modified. Roughly, I rationalize this as the representation of the return value of rep() inside system.time, and its value outside system.time; the moral equivalent of f = function() { x <- rep(1L, 10); x }; a = f() rather than g = function() rep(1L, 10); a = g().

The real-world code a <- rep(1L, 10^8); a[123L] <- 231L would not involve a copy. We can time the assignment without artificially incrementing the NAMED count with

> a <- rep(1L, 10^8)
> .Internal(inspect(a))
@7f972b571010 13 INTSXP g0c7 [NAM(1)] (len=100000000, tl=0) 1,1,1,1,1,...
> system.time(a[123L] <- a[321L])
   user  system elapsed 
      0       0       0

Yes, very nice. Likely relevant is the section beginning with "However, if the computation is done by returning a named value from within a function" in [this post](https://stat.ethz.ch/pipermail/r-devel/2011-November/062654.html) from a very interesting R-devel thread on the topic of how/when NAMED's value gets set to 2. — Josh O'Brien, May 07 '13 at 17:54

Arun · Answer 2 · 2013-05-07T17:42:17.463

Edit following Joshua's comment: The behaviour shown below is limited to R-studio!!

To answer OP's question, the underlying reason for the copy (as @MartinMorgan explains) is due to the NAM(2) SEXP object for a. If the first command were to not include a system.time(.), then a <- rep(1, 10^8) returns a NAM(1) type which then results in no copy on both the assignments.

Observation in R-studio:

However, to point out another interesting observation/difference, if you were to run in R-studio, there's an additional behaviour difference (from that of R64/R32 session) you may not be aware of.

The difference (in R studio) seems, to stem from how you run your code. That is, if you copy and paste all at once (as shown below, including the output):

system.time(a <- rep(1L, 10^8))
#    user  system elapsed 
#   0.256   0.263   0.526 
.Internal(inspect(a))
# @10745d000 13 INTSXP g0c7 [NAM(2)] (len=100000000, tl=0) 1,1,1,1,1,...
system.time(a[222L] <- 111L)
#    user  system elapsed 
#   0.299   0.199   0.498 
.Internal(inspect(a))
# @11f1d6000 13 INTSXP g0c7 [NAM(1)] (len=100000000, tl=0) 1,1,1,1,1,...
system.time(a[333L] <- 111L)
#    user  system elapsed 
#       0       0       0 
.Internal(inspect(a))
# @11f1d6000 13 INTSXP g1c7 [MARK,NAM(1)] (len=100000000, tl=0) 1,1,1,1,1,...

You see that the 2nd assignment doesn't involve a memory copy and time required is 0 seconds. Now, copy/paste/execute the same set of commands, but now one-by-one (hit enter after every line before typing the next). Here are the results:

system.time(a <- rep(1L, 10^8))
#    user  system elapsed 
#   0.256   0.265   0.588 
> 
.Internal(inspect(a))
# @10745d000 13 INTSXP g0c7 [NAM(2)] (len=100000000, tl=0) 1,1,1,1,1,...

system.time(a[222L] <- 111L)
#    user  system elapsed 
#   0.302   0.204   0.559 

.Internal(inspect(a))
# @11f1d6000 13 INTSXP g0c7 [NAM(2)] (len=100000000, tl=0) 1,1,1,1,1,...

system.time(a[333L] <- 111L)
#    user  system elapsed 
#   0.296   0.208   0.504 
> 
.Internal(inspect(a))
# @10745d000 13 INTSXP g0c7 [NAM(2)] (len=100000000, tl=0) 1,1,1,1,1,...

For the same syntax, here a copy is being made and the runtime is 0.5 seconds.

Now to explain the difference (as @MartinMorgan explained in his answer):

For the first case, being a NAM(2) SEXP object, it gets duplicated during assignment. However, this happens only once in the first case when you run all the lines at once. Also to note is that the second assignment has a MARK (unsigned int) which denotes "mark object as in use" (from the R-internals).

In the second case, in R-studio, hitting enter for every line results in each of these assignments returning back a NAM(2) SEXP object. And so, a copy is being made every single time.

I cannot replicate this in R-2.15.2 on 32-bit WinXP or R-2.15.3 on 64-bit Ubuntu. — Joshua Ulrich, May 07 '13 at 17:26
Yes, this is R-studio's doing. Explanation here: http://stackoverflow.com/questions/15559387/operator-in-rstudio-and-r/15559956#15559956 — Josh O'Brien, May 07 '13 at 17:36

First bracketed assignment is as time-consuming as full assignment?

2 Answers2

Edit following Joshua's comment: The behaviour shown below is limited to R-studio!!

Observation in R-studio:

Linked