
Some R functions, like nrow, cause R to copy an object when it is modified AFTER the function call, while others, like sum, don't. For example, take the following code:

x = as.double(1:1e8)
system.time(x[1] <- 100)
y = sum(x)
system.time(x[1] <- 200)     ## Fast (takes 0s), after calling sum
foo = function(x) {
    return(sum(x))
}
y = foo(x)
system.time(x[1] <- 300)     ## Slow (takes 0.35s), after calling foo

Calling foo is NOT slow, because x isn't copied. However, changing x again is very slow, as x is copied. My guess is that calling foo will leave a reference to x, so when changing it after, R makes another copy.

Does anyone know why R does this, even when the function doesn't change x at all? Thanks.

user7648269
  • I am using RevolutionR Open 3.1.2. Here are my results for the three system.time calls (user, system, elapsed): 0, 0, 0; then 0, 0, 0; then 0.20, 0.13, 0.33. – user7648269 Mar 02 '17 at 15:37
  • Revolution R doesn't exist any more... might want to upgrade. Fwiw, I see the same thing you do on vanilla R 3.2.5. – Frank Mar 02 '17 at 15:40
  • Any idea why R does this? There seems to be no reason to make a copy when changing it AFTER the function call. Thanks. – user7648269 Mar 02 '17 at 15:43
  • This may be related: https://developer.r-project.org/Refcnt.html I don't know C or the R internals well enough to know how to determine the # references to an object at a given time. – Frank Mar 02 '17 at 15:46
  • `y = foo(x)` incremented the named attribute of `x` from 1 to 2, which forces the subsequent copy. Probably because R can't know for sure what side effects `foo` may have had. _You_ know that `foo` won't do anything strange, but it may be nearly impossible for R to know that in general. – joran Mar 02 '17 at 15:49
  • I can generate positive system time with `x = as.double(1:1e8); system.time(x[1] <- 100); system.time(x[1] <- 100); system.time(x[1] <- 100);`. with MS R open 3.2.5. Also, wrapping `x[1] <- 100` in `tracemem` indicates copies are being made. – lmo Mar 02 '17 at 15:55
  • @Frank `.Internal(inspect(x))` will tell you what the NAMED property is for an object, but I always have to remember to not do that stuff in RStudio, because for some reason the way it handles R everything comes back with NAMED of 2. – joran Mar 02 '17 at 15:55
  • Regarding @joran's comment, "_not do that stuff in RStudio_": [Operator “[<-” in RStudio and R](http://stackoverflow.com/questions/15559387/operator-in-rstudio-and-r). – Henrik Mar 02 '17 at 16:03
  • Here is a [related question](http://stackoverflow.com/questions/38766068/why-does-the-access-time-for-the-first-element-of-a-data-frame-depends-on-its-di/38766520#38766520). – lmo Mar 02 '17 at 16:09
  • @Frank Thanks for pointing out the link. It is very useful. Does anyone know when SWITCH_TO_REFCNT will be the default for compiling R? Or is there a compiled version of R for Windows 10 with SWITCH_TO_REFCNT enabled? Thanks. – user7648269 Mar 02 '17 at 21:03

1 Answer


I definitely recommend Hadley's Advanced R book, as it digs into some of the internals that you will likely find interesting and relevant. Most relevant to your question (and as mentioned by @joran and @lmo), the reason for the slow-down was an additional reference that forced copy-on-modify.
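If you want to watch that extra reference at work, here is a minimal sketch using base R's tracemem(), which reports when a traced object is duplicated. It reuses the question's foo wrapper but with a smaller vector so it runs quickly; the names y1 and y2 are only for illustration. I am deliberately not stating which assignment triggers a copy, since that depends on the R version and on whether R counts references via NAMED or via reference counting.

```r
# Smaller vector than in the question so the sketch runs quickly.
x <- as.double(1:1e6)

# tracemem() only works if R was built with memory profiling;
# capabilities("profmem") reports whether that is the case.
can_trace <- capabilities("profmem")
if (can_trace) tracemem(x)    # print a message whenever x is duplicated

foo <- function(v) sum(v)     # same wrapper as in the question

y1 <- sum(x)                  # direct call to a builtin
x[1] <- 200                   # watch for a tracemem message here

y2 <- foo(x)                  # call through an R-level closure
x[1] <- 300                   # and here; the outcome depends on the R version

if (can_trace) untracemem(x)
```

A tracemem line printed right after one of the assignments means R duplicated x before modifying it; no line means the change happened in place.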

An excerpt that might be beneficial from Memory#Modification:

There are two possibilities:

  • R modifies x in place.

  • R makes a copy of x to a new location, modifies the copy, and then uses the name x to point to the new location.

It turns out that R can do either depending on the circumstances. In the example above, it will modify in place. But if another variable also points to x, then R will copy it to a new location. To explore what’s going on in greater detail, we use two tools from the pryr package. Given the name of a variable, address() will tell us the variable’s location in memory and refs() will tell us how many names point to that location.
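The two pryr helpers named in the excerpt can be tried roughly as follows. This is only a sketch and assumes the pryr package is installed (the inspection calls are skipped otherwise); the values in x are arbitrary. I am not listing expected counts or addresses, because they vary with the R version and front end (RStudio in particular, as noted in the comments).

```r
x <- c(1.5, 2.5, 3.5)

# pryr is an add-on package, so only inspect if it is available.
have_pryr <- requireNamespace("pryr", quietly = TRUE)

if (have_pryr) {
  print(pryr::address(x))     # where x currently lives in memory
  print(pryr::refs(x))        # how many names point to that location
}

y <- x                        # a second name for the same vector

if (have_pryr) print(pryr::refs(x))

x[2] <- 10                    # with y also pointing at the data, R must copy

if (have_pryr) print(pryr::address(x))  # compare with the first address

print(y)                      # y still holds the original values
```

Whatever the counts look like on your setup, y keeps the old values, which is the guarantee that copy-on-modify exists to provide.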

Also of interest are the sections on R's C interface and Performance. The pryr package also has tools for working with these sorts of internals in an easier fashion.

One last note from Hadley's book (same Memory section) that might be helpful:

While determining that copies are being made is not hard, preventing such behaviour is. If you find yourself resorting to exotic tricks to avoid copies, it may be time to rewrite your function in C++, as described in Rcpp.

cole