
In Python I can do this:

import numpy as np

a = np.arange(100)
print(id(a))  # shows some number
a[:] = np.cumsum(a)
print(id(a))  # shows the same number

What I did here was to replace the contents of a with its cumsum. The address before and after is the same.

Now let's try it in R:

install.packages('pryr')
library(pryr)
a = 0:99
print(address(a)) # shows some number
a[1:length(a)] = cumsum(a)
print(address(a)) # shows a different number!

The question is: how can I overwrite already-allocated memory in R with the results of computations? The lack of this ability seems to be causing significant performance discrepancies when I do vector operations in R versus Rcpp (writing the code in C++ and calling it from R, which lets me avoid unnecessary allocations).
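For concreteness, the kind of Rcpp workaround I mean looks roughly like this (just a sketch; cumsum_inplace is an illustrative name, and it assumes the target vector is already a double and is not shared elsewhere):

library(Rcpp)
cppFunction('
void cumsum_inplace(NumericVector x) {
  // NumericVector wraps the existing R allocation, so writes go straight into it
  for (int i = 1; i < x.size(); ++i) x[i] += x[i - 1];
}')

a = as.numeric(0:99)  # already a double, so Rcpp does not coerce (and copy)
print(address(a))
cumsum_inplace(a)
print(address(a))     # same address: the buffer was updated in place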

I'm using R 3.1.1 on Ubuntu Linux 10.04 with 24 physical cores and 128 GB of RAM.

John Zwinck
  • Everything in R is a copy, except when you use `data.table` – James Aug 19 '14 at 10:00
  • @James, that isn't accurate. R has `copy-on-modify` semantics, and it avoids copies under certain circumstances. See [here](http://stackoverflow.com/questions/15759117/what-exactly-is-copy-on-modify-semantics-in-r-and-where-is-the-canonical-source) – David Arenburg Aug 19 '14 at 10:06
  • @James, try `a = 0:99 ;print(address(a)) ;b <- a ; print(address(b))` for example – David Arenburg Aug 19 '14 at 10:15
  • To add to David's point, R 3.1+ avoids deep copies and instead *shallow* copies in many places (which improves performance tremendously). – Arun Aug 19 '14 at 13:23

2 Answers


Try the data.table package. It allows updating values by reference using the := operator (as well as the set function):

library(data.table)
A <- data.table(a = seq_len(99))

address(A)   # [1] "0x108d283f0"
address(A$a) # [1] "0x108e548a0"

options(datatable.verbose=TRUE)
A[, a := cumsum(a)]
# Detected that j uses these columns: a 
# Assigning to all 99 rows
# Direct plonk of unnamed RHS, no copy. <~~~ no copy of `A` or `A$a` is made.

address(A)   # [1] "0x108d283f0"
address(A$a) # [1] "0x1078f5070"

Note that even though the address of A$a differs after updating by reference, no copy is made here. The address changes because this is a full column plonk: the vector cumsum(a) replaces the current column a (by reference), so the address you see is essentially the address of cumsum(a).
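The set function mentioned above works the same way; here is a minimal sketch (using a fresh A for clarity):

A <- data.table(a = seq_len(99))
set(A, j = "a", value = cumsum(A$a))  # updates the column by reference, like :=
address(A)                            # A itself has not been copied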

Arun
David Arenburg
  • You're taking the address of `A`; check out the address of `A$a`. – Martin Morgan Aug 19 '14 at 13:32
  • @MartinMorgan, made the edit to clarify things. The address will be different because of the full column plonk, but there's no *copy* being made. – Arun Aug 19 '14 at 15:29
  • Thanks @Arun. I also think it can still be useful to show that `address(A)` hasn't changed (unlike if, for example, `A` were a `data.frame`) in order to illustrate your point. – David Arenburg Aug 19 '14 at 15:37
  • I guess data.table calls R's cumsum, which creates a new vector both in data.table and when called from the global environment, so using data.table is not helping with this aspect of memory management. An illustration of @Arun's point about improved memory management in base R is that if A were a data frame and had a second column b, b would not be duplicated (whereas in previous R versions it would have been). – Martin Morgan Aug 19 '14 at 15:57
  • @MartinMorgan, not really. The RHS (`cumsum(a)`) of the `j-expression` (`a := cumsum(a)`) is extracted and then evaluated, then assigned back to `x`. A new vector is created just once (which can't be avoided in your case either). However, I do see an increase in memory usage compared to your answer, by the same amount as the size of 'a' (tested on quite large data, ~1.5Gb). That is most likely a bug (it has nothing to do with the cumsum part at all; there's an unnecessary copy of dt's columns happening). – Arun Aug 19 '14 at 16:44

I did this

> x = 1:5
> .Internal(inspect(x))
@3acfed60 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
> x[] = cumsum(x)
> .Internal(inspect(x))
@3acfed60 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,3,6,10,15

where the @3acfed60 is the (shared) memory address. The key is NAM(1), which says that there's a single reference to x, hence no need to re-allocate on update.

R uses (currently; I think this will change in the next release) a version of reference counting in which an object is referenced 0, 1, or more than 1 times. Once an object is referenced more than once, its reference count can't be decremented, because 'more than one' could mean 2 or 3, so there is no way to tell the difference between dropping from 2 references to 1 and dropping from 3 to 2. Any attempt at modification therefore needs to duplicate.
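As a small illustration of that rule (a sketch, not a pasted transcript; the exact NAMED values you see depend on your R version and front end):

x = 1:5
.Internal(inspect(x))   # NAM(1): a single reference, so x can be updated in place
y = x                   # a second reference; NAMED becomes 2
x[1] = 10L              # modifying x now forces a duplicate first
.Internal(inspect(x))   # a different address from the first inspect
.Internal(inspect(y))   # y still holds the original, unmodified vector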

Originally I forgot to load pryr and wrote my own address()

> address = function(x) .Internal(inspect(x))

which reveals an interesting problem

> x = 1:5
> address(x)
@4647128 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
> x[] = cumsum(x)
> address(x)
@4647098 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,3,6,10,15

Notice NAM(2), which says that inside the function there are at least two references to x: one in the global environment and one in the function environment. So merely touching x inside a function triggers future duplication, a sort of Heisenberg uncertainty principle. cumsum (and .Internal, and length) are written in a way that allows them to reference x without incrementing NAMED; address() should be revised to behave similarly (this has since been fixed).

Hmm, when I dig a little deeper I see (I guess it's obvious, in retrospect) that what actually happens is that cumsum(x) does allocate memory for a new S-expression

> x = 1:5
> .Internal(inspect(x))
@3bb1cd0 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
> .Internal(inspect(cumsum(x)))
@43919d0 13 INTSXP g0c3 [] (len=5, tl=0) 1,3,6,10,15

but the full-index assignment x[] <- then copies those values back into x's existing location. (This seems to be 'as efficient' as data.table, which apparently also creates an S-expression for cumsum, presumably because it calls cumsum itself!) So mostly I've not been helpful in this answer...
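A complementary check is base R's tracemem(), which reports whenever an object is duplicated; a sketch (results differ in RStudio, whose environment pane inspects objects behind the scenes):

x = as.numeric(1:5)
tracemem(x)         # mark x: a message is printed whenever R duplicates it
x[] = cumsum(x)     # in a plain console, usually no message: the buffer is reused
y = x               # create a second reference
x[1] = 0            # now a duplication is reported before the write
untracemem(x)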

It's not likely that the allocation per se causes performance problems, but rather garbage collection (use gcinfo(TRUE) to see these) of the no-longer-used memory. I find it useful to launch R with

R --no-save --quiet --min-vsize=2048M --min-nsize=45M

which starts with a larger memory pool and hence fewer (initial) garbage collections. It would be useful to analyze your coding style to understand why you find this to be the performance bottleneck.
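To see whether collections really are the cost, something like this sketch (sizes and iteration counts are arbitrary) prints a line at each collection:

gcinfo(TRUE)                       # report each garbage collection as it happens
x = runif(1e7)
for (i in 1:10) x[] = cumsum(x)    # each cumsum() allocates a ~80 MB temporary
gcinfo(FALSE)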

Martin Morgan