I did this:
> x = 1:5
> .Internal(inspect(x))
@3acfed60 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
> x[] = cumsum(x)
> .Internal(inspect(x))
@3acfed60 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,3,6,10,15
where @3acfed60 is the (shared) memory address. The key is NAM(1), which says that there is a single reference to x, hence no need to re-allocate on update.
R uses (currently; I think this will change in the next release) a version of reference counting in which a value is referenced by 0, 1, or more than 1 symbol. Once an object is referenced more than once, its reference count can't be decremented (because 'more than one' could mean 3, so there is no way to distinguish 2 references from 3, hence no way to distinguish one less than 2 from one less than 3), and any attempt at modification has to duplicate.
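As an illustration (a minimal sketch, assuming the pre-reference-counting NAMED semantics described above; addresses and the exact inspect() output will differ on your machine):
x <- 1:5
.Internal(inspect(x))   # expect NAM(1): a single reference, updates can happen in place
y <- x                  # a second binding to the same vector; NAMED becomes 2
.Internal(inspect(x))   # expect NAM(2): any modification must now duplicate
x[1] <- 10L             # triggers a copy of x; y still sees 1,2,3,4,5
.Internal(inspect(x))   # new address
.Internal(inspect(y))   # old address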
Originally I forgot to load pryr, so I wrote my own address()
> address = function(x) .Internal(inspect(x))
which reveals an interesting problem:
> x = 1:5
> address(x)
@4647128 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
> x[] = cumsum(x)
> address(x)
@4647098 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,3,6,10,15
Notice NAM(2), which says that inside the function there are at least two references to x, i.e., one in the global environment and one in the function environment. So merely touching x inside a function triggers future duplication, a sort of Heisenberg uncertainty principle. cumsum (and .Internal, and length) are written in a way that allows the reference without incrementing NAMED; address() should be revised to have similar behavior (this has now been fixed).
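The same effect shows up with any ordinary closure (again a minimal sketch under the old NAMED semantics; touch() below is just a hypothetical do-nothing function, and the exact inspect() output depends on your R build and version):
touch <- function(x) invisible(x)   # hypothetical closure that merely looks at its argument
x <- 1:5
.Internal(inspect(x))   # expect NAM(1)
length(x)               # primitive: does not bump NAMED
.Internal(inspect(x))   # still NAM(1)
touch(x)                # closure: the argument gains a second reference
.Internal(inspect(x))   # now NAM(2), so the next x[] <- ... will copy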
Hmm, when I dig a little deeper I see (I guess it's obvious, in retrospect) that what actually happens is that cumsum(x) does allocate memory for a new S-expression:
> x = 1:5
> .Internal(inspect(x))
@3bb1cd0 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
> .Internal(inspect(cumsum(x)))
@43919d0 13 INTSXP g0c3 [] (len=5, tl=0) 1,3,6,10,15
but the assignment x[] <- associates the new memory with the old location (??). (This seems to be 'as efficient' as data.table, which apparently also creates an S-expression for cumsum, presumably because it's calling cumsum itself!) So mostly I've not been helpful in this answer...
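tracemem() offers another way to check whether x itself gets copied by the subassignment (a minimal sketch; tracemem reports duplications of the traced object, not the allocation done by cumsum(), it requires an R build with memory profiling enabled, and whether a copy is reported depends on your R version and on x having no extra references):
x <- 1:5
tracemem(x)          # start reporting copies of this particular vector
x[] <- cumsum(x)     # no copy of x expected while it has a single reference
untracemem(x)        # stop tracing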
It's not likely that the allocation per se causes performance problems, but rather the garbage collection (use gcinfo(TRUE) to see these) of memory that is no longer used. I find it useful to launch R with
R --no-save --quiet --min-vsize=2048M --min-nsize=45M
which starts with a larger memory pool and hence fewer (initial) garbage collections. It would be useful to analyze your coding style to understand why you find this to be the performance bottleneck.
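For example, a minimal sketch of how I would watch for collections around the slow code (the placeholder comment marks where your own code would go):
gcinfo(TRUE)     # print a message at every garbage collection
## ... run the cumsum-heavy code here ...
gcinfo(FALSE)    # turn the reporting back off
gc()             # summary of memory currently in use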