Not really an answer, but longer than a comment. Ben, this
fun0 = function(x) sum(x, gc())
defines a function that calculates the sum of "x and the value returned by gc()". This
fun1 = function(x) sum(x); gc()
defines a function that returns the sum of x. gc() is run once, after the function is defined, but is not part of the function definition. This
fun2 = function(x) {
    result = sum(x)
    gc()
    result
}
defines a function that calculates the sum of x and saves it to a variable result that exists inside the function. It then evaluates the function gc(). It then returns the value contained in result, i.e., the sum of x.
It's worth comparing results in addition to times:
test_case = 1:5
identical(sum(test_case), fun0(test_case)) # FALSE
identical(sum(test_case), fun1(test_case)) # TRUE, but no garbage collection
identical(sum(test_case), fun2(test_case)) # TRUE
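To see why the fun0 comparison fails, it can help to look at what gc() returns. This is only a small sketch; the variable name g is illustrative and not part of the original question.

```r
# gc() returns a matrix of memory usage statistics, stored as doubles
g = gc()
is.matrix(g)             # TRUE
typeof(g)                # "double"

fun0 = function(x) sum(x, gc())
test_case = 1:5

# sum() adds up every value it is given, so fun0 folds the memory
# statistics into the result; the type changes as well
typeof(sum(test_case))   # "integer"
typeof(fun0(test_case))  # "double"
identical(sum(test_case), fun0(test_case))  # FALSE
```

The exact numbers in g depend on the R session, so the value returned by fun0 is not reproducible either.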
Invoking gc() in fun2 doesn't really accomplish anything after the first time fun2 is evaluated. There is no memory that has been allocated but is no longer associated with a symbol, so there is no garbage to collect. Here's a case where we allocate some memory, use it, remove the reference to it, and then run the garbage collector to release the memory.
fun3 = function(x) {
    m = rnorm(length(x))
    result = sum(m * x)
    rm(m)
    gc()
    result
}
BUT EXPLICIT GARBAGE COLLECTION DOES NOT DO ANYTHING USEFUL HERE: the garbage collector runs automatically when R needs more memory than it has available. If fun3 has been invoked several times, then there will be memory used inside each invocation that is no longer referenced by a symbol, and hence will be collected when the garbage collector runs automatically. By invoking gc() directly, you're asserting that your naive garbage collection strategy (do it all the time) is better than R's (do it when more memory is needed). One might be able to do that (write a better garbage collector), but it isn't the case here.
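One way to check this for yourself is to compare fun3 with a variant that leaves memory management to R. The sketch below assumes a hypothetical function fun3_auto (not from the question) and fixes the random seed so the two results can be compared directly.

```r
fun3 = function(x) {
    m = rnorm(length(x))
    result = sum(m * x)
    rm(m)
    gc()
    result
}

# hypothetical variant: same calculation, no explicit rm() or gc()
fun3_auto = function(x) {
    m = rnorm(length(x))
    sum(m * x)
}

x = 1:1000

# same seed, so both functions draw the same random numbers
set.seed(123); r_explicit = fun3(x)
set.seed(123); r_auto = fun3_auto(x)
identical(r_explicit, r_auto)   # TRUE

# timings vary by machine; the explicit gc() calls are pure overhead here
system.time(for (i in 1:20) fun3(x))
system.time(for (i in 1:20) fun3_auto(x))
```

The results are the same either way; only the time spent differs.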
I mentioned that it often pays when confronted with performance or memory issues to step back and look at your algorithm and implementation. I know this is a 'toy' example, but let's look anyway. What you're calculating is the cumulative sum of the elements of x. I'd have written your implementation as
fun4 = function(i, x) sum(x[seq_len(i)])
sapply(seq_along(test_case), fun4, test_case)
which gives
> x0 <- sapply(seq_along(test_case), fun4, test_case)
> x0
[1]  1  3  6 10 15
But R has a function cumsum that does this more efficiently in terms of both memory and speed.
> x1 <- cumsum(test_case)
> identical(x0, x1)
[1] TRUE
> test_case = seq_len(10000)
> system.time(x0 <- sapply(seq_along(test_case), fun4, test_case))
   user  system elapsed
  2.508   0.000   2.517
> system.time(x1 <- cumsum(test_case))
   user  system elapsed
  0.004   0.000   0.002
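A rough way to see where the difference comes from is to count how many elements each approach has to add up. This is only a back-of-the-envelope sketch: the names work_sapply and work_single_pass are illustrative, and it assumes cumsum keeps a running total and so visits each element once.

```r
n = 10000

# fun4 sums the first i elements for every i, i.e. 1 + 2 + ... + n elements
# (as.numeric avoids integer overflow for larger n)
work_sapply = sum(as.numeric(seq_len(n)))   # 50005000

# a running total visits each element once
work_single_pass = n

work_sapply / work_single_pass              # 5000.5
```

On top of the extra additions, x[seq_len(i)] allocates a new vector of length i on every call, which is where the extra memory goes.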