
What's the fastest way to do garbage collection in an `lapply` call or loop? What seems obvious to me slows things down enormously. Am I doing it wrong? Is there a quicker way?

x <- 1:10000
system.time(xx <- lapply(1:length(x), function(xi) sum(x[1:xi])))
 user  system elapsed 
   0.02    0.00    0.02 
system.time(xx <- lapply(1:length(x), function(xi) sum(x[1:xi], invisible(gc(v=FALSE)))))
   user  system elapsed 
  22.49    0.00   22.57 # a thousand times increase in time taken!!

In my actual use case the function is a bit more complex and fails unless `gc` is run after every iteration. I could switch to a machine with more RAM, though that would be less convenient, so I'm curious whether a faster `gc` method is available.

UPDATE: Following Martin Morgan's suggestion, rearranging things slightly brings the speed close to that of the `lapply` without `gc` (I'm now working on a different machine, which is why these timings differ from those above):

x <- 1:10000
system.time(x1 <- lapply(1:length(x), function(xi) sum(x[1:xi])))
   user  system elapsed 
   3.47    0.00    3.56 
# define a function to make a sequence of a function followed by gc
sum_gc <- function(x) sum(x); invisible(gc(v=FALSE))
system.time(x3 <- lapply(1:length(x), function(xi) sum_gc(x[1:xi])))
   user  system elapsed 
   3.52    0.02    3.56 
Ben
  • Do you really need to call `gc` at every iteration? It might be sufficient to call it every 100 steps or something like that (see the sketch after these comments). Maybe you can also optimize your function to need less RAM? Of course, if you can get more RAM, that's not a bad idea. – Roland Apr 20 '13 at 18:48
  • one rarely needs to call `gc` directly! I don't think it does anything at all for memory use in the present example. And the syntax is weird -- you're providing the result of the call to `gc` as an argument to `sum`! Often the solution to memory and performance issues is a better algorithm rather than a bigger machine. – Martin Morgan Apr 20 '13 at 19:21
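
A minimal sketch of Roland's idea, collecting only on every Nth iteration (the interval of 100 and the `%%` check are illustrative, not from the original exchange):

x <- 1:10000
xx <- lapply(seq_along(x), function(xi) {
    if (xi %% 100 == 0) invisible(gc(v = FALSE))  # collect on every 100th step only
    sum(x[1:xi])
})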

1 Answer


Not really an answer, but longer than a comment. Ben, this

fun0 = function(x) sum(x, gc())

defines a function that calculates the sum of "x and the value returned by gc()" -- gc() returns a matrix of memory-usage statistics, so the result is not the sum of x at all. This

fun1 = function(x) sum(x); gc()

defines a function that returns the sum of x. gc() runs once, when the line defining the function is evaluated, but it is not part of the function definition.

fun2 = function(x) {
    result = sum(x)
    gc()
    result
}

defines a function that calculates the sum of x and saves it to a variable result that exists inside the function. It then calls gc(), and finally returns the value contained in result, i.e., the sum of x. It's worth comparing results in addition to times:

test_case = 1:5
identical(sum(test_case), fun0(test_case))  # FALSE
identical(sum(test_case), fun1(test_case))  # TRUE, but no garbage collection
identical(sum(test_case), fun2(test_case))  # TRUE

Invoking gc() in fun2 doesn't really accomplish anything after the first time fun2 is evaluated: there is no memory that has been allocated but is no longer associated with a symbol, so there is no garbage to collect. Here's a case where we allocate some memory, use it, remove the reference to it, and then run the garbage collector to release the memory.

fun3 = function(x) {
    m = rnorm(length(x))
    result = sum(m * x)
    rm(m)
    gc()
    result
}

BUT EXPLICIT GARBAGE COLLECTION DOES NOT DO ANYTHING USEFUL HERE -- the garbage collector automatically runs when R needs more memory than it has available. If fun3 has been invoked several times, then there will be memory used inside each invocation that is no longer referenced by a symbol, and hence will be collected when the garbage collector runs automatically. By invoking gc() directly, you're asserting that your naive garbage collection strategy (do it all the time) is better than R's (do it when more memory is needed).
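
As a side note (not part of the original answer), `gcinfo(TRUE)` asks R to print a message each time a collection happens, which makes those automatic collections visible:

old <- gcinfo(TRUE)              # report every garbage collection from now on
for (i in 1:50) x <- rnorm(1e6)  # repeated allocation; earlier vectors become garbage
gcinfo(old)                      # restore the previous setting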

One might indeed be able to write a better garbage collector than R's, but that isn't the case here.

I mentioned that, when confronted with performance or memory issues, it often pays to step back and look at your algorithm and implementation. I know this is a 'toy' example, but let's look anyway. What you're calculating is the cumulative sum of the elements of x. I'd have written your implementation as

fun4 = function(i, x) sum(x[seq_len(i)])
sapply(seq_along(test_case), fun4, test_case)

which gives

> x0 <- sapply(seq_along(test_case), fun4, test_case)
> x0
[1]  1  3  6 10 15

But R has a function cumsum that does this more efficiently in terms of both memory and speed.

> x1 <- cumsum(test_case)
> identical(x0, x1)
[1] TRUE
> test_case = seq_len(10000)
> system.time(x0 <- sapply(seq_along(test_case), fun4, test_case))
   user  system elapsed 
  2.508   0.000   2.517 
> system.time(x1 <- cumsum(test_case))
   user  system elapsed 
  0.004   0.000   0.002 
Martin Morgan
  • Yes, thanks, I'm familiar with what you say about the automatic nature of `gc` and the debates here: http://stackoverflow.com/q/1467201/1036500 Thanks for correcting my function and taking the time to explain. My use case is like `library(topicmodels); data(AssociatedPress); lapply(seq(20,40,10), function(d) LDA(AssociatedPress[1:20,], d))`, building a list of models with different numbers of topics. With a small number of topics it's fine, but as I gradually scale up (1000s of docs, etc.) things grind to a halt (or as good as; I get impatient after a month of machine time...). – Ben Apr 21 '13 at 07:35
  • @Ben calling `gc()` will __not__ help with that. – hadley Apr 21 '13 at 13:26
  • So more RAM/swap is my best option here? – Ben Apr 21 '13 at 15:32
  • No, rethink what you're trying to do. Do more work in the function provided to lapply, to reduce the large objects returned by `LDA()` to whatever it is you want to run downstream computations on. Maybe you're only interested in the log likelihood? Then `fun = function(d) logLik(LDA(AssociatedPress[1:20,], d))` (sketched after these comments). Each iteration creates and returns (i.e., continues to use memory for) an object of a few bytes, not a few megabytes. This helps until a single call to `LDA(AssociatedPress[1:x,], d)` consumes more time / memory than is available. Then it's time to think of a better implementation. – Martin Morgan Apr 21 '13 at 15:44
  • Yes, the LL is my main goal here. That seems like a good idea, thanks, I'll give it a shot. – Ben Apr 21 '13 at 15:57
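
A minimal sketch of the pattern Martin describes, reusing the code from Ben's comment (`topicmodels`, `AssociatedPress`, and `seq(20, 40, 10)` all come from that comment); each iteration now retains a log-likelihood of a few bytes rather than a full fitted model:

library(topicmodels)
data(AssociatedPress)
# keep only the log-likelihood, not the whole fitted LDA model
lls <- lapply(seq(20, 40, 10),
              function(k) logLik(LDA(AssociatedPress[1:20, ], k)))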