
I'm using nested foreach loops from the doSMP package to generate results from a function I developed. Ordinarily the problem would use three nested loops, but because of the volume of results generated (around 80,000 rows for each i), I've had to pause computation and write the results to file whenever the accumulated results matrix exceeds a specified number of rows.

i = 1
write.off = 1

while(i <= length(i.vector)){
        results.frame = as.data.frame(matrix(NA, ncol = 3, nrow = 1))

        while(nrow(results.frame) < 500000 & i <= length(i.vector)){
                results = foreach(j = 1:length(j.vector), .combine = "rbind", .inorder = TRUE) %:%
                foreach(k = 1:length(k.vector), .combine = "rbind", .inorder = TRUE) %dopar%{

                        ith.value = i.vector[i]
                        jth.value = j.vector[j]
                        kth.value = k.vector[k]
                        my.function(ith.value, jth.value, kth.value)
                }

                results.frame = rbind(results.frame, results)
                i = i + 1
        }

        results.frame = results.frame[-1,]
        write.table(results.frame, paste("part_",write.off, sep = ""))
        write.off = write.off + 1   
}

The problem I'm having is with garbage collection. The workers don't seem to release memory back to the system, so by i = 4 each of them has eaten up around 6GB of memory.

I've tried inserting gc() into the foreach loop directly as well as into the underlying function, and I've also tried assigning the function and its results to a named environment that I can clear periodically. None of these methods have worked.

I feel like foreach's initEnvir and finalEnvir parameters might offer a solution, but the documentation and examples haven't really shed much light on this.

I'm running this code on a VM running Windows Server 2008.

MrT
    If you know how many rows you're going to fill (i.e. length(i.vector)), you can save a lot of time and memory by setting up results.frame once. Doing rbind or other things each time through takes a lot of cpu effort. Take a look at vectorizing better. Also: you really should use `<-` rather than `=`. Trust us :-) – Carl Witthoft Nov 03 '11 at 15:11
  • 1
    I'd also point out that since the outer loop is `i <= length(i.vector)` you have no use for that same conditional in the inner loop. Take some time to figure out what you really want to do here. – Carl Witthoft Nov 03 '11 at 15:15
  • The outer while loop is to keep the code running until it hits the very last element in i.vector (there are over 1.8M). The reason for the second loop is to break off computation and save the results periodically. The i <= length(i.vector) in the second while loop is just a sanity check, in case the very last results.frame is under 500K rows when i = length(i.vector). – MrT Nov 03 '11 at 15:38
  • I have the same issue and occasionally I have to kill all my R processes because they are eating the server RAM (512 Gb). – Matteo De Felice Apr 17 '14 at 08:45
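Carl's point about rbind in the comments can be illustrated with a toy comparison (sizes and column layout here are made up, not the real data): growing a data frame row by row copies it on every iteration, while pre-allocating once and filling by index does not.

```r
n <- 1000

# Grow the frame one row at a time: rbind() copies the whole frame each pass,
# so the total work is quadratic in n.
grow <- function() {
  out <- data.frame(a = numeric(0))
  for (i in 1:n) out <- rbind(out, data.frame(a = i))
  out
}

# Pre-allocate the full frame once and fill in place.
prealloc <- function() {
  out <- data.frame(a = numeric(n))
  for (i in 1:n) out$a[i] <- i
  out
}

all(grow()$a == prealloc()$a)  # TRUE: same result either way
```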

1 Answer


You might consider avoiding this issue altogether by writing a different loop.

Consider using the gen.factorial function in AlgDesign, a la:

fact1 = gen.factorial(c(length(i.vector), length(j.vector), length(k.vector)), nVars = 3, center = FALSE)  # with center = FALSE, each row holds 1-based level indices
foreach(ix_row = 1:nrow(fact1), .combine = "rbind") %dopar% {
  my.function(i.vector[fact1[ix_row, 1]], j.vector[fact1[ix_row, 2]], k.vector[fact1[ix_row, 3]])
}
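If AlgDesign isn't handy, base R's expand.grid() builds the same enumeration of combinations (the toy vectors below stand in for the real i.vector / j.vector / k.vector):

```r
# Base-R alternative to gen.factorial(): enumerate every (i, j, k) combination.
i.vector <- c(10, 20)
j.vector <- c(1, 2, 3)
k.vector <- c(0.1, 0.2)

grid <- expand.grid(i = i.vector, j = j.vector, k = k.vector)
nrow(grid)  # 2 * 3 * 2 = 12 rows, one per combination
```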

You could also pre-allocate the output storage as a memory-mapped file using bigmemory (assuming you're creating a matrix), which would make it feasible for each worker to store its output on its own.
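A minimal sketch of that bigmemory pattern (the my.function and combination grid here are toy stand-ins; the loop is shown serially with %do% so it runs without a registered backend):

```r
library(bigmemory)
library(foreach)

# Toy stand-in: assumed to return a length-3 numeric vector.
my.function <- function(i, j, k) c(i, j, k)
combos <- expand.grid(i = 1:2, j = 1:2, k = 1:2)

# Pre-allocate the full output as a file-backed matrix once; rows are then
# written in place instead of being returned and rbind-ed on the master.
out <- filebacked.big.matrix(nrow(combos), 3, type = "double",
                             backingfile = "results.bin",
                             backingpath = tempdir(),
                             descriptorfile = "results.desc")
desc <- describe(out)

foreach(r = 1:nrow(combos)) %do% {
  # With a parallel backend this would be %dopar%, and each worker would
  # first attach to the shared file: m <- attach.big.matrix(desc)
  out[r, ] <- my.function(combos$i[r], combos$j[r], combos$k[r])
  NULL  # return nothing, so no results accumulate on the master
}
```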

In this way, your overall memory usage should drop dramatically.


Update 1: It seems that memory issues are endemic to doSMP. Check out the following posts:

I recall seeing another memory issue for doSMP, either on as a question or in the R chat, but I can't seem to recover the post.

Update 2: I don't know if this will help, but you might try using an explicit return() (e.g. return(my.function(ith.value, jth.value, kth.value))). In my code, I generally use an explicit return() for clarity.

Iterator
    Thanks for the advice, but my problem is with the memory that the workers (the Rterm processes) use up. The issue applies to foreach in general. The only way I've been able to get the workers to release memory is through stopWorkers(). – MrT Nov 03 '11 at 21:11
  • I wonder if that's a problem on Windows & `doSMP`. I've not noticed any memory leaks like this in my work on Linux, using doMC. Btw, `foreach` is a separate package, for which `doSMP` and `doMC` are backends. – Iterator Nov 03 '11 at 21:50
  • Now that I think about it, others have mentioned memory problems with doSMP. I'll amend my answer to link to them. – Iterator Nov 03 '11 at 21:52
  • Thanks for the followup Iterator. I looked through the posts and tried using an explicit return, but alas, to no avail. – MrT Nov 04 '11 at 13:50
  • Maybe this should be brought up with Revolution (i.e. the maintainers)? If you get a solution, please post! :) – Iterator Nov 04 '11 at 13:55