
My code looks as follows (it's a slightly simplified version of the original, but it still reflects the problem).

require(VGAM)

Median.sum  = vector(mode="numeric", length=75) 
AA.sum      = vector(mode="numeric", length=75)                                                    
BB.sum      = vector(mode="numeric", length=75)                   
Median      = array(0, dim=c(75 ,3)) 
AA          = array(0, dim=c(75 ,3))                                                    
BB          = array(0, dim=c(75 ,3))                              

y.sum     = vector(mode="numeric", length=100000)
y         = array(0, dim=c(100000,3))
b.size    = vector(mode="numeric", length=3) 
c.size    = vector(mode="numeric", length=3) 


for (h in 1:40)
{
  for (j in 1:75)
  {  
    for (i in 1:100000)
    {
      y.sum[i] = 0

      for (f in 1:3)
      {
        b.size[f] = rbinom(1, 30, 0.9)
        c.size[f] = 30 - rbinom(1, 30, 0.9) + 1
        y[i, f] = sum( rlnorm(b.size[f], 8.5, 1.9) ) + 
          sum( rgpd(c.size[f], 120000, 1870000, 0.158) )
        y.sum[i] = y.sum[i] + y[i, f]
      }
    }

    Median.sum[j] = median(y.sum)
    AA.sum[j] = mean(y.sum)
    BB.sum[j] = quantile(y.sum, probs=0.85)

    for (f in 1:3)
    {
      Median[j,f] = median(y[,f])
      AA[j,f] = mean(y[,f])
      BB[j,f] = quantile(y[,f], probs=0.85)
    }
  }
  #gc()
}

It breaks in the middle of its execution (h=7, j=1, i=93065) with an error:

Error: cannot allocate vector of size 526.2 Mb

Just after getting this message I read this, this & this, but it's still not enough. The thing is that neither the garbage collector (gc()) nor clearing all the objects from the workspace helps. I mean that I've tried putting both into my code: the garbage collector, and an operation that removes all the variables and declares them once again within the loop (take a look at the place where #gc() is; the latter, however, is not included in the code I've posted).

It seems strange to me, as the whole procedure uses the same objects in each step of the loop (and so should consume the same amount of memory in each step). Why does the memory consumption increase over time?

To make matters worse, if I want to keep working in the same R session and even perform:

rm(list=ls())
gc()

I still get the same error message, even when declaring something minor like:

abc = array(0, dim=c(10,3))

Only closing R and starting a new session helps. Why? Is there perhaps some way to recode my loop?

R: 2.15.1 (32-bit), OS: Windows XP (32-bit)

I am quite new here, so every tip is appreciated! Thanks in advance.


Edit: (From Arun). I find this behaviour even easier to reproduce just with a simple example. Start a new R session and copy and paste this code and watch the memory grow in your system monitor.

mm <- rep(0, 1e4) # initialise a vector
for (i in 1:1e3) {
    for (j in 1:1e3) {
        for (k in 1:1e4) {
            mm[k] <- k # already pre-allocated
         }
    }
}
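If you prefer to watch this from within R rather than a system monitor: on Windows (the OP's platform), `memory.size()` reports R's current memory use in MB without forcing a collection. A minimal sketch of the same loop with a readout, with the outer loop shortened so it finishes in reasonable time:

```r
mm <- rep(0, 1e4)             # initialise a vector
for (i in 1:20) {
    for (j in 1:1e3) {
        for (k in 1:1e4) {
            mm[k] <- k        # already pre-allocated
        }
    }
    # memory.size() is Windows-only; it reports MB in use without running gc()
    cat("i =", i, "memory in use:", memory.size(), "MB\n")
}
```

On other platforms you can print `gc()` instead, but note that calling `gc()` itself triggers a collection and may mask the growth you are trying to observe.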
brunner
  • Where does `rgpd` come from? Memory usage increases in time because you're growing your `y.sum`. – Roman Luštrik Mar 24 '13 at 11:34
  • @Roman: rgpd draws a random value from the generalized Pareto distribution. Is y.sum really growing? It's declared at the very beginning of the code. – brunner Mar 24 '13 at 11:40
  • Oh, I didn't see the top of your code. You've pre-allocated your objects. I take it back. Hum. – Roman Luštrik Mar 24 '13 at 11:46
  • Josh, (+1) very nice post. – Arun Mar 24 '13 at 12:13
  • @Hemmo, the problem is reproducible (the memory keeps increasing) even if you replace all RHS with dummy values. No need of rbinom or any package here... I see in my system monitor that the memory just keeps increasing... And he's using a 32-bit R! – Arun Mar 24 '13 at 12:36
  • @Hemmo: Yes, each time I overwrite my values. The two inner loops constitute my Monte Carlo simulation based on internal data. As for the outermost loop (h in 1:40), I want to repeat my MC simulation for the previous 40 months. In the original loop, a change of "h" in the outer loop makes the inner procedure use different sets of data kept in external CSV files. I know the code posted above looks silly, as in each step it performs exactly the same operation, but in the original code it makes sense. – brunner Mar 24 '13 at 12:36
  • @Arun: Why does 32-bit R matter in this case? – brunner Mar 24 '13 at 12:40
  • It takes 870MB with `for (i in 1:1000)` (with all dummy values)... – Arun Mar 24 '13 at 12:40
  • 1
    @brunner [Very bluntly](http://windows.microsoft.com/is-is/windows-vista/32-bit-and-64-bit-windows-frequently-asked-questions), "The terms 32-bit and 64-bit refer to the way a computer's processor (also called a CPU), handles information. The 64-bit version of Windows handles large amounts of random access memory (RAM) more effectively than a 32-bit system" – Arun Mar 24 '13 at 12:43
  • In general, memory allocation and handling in 64-bit is better than in 32-bit. I mean that you run into "out-of-memory" trouble sooner than, say, someone who runs with 64-bit. – Arun Mar 24 '13 at 12:45
  • @Arun: I'm sorry, but I don't follow.. 870MB of RAM to perform a simple loop of 1000 steps? EDIT: Thanks for the post above! – brunner Mar 24 '13 at 12:46
  • @brunner! yes! it seems crazy!! Check my edit in your post. – Arun Mar 24 '13 at 12:55
  • @brunner, the 870MB was for your code's same nested loop, but the `i` value replaced to 1000 instead of 100000. – Arun Mar 24 '13 at 12:59
  • Seems like not everyone is able to reproduce this problem... – Arun Mar 24 '13 at 13:47
  • @Arun: Just tried your example in an R3.0.0 prerelease from January, under Win 7 both 32 and 64 bit versions. I couldn't reproduce the memory problems. – Richie Cotton Mar 24 '13 at 14:14
  • Guys, what does RHS mean? Thanks. – brunner Mar 24 '13 at 14:15
  • @brunner, right hand side. – Arun Mar 24 '13 at 14:47
  • @RichieCotton, yes, seems so. Roman and some others weren't able to reproduce as well (on the R-public chat room). I'm trying to find out the issue. I run 2.15.3 64bit (with Rstudio) on Mac 10.8.3. – Arun Mar 24 '13 at 14:48
  • Am I correct - was it simply a bug in the previous version of R? – brunner Mar 24 '13 at 15:42
  • Potentially, at the point your loop stops, if R is not overwriting objects, but creating new ones, the minimum memory footprint you will require is ~ `( 1e5*48*75*6+(93065*48*75) ) / 1e6 Mb` RAM requirement, which is ~ 2.5Gb, which is nearly the limit of addressable RAM in 32bit R. At what point does R GC objects? It's explained in the R Internals document which I don't yet fully understand. – Simon O'Hanlon Mar 24 '13 at 15:46
  • I just installed R 3.0.0 beta and the same memory growth happens also here. Using standard 64bit Rgui, 64bit Windows 7. – Jouni Helske Mar 24 '13 at 17:26
  • 1
    I can reproduce the problem with the example by @Arun. I'm using R 2.15.3 64bit on Linux with RStudio. It is possible to observe step-wise memory jumps for the process `rstudio`: `170M, 194M, 224M, 259M, 301M, 347M, 408M, ...` – djhurio Mar 24 '13 at 17:27
  • @Arun If you add a call to `gc()` right after the `mm[k] <- k` line, the memory no longer grows. Seems the garbage collector isn't being run during the loop. – Matthew Lundberg Mar 24 '13 at 17:35
  • This doesn't happen if you change the code of @Arun so that you create the whole object again by using `mm <- rep(0, 1e4) ` in the innermost loop. – Jouni Helske Mar 24 '13 at 18:57

2 Answers

4

Add a call to `gc()` inside the `for (i in 1:100000)` loop.

Adding a call to gc() within the tight loop of Arun's code removes its memory growth.

This shows memory growth:

mm <- rep(0, 1e4) # initialise a vector
for (i in 1:1e3) {
    for (j in 1:1e3) {
        for (k in 1:1e4) {
            mm[k] <- k # already pre-allocated
         }
     }
 }

This does not:

mm <- rep(0, 1e4) # initialise a vector
for (i in 1:1e3) {
    for (j in 1:1e3) {
        for (k in 1:1e4) {
            mm[k] <- k # already pre-allocated
            gc()
         }
     }
 }

Something is awry with the automatic garbage collection here. The collector is being called in the first case, as `gcinfo(TRUE)` indicates, and yet the memory grows very quickly.
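To check this yourself, `gcinfo(TRUE)` makes R print a summary line every time a collection runs. A minimal sketch (loop counts cut down from the example above so the output stays readable):

```r
gcinfo(TRUE)                # report each garbage collection as it happens
mm <- rep(0, 1e4)
for (j in 1:100) {
    for (k in 1:1e4) {
        mm[k] <- k          # pre-allocated assignment, as in the example
    }
}
gcinfo(FALSE)               # turn the reporting back off
```

If collection lines appear in the console while memory still climbs in your system monitor, the collector is running but the process is not shrinking, which is exactly the symptom described here.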

Matthew Lundberg
  • I was just testing the same thing, this seems to work, but it doesn't answer the question why doesn't the automatic garbage collection work here? Also, if you first run the code without `gc()` for some time, and then run the second code, it still doesn't remove the memory reserved in the earlier code. – Jouni Helske Mar 24 '13 at 17:43
  • @Matthew: Thanks, I'll try putting gc() in the "very middle" loop. I'll let you know about the results. But still, Hemmo is right. – brunner Mar 24 '13 at 17:49
  • It also shouldn't make a difference in Arun's code, but it does. – Matthew Lundberg Mar 24 '13 at 17:50
  • @Matthew: You're probably right; however, after putting gc() in the "very middle" loop, the procedure is so time-consuming that it's completely useless. – brunner Mar 25 '13 at 07:54
2

This seems to work (putting the innermost loop into a function). I did not run it to the end because it was too slow, but I did not notice memory inflation like in your code.

require(VGAM)

Median.sum  = vector(mode="numeric", length=75) 
AA.sum      = vector(mode="numeric", length=75)                                                    
BB.sum      = vector(mode="numeric", length=75)                   
Median      = array(0, dim=c(75 ,3)) 
AA          = array(0, dim=c(75 ,3))                                                    
BB          = array(0, dim=c(75 ,3))                              


inner.fun <- function() {
  y.sum     = vector(mode="numeric", length=100000)
  y         = array(0, dim=c(100000,3))
  b.size    = vector(mode="numeric", length=3) 
  c.size    = vector(mode="numeric", length=3) 
  for (i in 1:100000)
    {
      y.sum[i] = 0

      for (f in 1:3)
      {
        b.size[f] = rbinom(1, 30, 0.9)
        c.size[f] = 30 - rbinom(1, 30, 0.9) + 1
        y[i, f] = sum( rlnorm(b.size[f], 8.5, 1.9) ) + 
          sum( rgpd(c.size[f], 120000, 1870000, 0.158) )
        y.sum[i] = y.sum[i] + y[i, f]
      }
    }
    list(y.sum, y)
}

for (h in 1:40)
{
  cat("\nh =", h,"; j = ")
  for (j in 1:75)
  {  
    cat(j," ")
    result = inner.fun()
    y.sum = result[[1]]
    y = result[[2]]
    Median.sum[j] = median(y.sum)
    AA.sum[j] = mean(y.sum)
    BB.sum[j] = quantile(y.sum, probs=0.85)

    for (f in 1:3)
    {
      Median[j,f] = median(y[,f])
      AA[j,f] = mean(y[,f])
      BB[j,f] = quantile(y[,f], probs=0.85)
    }
  }
}
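A plausible reason this helps (my reading, not confirmed against R internals): everything allocated inside `inner.fun` is local to the call, so the large temporaries become unreachable as soon as the function returns and can be reclaimed between `j` iterations. The same pattern in miniature, using the simple example from the question:

```r
fill <- function() {
    mm <- rep(0, 1e4)       # allocated fresh on every call
    for (k in 1:1e4) {
        mm[k] <- k
    }
    mm                      # returned to the caller; other locals die with the call
}

for (i in 1:1e3) {
    for (j in 1:1e3) {
        mm <- fill()        # the previous mm becomes garbage once rebound
    }
}
```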