
I have a double loop like the one shown below. The problem is that R (2.15.2) is using more and more memory and I do not understand why.

While I understand that this has to happen within the inner loop because of the rbind() I am doing there, I do not understand why R keeps grabbing memory when a new cycle of the outer loop starts, even though the objects ('xmlCatcher') are reused:

# !!!BEWARE this example creates a lot of files (n=1000)!!!!

require(XML)

chunk <- function(x, chunksize){
        # source: http://stackoverflow.com/a/3321659/1144966
        x2 <- seq_along(x)
        split(x, ceiling(x2/chunksize))
    }

chunky <- chunk(paste("test",1:1000,".xml",sep=""),100)

for(i in 1:1000){
    writeLines(
        c(paste('<?xml version="1.0"?>\n <note>\n    <to>Tove</to>\n    <nr>', i,
                '</nr>\n    <from>Jani</from>\n    <heading>Reminder</heading>\n    ', sep=""),
          rep('<body>Do not forget me this weekend!</body>\n', sample(1:10, 1)),
          ' </note>'),
        paste("test", i, ".xml", sep=""))
}

for(k in 1:length(chunky)){
    gc()
    print(chunky[[k]])
    xmlCatcher <- NULL

    for(i in 1:length(chunky[[k]])){
        filename    <- chunky[[k]][i]
        xml         <- xmlTreeParse(filename)
        xml         <- xmlRoot(xml)
        result      <- sapply(getNodeSet(xml,"//body"), xmlValue)
        id          <- sapply(getNodeSet(xml,"//nr"), xmlValue)
        dummy       <- cbind(id,result)
        xmlCatcher  <- rbind(xmlCatcher,dummy)
    }
    save(xmlCatcher, file=paste("xmlCatcher",k,".RData",sep=""))
}

Does somebody have an idea why this behaviour occurs? Note that all the objects (like 'xmlCatcher') are reused in every cycle, so I would assume that the RAM used should stay about the same after the first 'chunk' cycle.

  • Garbage collection does not change a thing.
  • Not using rbind does not change a thing.
  • Using fewer XML functions actually results in less memory being grabbed - but why?

Is this a bug or am I missing something?

petermeissner
  • Generally, it is not good practice to use `rbind` inside loops. I recommend creating an object with the length you need and overwriting its values (by indexing) instead. – Sven Hohenstein Dec 21 '12 at 08:56
  • The line `xmlCatcher <- rbind(xmlCatcher, dummy)` means you increase the size of `xmlCatcher` in each iteration, hence the increase in memory. – Sacha Epskamp Dec 21 '12 at 08:56
  • @SvenHohenstein I would do as suggested but the results of the 'real' loop might differ in length, so I do not know beforehand how long the result will be. – petermeissner Dec 21 '12 at 09:30
  • @Sacha Yes and no: I do increase it within the inner loop, but then I set it to NULL again in the outer loop; still, the memory usage increases instead of being reset. – petermeissner Dec 21 '12 at 09:32
  • @Sven, why is it a bad idea to use `rbind` inside a loop? Because it produces the behaviour described? – petermeissner Dec 21 '12 at 09:52
  • Yes, you set the variable to NULL, but the garbage collector is probably not invoked immediately. GC triggers automatically when needed (e.g. when free memory is low), but if you want you can force it using gc(). – digEmAll Dec 21 '12 at 09:58
  • Does that example code actually exhibit that behaviour? I'm worried by your 'I load something' note - are you using `load()`? – Spacedman Dec 21 '12 at 10:00
  • It is not a good idea to let objects grow inside a loop because every time you run `rbind` a new object is created. Of course, afterwards the memory assigned to the old object could be released. But a further problem is that it takes quite a long time. – Sven Hohenstein Dec 21 '12 at 10:00
  • As a side note, you should avoid statements such as `for(k in 1:length(chunk))`. If the length of `chunk` is zero, this becomes `1:0`, _i.e._ `c(1, 0)`, although it is intended to be `NULL`. Better use `for (k in seq_along(chunk))` (see the short example after these comments). – QkuCeHBH Dec 21 '12 at 11:18
  • ... updated the example to be a real working although not really minimal one! – petermeissner Dec 21 '12 at 14:49
  • @digEmAll : I added a gc(), it does not help. – petermeissner Dec 21 '12 at 15:12
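
A minimal sketch of the seq_along point from the comments (using a hypothetical empty list just to illustrate the pitfall):

chunk0 <- list()                       # an empty list
for (k in 1:length(chunk0)) print(k)   # 1:0 is c(1, 0), so the body runs twice
for (k in seq_along(chunk0)) print(k)  # seq_along(chunk0) is integer(0), so the body never runs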

3 Answers


Your understanding of reusing memory is wrong.

When you create the new xmlCatcher, the old one is no longer referenced and becomes a candidate for garbage collection, which will happen at some point.

You are not reusing memory; you are creating a new object and abandoning the old one.

Garbage collection will free the memory.

Also, I suggest you look at Rprofmem to profile your memory use.
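
A minimal sketch of how Rprofmem could be used here (the output file name is just an example, and Rprofmem only records allocations if R was compiled with memory profiling enabled):

Rprofmem("alloc.out", threshold = 10000)   # log every allocation larger than ~10 kB
# ... run one chunk of the loop ...
Rprofmem(NULL)                             # stop profiling
readLines("alloc.out", n = 10)             # each line shows the bytes allocated and the call stack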

Romain Francois
  • Garbage collection with gc() actually does not help at all. I put it in my updated example; nothing changes. Regarding the idea with Rprofmem, I do not understand how to make use of it. It works, but the numbers are a mystery to me. – petermeissner Dec 21 '12 at 15:11
  • You create bigger and bigger objects, so why do you expect the memory used not to grow? – Romain Francois Dec 21 '12 at 19:11
  • Actually, I create objects that grow and then I set them to `NULL` again to let them grow again. Therefore, I would expect memory usage to grow until the end of the first inner loop and from then on to stay nearly the same, because 1) R does not give memory back when it suspects it will be needed anyway, so I would expect R to keep the RAM once grabbed unless it is needed elsewhere; 2) the objects set to `NULL` are refilled with approximately the same amount of data, so I would expect R not to grab further RAM. – petermeissner Dec 21 '12 at 19:41
  • Try having your call to `gc()` after the `xmlCatcher <- NULL` line. – Romain Francois Dec 22 '12 at 08:03

Chapter 2 of this book (Burns' The R Inferno) describes growing an object with rbind inside a loop as a common way of being a memory glutton.

You can avoid the use of rbind inside the loop:

my.list <- vector('list', length(chunky[[k]]))   # pre-allocate one slot per file in the chunk
for(i in seq_along(chunky[[k]])) {
   # build 'dummy' for file i as in the question's inner loop
   my.list[[i]] <- data.frame(dummy)
}
xmlCatcher <- do.call('rbind', my.list)
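
This way each piece is created only once and the combined object is built a single time after the loop, instead of `rbind` copying the whole growing object on every iteration.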
agstudy
  • Plus 1 just for referencing Burns' book. One of my favorite references. – Carl Witthoft Dec 21 '12 at 12:38
  • -1 if I had enough reputation: it does not help at all; I tried it and the memory usage grows in the same way: – petermeissner Dec 21 '12 at 15:06

    my.list <- vector('list', length(chunky))
    for(k in 1:length(chunky)){
        gc()
        print(chunky[[k]])
        for(i in 1:length(chunky[[k]])){
            filename <- chunky[[k]][i]
            xml      <- xmlTreeParse(filename)
            xml      <- xmlRoot(xml)
            result   <- sapply(getNodeSet(xml,"//body"), xmlValue)
            id       <- sapply(getNodeSet(xml,"//nr"), xmlValue)
            dummy    <- cbind(id,result)
            my.list[[k]][[i]] <- data.frame(id,result)
        }
    }

It's the XML package, stupid!

The answer to this question came from Milan Bouchet-Valat here, who proposed I try the useInternalNodes=TRUE option of xmlTreeParse. That stopped the RAM grabbing, although it is also possible to handle the memory freeing manually. For further reading see here.
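
A sketch of what that change might look like in the inner loop (assuming the rest of the question's code stays the same; with useInternalNodes = TRUE the document is kept as C-level nodes, and the XML package's free() can be used to release it explicitly):

xml     <- xmlTreeParse(filename, useInternalNodes = TRUE)  # parse to internal (C-level) nodes
result  <- sapply(getNodeSet(xml, "//body"), xmlValue)
id      <- sapply(getNodeSet(xml, "//nr"), xmlValue)
free(xml)                                                   # release the internal document when done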

petermeissner