
I am trying to run a simple program to extract tables from HTML code. However, there seems to be a memory issue with readHTMLTable in the XML package. Is there any way to work around this easily, for example by somehow giving this command its own chunk of memory and then freeing it manually?

I have tried putting this in a function, calling gc(), and using different versions of R and of this package, but nothing seems to work. I am starting to get desperate.

Example code: how can I run this without the memory usage exploding?

library(XML)
a = readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
while(TRUE) {
    b = readHTMLTable(a)
    #do something with b
}

Edit: Something like this still takes all of my memory:

library(XML)
a = readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
f <- function(x) {
    b = readHTMLTable(x)
    rm(x)
    gc()
    return(b)
}

for(i in 1:100) {
    d = f(a)
    rm(d)
    gc()
}
rm(list=ls())
gc()

I am using Windows 7 and have tried both the 32-bit and 64-bit versions.

Pekka
  • I've had serious memory issues using the `XML` package on Windows. My solution is to periodically restart R (saving the data to CSV). I emailed the package author. We exchanged some emails, but he basically said he can't/won't debug Windows. – rrs Jun 30 '14 at 23:12
  • OK. Restarting R works, but it's not a nice manual job to have to do every five minutes. I guess the only way to go is to switch to Linux. XML is a very cool package, but sadly it is crippled by these memory issues. – Pekka Jul 01 '14 at 07:25
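For what it's worth, a minimal sketch of the restart workaround rrs describes in the comments (the CSV file name is a placeholder): each R session processes one batch, checkpoints the extracted tables to CSV, and exits so the OS reclaims everything.

library(XML)
a <- readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
b <- readHTMLTable(a)
# checkpoint the extracted tables before exiting
write.csv(b[[1]], "tables_batch1.csv", row.names = FALSE)
quit(save = "no")  # restarting R releases whatever the XML package leaked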

3 Answers


As of XML 3.98-1.4 and R 3.1 on Windows 7, this problem can be solved by calling free() on the parsed document. Note that free() cannot be applied to the result of readHTMLTable(), which returns data frames rather than the underlying document. The following code runs without the memory blowing up.

library(XML)
a = readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
while(TRUE){
   # htmlParse() copes with real-world HTML; asText = TRUE marks the
   # input as page content rather than a file name
   b = htmlParse(paste(a, collapse = "\n"), asText = TRUE)
   #do something with b, e.g. readHTMLTable(b)
   free(b)   # release the C-level document explicitly
}

The xml2 package has similar issues; there, the memory can be released by removing references to the nodes with xml_remove() and then calling gc().
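For completeness, a rough, untested sketch of the xml2 route (the //table XPath is just an example):

library(xml2)
doc <- read_html("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
tables <- xml_find_all(doc, "//table")
table_text <- xml_text(tables)  # copy what is needed into plain R objects
xml_remove(tables)              # detach the nodes from the tree
rm(tables, doc)
gc()                            # let the finalizers release the memory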

Peter
  • Does not fix the issue for me. If you extract any part of the document, the whole object remains in memory. Where you have a while(TRUE), I have an lapply that extracts bits of the original document and returns character values. – Karl Mar 14 '18 at 14:46

I had a lot of problems with memory leaks in the XML package too (under both Windows and Linux), but the way I eventually solved it was to remove the object at the end of each processing step, i.e. to add an rm(b) and a gc() at the end of each iteration. Let me know if this works for you too.

Tom Wenseleers
  • Doesn't help inside the loop or after the loop. – Pekka Jul 07 '14 at 07:44
  • Well, what I did was put the XML processing in a function XMLproc that returned output out, and in the function calling XMLproc I then added rm(out) and gc(); a sketch of this pattern follows these comments. Could you check whether that works for you? – Tom Wenseleers Jul 07 '14 at 08:49
  • I didn't get anything like this to work. I have added what I tried to the original question. – Pekka Jul 12 '14 at 17:48
  • Ha, strange - I'm using Windows too, with XML_3.98-1.1, and your second edited piece of code runs fine for me without any memory leaks... This XML package seems quite flaky at the moment, unfortunately... – Tom Wenseleers Jul 14 '14 at 16:27
  • I am running the same version of the XML package. Did you check in the Windows Task Manager how much memory R is using after running that second piece of code? The statistics after the last gc() won't show the "hidden" consumed memory, but when I check the Task Manager it shows ~500 MB in use. Anyway, thanks for trying to help. – Pekka Jul 14 '14 at 18:29
  • Hi - yes, I checked that: the memory consumption did not go up during the loop, whereas when I experienced the leak before, it used up an extra 2 MB per iteration. So it looked OK to me! – Tom Wenseleers Jul 14 '14 at 20:20
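A minimal sketch of the XMLproc pattern discussed in these comments (XMLproc is just an illustrative name, and htmlParse() is used so that free() can be applied to the document before returning):

library(XML)

XMLproc <- function(html) {
    doc <- htmlParse(paste(html, collapse = "\n"), asText = TRUE)
    out <- readHTMLTable(doc)  # extract the tables as plain data frames
    free(doc)                  # release the C-level document
    out
}

a <- readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
for (i in 1:100) {
    out <- XMLproc(a)
    # ...use out...
    rm(out)
    gc()  # collect after each iteration, as suggested above
}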

Same problem here: even when doing nothing more than reading in the document with doc <- xmlParse(...); root <- xmlRoot(doc), the memory allocated to doc is never released to the OS (as monitored in the Windows Task Manager).

A crazy idea we might try is to use system("Rscript ...") to perform the XML parsing in a separate R session, saving the parsed result to a file which we then read back into the main R session. Hacky, but it would at least ensure that whatever memory is gobbled up by the XML parsing is released when the Rscript session terminates and doesn't affect the main process!
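For what it's worth, a rough sketch of that idea, with placeholder file names (parse.R, tables.rds); the child script reuses the readHTMLTable() call from the question, and its output is plain data frames, so it serializes cleanly:

# write a throwaway child script that does the leaky parsing
parse_script <- '
library(XML)
a <- readLines("http://en.wikipedia.org/wiki/2014_FIFA_World_Cup")
b <- readHTMLTable(a)
saveRDS(b, "tables.rds")   # hand the result back via a file
'
writeLines(parse_script, "parse.R")

system("Rscript parse.R")    # all XML memory dies with that process
b <- readRDS("tables.rds")   # load the tables into the main session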

Matthew Wise
  • I solved this by moving to Python and BeautifulSoup a long time ago :) Anyway, I appreciate your help. I will try your solution soon. – Pekka Mar 06 '15 at 07:27
  • Also see the answer I posted on http://stackoverflow.com/questions/23696391/memory-leak-when-using-package-xml-on-windows/ re using the new xml2 library instead (I posted it as an answer to this question too, but someone deleted it). – Matthew Wise Mar 06 '15 at 10:43
  • This solution is working for me today. It's not efficient, but it lets me use my legacy systems on thousands of files. – Karl Mar 14 '18 at 14:33