
I am trying to parse content out of XML files (more than 200,000 files, 800 MB in total) using the XML package in R and save it to text files for further processing. However, my laptop has only 4 GB of RAM and the R session always crashes while doing this. My code is as follows; I have tried using `ldply()`, `rm()`, and `gc()` after `rm()`, yet the memory problem persists. Can somebody point out my problem? Thank you very much!

library(XML)   # for xmlTreeParse() / xpathApply()
library(plyr)  # for ldply()

# read the file names
file_list = list.files()

parseXml = function(filename) {
  data = xmlTreeParse(filename, useInternalNodes = T)
  terms = xpathApply(data, "//mesh_term", xmlValue)

  # build the result outside the loop so earlier rows are not overwritten
  tmp = data.frame("nct_id" = character(), "mesh_term" = character(),
                   stringsAsFactors = F)

  # skip those trials without any mesh_term
  if (length(terms) > 0) {
    for (i in seq_along(terms)) {
      tmp[i, 1] = xpathApply(data, "//nct_id", xmlValue)[[1]]
      tmp[i, 2] = terms[[i]]
    }
  }
  return(tmp)
  rm(tmp)
  gc()
}

# chop file_list into 1000 chunks and iterate over them;
# I assumed this would save some memory (but it did not help)
n = 1000
# cut() assigns every file to exactly one chunk, even when
# length(file_list) is not a multiple of n
chunks = split(file_list, cut(seq_along(file_list), n, labels = FALSE))
for (i in 1:n) {
  trialMesh = ldply(chunks[[i]], parseXml)
  write.table(trialMesh, paste0("mypath/trialMesh_", i, ".txt"), sep = "|",
              eol = "\n", quote = F, row.names = F, col.names = T)
  rm(trialMesh)
  gc()
}
  • Not that this will necessarily fix the problem, but your `rm(tmp)` and `gc()` statements are not being executed, since they occur after a `return` statement. – nrussell Nov 02 '15 at 20:52
  • Have you tried xml2? – Jack Wasey Nov 02 '15 at 21:03
  • You are reading the `xml` files with the `useInternalNodes = TRUE` argument, and this just creates a reference to a `C` object whose memory you have to manually release. Call `free(data)` _before_ the `return` statement in your function and see if it helps. Alternatively, set the above argument to `FALSE` and let the `R` garbage collector do the job. See `?XML::free`. – nicola Nov 02 '15 at 21:17
  • @nrussell Thanks for the reminder! I didn't notice that! – ckbjimmy Nov 03 '15 at 14:49
  • @JackWasey Haven't tried it yet. Will update later when I do. Thanks! – ckbjimmy Nov 03 '15 at 14:50
  • @nicola Thanks so much! `free(data)` didn't work in my case; I even added `rm(data)`, `free(data)`, then `gc()` before `return(tmp)`, but the memory leak still existed. It looks like I've run into the same problem as [this previous question](http://stackoverflow.com/questions/9220849/serious-memory-leak-when-iteratively-parsing-xml-files) that I just found. However, the final solution there didn't work for me either. As you say, setting `useInternalNodes = FALSE` turns it into an `R` object, yet I couldn't use `xmlValue(data$doc$...)` to extract the items; I'm still looking for the reason. – ckbjimmy Nov 03 '15 at 14:56
  • @nicola Finally I parsed out the content with `free(data)` before `return(tmp)` on my Mac instead of my Windows desktop. I actually don't know why it worked on the Mac, since it still leaked memory (just not as much!?). – ckbjimmy Nov 04 '15 at 00:19
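For reference, here is a minimal sketch of the pattern nicola suggests in the comments above: release the C-level document with `XML::free()` before returning, so each file's memory is reclaimed as soon as it has been parsed. The function name `parseXmlFree`, the `terms` variable, and the vectorized data-frame construction are illustrative choices, not part of the original code.

library(XML)

parseXmlFree = function(filename) {
  data = xmlTreeParse(filename, useInternalNodes = T)
  terms = xpathSApply(data, "//mesh_term", xmlValue)
  tmp = data.frame("nct_id" = character(), "mesh_term" = character(),
                   stringsAsFactors = F)
  if (length(terms) > 0) {
    # the single nct_id is recycled across all mesh_term rows
    tmp = data.frame("nct_id" = xpathSApply(data, "//nct_id", xmlValue)[[1]],
                     "mesh_term" = terms, stringsAsFactors = F)
  }
  free(data)  # release the C-level document before returning
  return(tmp)
}

And here is a rough equivalent of Jack Wasey's xml2 suggestion, again only a sketch with an illustrative function name: xml2 documents are external pointers with finalizers, so their C-level memory is released by R's garbage collector without a manual `free()`.

library(xml2)

parseXml2 = function(filename) {
  doc = read_xml(filename)
  terms = xml_text(xml_find_all(doc, "//mesh_term"))
  # skip those trials without any mesh_term
  if (length(terms) == 0) {
    return(data.frame("nct_id" = character(), "mesh_term" = character(),
                      stringsAsFactors = F))
  }
  data.frame("nct_id" = xml_text(xml_find_first(doc, "//nct_id")),
             "mesh_term" = terms, stringsAsFactors = F)
}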

0 Answers