Parse XML Files (>1 megabyte) in R

Question

Currently I have ~20,000 XML files that range in size from a couple of KB to a few MB. Although it may not be ideal, I am using the "xmlTreeParse" function in the XML package to loop through each of the files and extract the text that I need and save the document as a csv file.

The code below works fine for files <1 MB in size:

library(XML)

files <- list.files()
for (i in files) {
    doc <- xmlTreeParse(i, useInternalNodes = TRUE)
    root <- xmlRoot(doc)

    name <- xmlValue(root[[8]][[1]][[1]]) # Name
    data <- xmlValue(root[[8]][[1]]) # Full text

    x <- data.frame(c(name))
    x$data <- data

    # paste0() avoids the space that paste() would insert before ".csv"
    write.csv(x, paste0(i, ".csv"), row.names=FALSE, na="")
}

The trouble is that any file >1 MB gives me the following error:

Excessive depth in document: 256 use XML_PARSE_HUGE option
Extra content at the end of the document
Error: 1: Excessive depth in document: 256 use XML_PARSE_HUGE option
2: Extra content at the end of the document

Please forgive my ignorance, but I have searched the XML package for "XML_PARSE_HUGE" and can't find it. Has anyone had any experience using this option? If so, I would greatly appreciate any advice on how to get this code to handle slightly larger XML files.

Thanks!

If this answers your question, please consider marking the question as answered. stackoverflow.com/help/accepted-answer — user1609452, Jul 14 '13 at 04:12
Thanks -- sorry for not figuring out how to do that earlier. — Entropy, Jul 14 '13 at 15:57
Just wondering if you have experienced any memory leaks with files of this size? I am trying to parse in XML of >10Meg and no amount of free(), rm(), gc() on the XML document after I'm done with it, releases the (hundreds of megs of) memory to the O/S (this is Windows 7 64bit). — Matthew Wise, Mar 03 '15 at 15:53

score 2 · Accepted Answer · answered Jun 18 '13 at 02:40

To enable "XML_PARSE_HUGE" you need to pass it through the options argument. XML:::parserOptions lists the available choices:

> XML:::parserOptions
   RECOVER      NOENT    DTDLOAD    DTDATTR   DTDVALID    NOERROR  NOWARNING 
         1          2          4          8         16         32         64 
  PEDANTIC   NOBLANKS       SAX1   XINCLUDE      NONET     NODICT    NSCLEAN 
       128        256        512       1024       2048       4096       8192 
   NOCDATA NOXINCNODE    COMPACT      OLD10  NOBASEFIX       HUGE     OLDSAX 
     16384      32768      65536     131072     262144     524288    1048576

For example:

> HUGE
[1] 524288

It is sufficient to pass an integer vector containing any of these options. In your case:

xmlTreeParse(i, useInternalNodes = TRUE, options = HUGE)
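
As a rough sketch of how more than one option might be combined: the values in the table are distinct powers of two, so a combined setting is their bitwise OR, which for distinct flags equals their sum. The constants below are local copies of two table entries rather than the package's own objects, and the 300-level nested document is a made-up test input. The parsing step is guarded so it only runs if the XML package is installed.

```r
# Values copied from the XML:::parserOptions table above (local copies,
# so this sketch runs even without the XML package attached).
HUGE     <- 524288L
NOBLANKS <- 256L

# Each option occupies a single bit, so options combine with a bitwise OR.
combined <- bitwOr(HUGE, NOBLANKS)

# For distinct flags the bitwise OR equals the plain sum.
stopifnot(combined == HUGE + NOBLANKS)

# OR-ing a flag in a second time changes nothing (adding it again would).
stopifnot(bitwOr(combined, HUGE) == combined)

# Optional: try the combined value on a hypothetical deeply nested document.
if (requireNamespace("XML", quietly = TRUE)) {
    deep <- paste0(strrep("<a>", 300), "x", strrep("</a>", 300))
    doc <- XML::xmlTreeParse(deep, asText = TRUE, useInternalNodes = TRUE,
                             options = combined)
    print(XML::xmlName(XML::xmlRoot(doc)))
}
```

Passing one pre-combined integer sidesteps the question of how a multi-element options vector is merged internally. Using bitwOr() rather than + is the safer habit, since accidentally listing the same flag twice then has no effect.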

Can you add these decimal numbers together (since it appears they are just bit locations in a binary number)? — IRTFM, Jul 15 '15 at 00:34
