
I currently have ~20,000 XML files ranging in size from a couple of KB to a few MB. Although it may not be ideal, I am using the xmlTreeParse function from the XML package to loop through the files, extract the text I need from each one, and save the result as a CSV file.

The code below works fine for files <1 MB in size:

library(XML)

files <- list.files()
for (i in files) {
    doc <- xmlTreeParse(i, useInternalNodes = TRUE)
    root <- xmlRoot(doc)

    name <- xmlValue(root[[8]][[1]][[1]]) # Name
    data <- xmlValue(root[[8]][[1]]) # Full text

    # One-row data frame with explicit column names
    x <- data.frame(name = name, data = data)

    # paste0() avoids the space that paste() would insert before ".csv"
    write.csv(x, paste0(i, ".csv"), row.names = FALSE, na = "")
}

The trouble is that any file >1 MB gives me the following error:

Excessive depth in document: 256 use XML_PARSE_HUGE option
Extra content at the end of the document
Error: 1: Excessive depth in document: 256 use XML_PARSE_HUGE option
2: Extra content at the end of the document

Please forgive my ignorance, but I have searched for "XML_PARSE_HUGE" in the XML package and can't seem to find it. Has anyone had any experience using it? If so, I would greatly appreciate any advice on how to get this code to handle slightly larger XML files.

Thanks!

Entropy
  • try `xmlTreeParse(options = HUGE)` – user1609452 Jun 17 '13 at 18:59
  • Worked brilliantly -- thanks very much! – Entropy Jun 17 '13 at 20:50
  • if this answers your question please consider marking the question as answered. stackoverflow.com/help/accepted-answer – user1609452 Jul 14 '13 at 04:12
  • Thanks -- sorry for not figuring out how to do that earlier. – Entropy Jul 14 '13 at 15:57
  • Just wondering if you have experienced any memory leaks with files of this size? I am trying to parse in XML of >10Meg and no amount of free(), rm(), gc() on the XML document after I'm done with it, releases the (hundreds of megs of) memory to the O/S (this is Windows 7 64bit). – Matthew Wise Mar 03 '15 at 15:53

1 Answer

"XML_PARSE_HUGE" is a parser option, not a function, so you enable it through the options argument. XML:::parserOptions lists the available choices:

> XML:::parserOptions
   RECOVER      NOENT    DTDLOAD    DTDATTR   DTDVALID    NOERROR  NOWARNING 
         1          2          4          8         16         32         64 
  PEDANTIC   NOBLANKS       SAX1   XINCLUDE      NONET     NODICT    NSCLEAN 
       128        256        512       1024       2048       4096       8192 
   NOCDATA NOXINCNODE    COMPACT      OLD10  NOBASEFIX       HUGE     OLDSAX 
     16384      32768      65536     131072     262144     524288    1048576 

Each option is also available as a constant of the same name, for example:

> HUGE
[1] 524288

It is sufficient to pass a vector of integers containing any of these options. In your case:

xmlTreeParse(i, useInternalNodes = TRUE, options = HUGE)
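
The values in the listing are distinct powers of two, so each option sits on its own bit and several can be combined. The following is a minimal base-R sketch of that arithmetic, not part of the XML package itself: the variable names (parser_flags, combined, and so on) are made up for illustration, the two values are copied by hand from the listing, and the final commented-out call only assumes that options accepts a vector as described above.

```r
# Flag values copied by hand from the XML:::parserOptions listing.
parser_flags <- c(NOBLANKS = 256L, HUGE = 524288L)

# A power of two has exactly one bit set, so x AND (x - 1) is zero.
is_power_of_two <- bitwAnd(parser_flags, parser_flags - 1L) == 0L

# OR-ing the flags yields a single integer that carries both options.
combined <- Reduce(bitwOr, parser_flags)

# For distinct flags the bitwise OR equals the plain sum.
same_as_sum <- combined == sum(parser_flags)

# Assumed usage with the XML package (not run here):
# doc <- xmlTreeParse(i, useInternalNodes = TRUE, options = c(HUGE, NOBLANKS))
```

Adding works only while every flag appears once; summing the same flag twice would spill into a neighbouring bit, whereas a bitwise OR stays correct.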
user1609452
  • Can you add these decimal numbers together (since it appears they are just bit locations in a binary number)? – IRTFM Jul 15 '15 at 00:34