I've read the question "Parse XML Files (>1 megabyte) in R", but its answer seems to apply only to the original XML package in R. How do I set the `XML_PARSE_HUGE` option in xml2?

Here is the code I'm running:

library(xml2)
library(magrittr)

rawXML <- read_xml(xmlFile)
emails <- xml_find_all(rawXML, "//header")

emailElements <- sapply(seq_along(emails), function(idx) {
  # Attributes of the idx-th <header> node, as a named character vector
  attrs <- xml_attrs(emails[[idx]])
  index <- attrs['index']
  sender <- attrs['from']
  ...
  contentLink <- attrs['contentLink'] # path to a *.html file
  # Parse the linked HTML file and extract its text
  rawContentText <- read_html(contentLink)
  content <- xml_text(rawContentText)
  ...
  v <- c(index, date, sender, subject, headerLink, rawLink, contentLink, content, attachmentLink)
  return(v)
})

Here is the error I get:

Error: Excessive depth in document: 256 use XML_PARSE_HUGE option [1]

Thanks in advance.

tblznbits
  • I don't think this is currently implemented in `xml2`. Try opening an issue on github. – Jeroen Ooms Jul 15 '15 at 07:45
  • If you need a quick fix, I forked `xml2` and forced the `XML_PARSE_HUGE` option: https://github.com/shabbychef/xml2 . I also put it in my drat store, so you can `drat:::add("shabbychef"); install.packages('xml2')` – shabbychef Oct 27 '15 at 04:56
  • @shabbychef Thanks! When I was working through the issue at the time, I came to realize that the issue was that the XML was malformed. By passing the `XML_PARSE_HUGE` option, I was able to overcome the problem, but that's a dangerous approach given that malicious code can live inside the XML. I'm not an expert on the matter by any means, but I'd have to believe Hadley had a reason for not implementing it in the package. With that being said, I do appreciate the option of dealing with large XML files. So thanks again! – tblznbits Oct 27 '15 at 12:57
  • @brittenb I believe Hadley made the right decision, as passing the option on through subfunctions clutters the API. However, in some cases, like mine, a huge (but valid) HTML file runs against the parser limits. – shabbychef Oct 27 '15 at 17:37
  • @brittenb It's unlikely you still have this issue, but [`htmltidy`](https://github.com/hrbrmstr/htmltidy) (use the GH version, as I need to do a CRAN push of it soon) may help, as it fixes "broken" HTML (or does its best to) so it's more parseable. – hrbrmstr Sep 24 '16 at 09:41
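
For later readers: more recent xml2 releases (1.0.0 and later, as far as I know) expose libxml2 parser options through an `options` argument on `read_xml()`, where `"HUGE"` corresponds to `XML_PARSE_HUGE`. The following is only a minimal sketch, assuming such a version is installed. The nested test document and the `depth` value are made up for illustration and are not taken from the question's data.

```r
library(xml2)

# Hypothetical test input: one <a> element nested deeper than the
# 256-level limit named in the error message above.
depth <- 300
deepXML <- paste0(
  strrep("<a>", depth),
  "x",
  strrep("</a>", depth)
)

# "HUGE" asks libxml2 to relax its hard-coded parser limits
# (XML_PARSE_HUGE). Assumes an xml2 version whose read_xml()
# accepts an `options` argument.
doc <- read_xml(deepXML, options = "HUGE")

# Name of the root element, as a quick check that parsing succeeded.
rootName <- xml_name(doc)
```

`read_html()` takes the same `options` argument in those versions, so the `read_html(contentLink)` call in the question could presumably be given `"HUGE"` as well, alongside whichever default options you want to keep. The caveat raised in the comments still applies: relaxing parser limits on untrusted input carries some risk.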
