
Recently a user of my rNOMADS package in R began getting unexpected errors:

Error: Excessive depth in document: 256 use XML_PARSE_HUGE option

We tracked the issue down to this command:

html.tmp <- xml2::read_html("http://nomads.ncep.noaa.gov/cgi-bin/filter_rap.pl?dir=%2Frap.20151120")

Upon following the link, the page to be parsed appears no larger than other pages that work fine, and well under the 1-megabyte threshold beyond which the XML_PARSE_HUGE option would normally be required. Furthermore,

xml2::read_html

actually has no XML_PARSE_HUGE option anyway. The only other potential solution, described here, is not appropriate for an official R package.

What is the cause of this error, and is it possible to resolve it without resorting to solutions outside the official CRAN repository?

  • Yes, I know. The behavior is really hard to understand. I use the GFS model all the time, and I've never had this issue. – glossarch Nov 20 '15 at 14:32
  • `XML_PARSE_HUGE` lifts several limitations. The one you're running into here is the maximum depth of the document tree which is limited to 256 by default. It doesn't take a large document to have more than 256 nested elements. – nwellnhof Nov 21 '15 at 14:19
  • @nwellnhof how is this option set? The function read_html does not have it as an input and I am unclear how to change that. Thanks. – glossarch Nov 23 '15 at 20:01
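
As nwellnhof's comment points out, the limit being hit here is libxml2's default maximum tree depth of 256, not a size limit. For reference, a minimal sketch that reproduces the error with a small but deeply nested document (the 300-level depth is an arbitrary value past the cap):

library(xml2)

# Build a tiny HTML fragment nested 300 <div> levels deep --
# past libxml2's default depth limit of 256 -- and parse it.
open.tags  <- paste(rep("<div>", 300), collapse = "")
close.tags <- paste(rep("</div>", 300), collapse = "")
try(read_html(paste0(open.tags, "x", close.tags)))
# Should fail with something like:
# Error: Excessive depth in document: 256 use XML_PARSE_HUGE option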

1 Answer


The best I can do so far is to install shabbychef's forked version of xml2, which forces the XML_PARSE_HUGE option. You can install that version of xml2 via

library(drat)
drat::addRepo("shabbychef")  # add shabbychef's drat repository
install.packages("xml2")     # installs the forked xml2 from that repo

For the time being, please use this workaround if you encounter XML_PARSE_HUGE errors in rNOMADS.
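
Update for later readers: more recent releases of xml2 added an options argument to read_html() and read_xml() that exposes libxml2 parser flags, including "HUGE". If your installed xml2 has that argument, it should avoid the forked package entirely. A sketch, assuming such a version:

library(xml2)

# "HUGE" maps to libxml2's XML_PARSE_HUGE, which relaxes the
# parser's hard-coded limits, including the 256-level depth cap.
html.tmp <- read_html(
  "http://nomads.ncep.noaa.gov/cgi-bin/filter_rap.pl?dir=%2Frap.20151120",
  options = c("RECOVER", "NOERROR", "NOBLANKS", "HUGE")
)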
