I'm stuck trying to parse a large XML file into an R data.frame object. The XML has the following structure:

<?xml version="1.0" encoding="ISO-8859-1"?>
<?eclipse version="3.0"?>
<ROOT>
  <row>
    <field name="dtcreated"></field>
    <field name="headline"></field>
    <subheadline/>
    <field name="body"></field>
  </row>
  <row>
    <field name="dtcreated"></field>
    <field name="headline"></field>
    <subheadline/>
    <field name="body"></field>
  </row>
</ROOT>

The plyr convenience functions didn't help, since the XML couldn't be validated. So I came up with the following code, using XPath queries:

library(XML)

# Parse the whole file, then pull each column out with its own XPath query
adHocXml <- xmlTreeParse(adHocXmlPath, getDTD = FALSE)
adHocRoot <- xmlRoot(adHocXml)
creationDateColumn <- sapply(getNodeSet(adHocRoot, "//row//field[@name='dtcreated']"), xmlValue)
headlineColumn <- sapply(getNodeSet(adHocRoot, "//row//field[@name='headline']"), xmlValue)
bodyColumn <- sapply(getNodeSet(adHocRoot, "//row//field[@name='body']"), xmlValue)
adHocData <- data.frame(creationDate = creationDateColumn, headline = headlineColumn, body = bodyColumn)

The code does exactly what I expect for a short file. With a large file containing several thousand row tags, however, I get the following error after about 10 minutes:

Error: 1: internal error: Huge input lookup
2: Extra content at the end of the document 

Can anyone help me?

Oblomov

1 Answer

libxml2 has an upper limit on the size of a single node. You can lift this limit by enabling the parser flag XML_PARSE_HUGE. In the R package XML you would do this as:

library(XML)
xmlParse(myXML, options = HUGE)

You may also want to look at xmlEventParse, which processes the document as a stream of events instead of building the whole tree. Martin Morgan has posted a good example of its use.
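As a rough illustration of the xmlEventParse route, here is a minimal sketch, not Martin Morgan's example. It assumes the row/field layout from the question; the sample document, the helper makeRowCollector, and its getRows accessor are hypothetical names made up for this sketch. The handlers share state through a closure and collect one entry per row.

```r
library(XML)

# Stand-in document shaped like the question's file (contents are made up).
sampleXml <- '<ROOT>
  <row>
    <field name="dtcreated">2014-12-01</field>
    <field name="headline">First</field>
    <subheadline/>
    <field name="body">Body one</field>
  </row>
  <row>
    <field name="dtcreated">2014-12-02</field>
    <field name="headline">Second</field>
    <subheadline/>
    <field name="body">Body two</field>
  </row>
</ROOT>'

# Hypothetical helper: SAX handlers that share state through a closure.
makeRowCollector <- function() {
  rows <- list()           # one named character vector per finished <row>
  current <- character(0)  # fields of the <row> currently being read
  fieldName <- NULL        # name attribute of the open <field>, if any

  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "row") {
        current <<- character(0)
      } else if (name == "field" && "name" %in% names(attrs)) {
        fieldName <<- attrs[["name"]]
        current[[fieldName]] <<- ""
      }
    },
    text = function(content, ...) {
      # Text may arrive in several chunks, so append rather than overwrite.
      if (!is.null(fieldName)) {
        current[[fieldName]] <<- paste0(current[[fieldName]], content)
      }
    },
    endElement = function(name, ...) {
      if (name == "field") {
        fieldName <<- NULL
      } else if (name == "row") {
        rows[[length(rows) + 1]] <<- current
      }
    }
  )
  list(handlers = handlers, getRows = function() rows)
}

collector <- makeRowCollector()
invisible(xmlEventParse(sampleXml, handlers = collector$handlers, asText = TRUE))

# Turn the collected rows into the data.frame layout used in the question.
rows <- collector$getRows()
column <- function(field) vapply(rows, function(r) r[[field]], character(1))
adHocData <- data.frame(creationDate = column("dtcreated"),
                        headline = column("headline"),
                        body = column("body"),
                        stringsAsFactors = FALSE)
```

For a real file you would pass the path and drop asText = TRUE. The trade-off is more bookkeeping in exchange for never holding the full tree in memory.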

jdharrison
  • Thanks for your help. I tried it with adHocXml<-xmlTreeParse(adHocXmlPath,getDTD = FALSE,options = HUGE), but I still run into the same problem. Is this due to the total XML file size (20 MB) or to the size of individual node texts? – Oblomov Dec 12 '14 at 22:47
  • Try `xmlParse` rather than `xmlTreeParse`. Or, if you use `xmlTreeParse`, pass the argument `useInternalNodes = TRUE`. – jdharrison Dec 12 '14 at 22:52
  • xmlParse just returns an empty object. That's why I am using xmlTreeParse, since this was the only method that could cope with my document. – Oblomov Dec 12 '14 at 23:15
  • The additional useInternalNodes = TRUE option solved my problem. Thanks a lot! – Oblomov Dec 12 '14 at 23:19
  • Happy to help. If the answer solves your problem consider marking the question as answered by ticking the box on the answer. – jdharrison Dec 12 '14 at 23:22
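
Putting the thread together, the combination reported to work is xmlTreeParse with both options = HUGE (from the answer) and useInternalNodes = TRUE (from the comments). The following is a minimal sketch of that call applied to the question's code. The small temporary file and the helper fieldValues are stand-ins invented here, and the XPath is narrowed to direct children of row, which assumes the layout shown in the question.

```r
library(XML)

# Stand-in for the real file: a tiny document with the question's layout.
adHocXmlPath <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="ISO-8859-1"?>',
  '<ROOT>',
  '  <row>',
  '    <field name="dtcreated">2014-12-01</field>',
  '    <field name="headline">First</field>',
  '    <subheadline/>',
  '    <field name="body">Body one</field>',
  '  </row>',
  '  <row>',
  '    <field name="dtcreated">2014-12-02</field>',
  '    <field name="headline">Second</field>',
  '    <subheadline/>',
  '    <field name="body">Body two</field>',
  '  </row>',
  '</ROOT>'
), adHocXmlPath)

# Keep the tree as internal (C-level) nodes and lift libxml2's size limits.
adHocXml <- xmlTreeParse(adHocXmlPath, getDTD = FALSE,
                         useInternalNodes = TRUE, options = HUGE)

# Hypothetical helper: text of every <field> with the given name attribute.
fieldValues <- function(doc, fieldName) {
  xpath <- sprintf("//row/field[@name='%s']", fieldName)
  xpathSApply(doc, xpath, xmlValue)
}

adHocData <- data.frame(creationDate = fieldValues(adHocXml, "dtcreated"),
                        headline = fieldValues(adHocXml, "headline"),
                        body = fieldValues(adHocXml, "body"),
                        stringsAsFactors = FALSE)

# Release the C-level document once the values have been copied into R.
free(adHocXml)
```

Note that the three columns only line up if every row contains all three fields; a row with a missing field would shift the later values.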