I'm stuck trying to parse a large XML file into an R data.frame. The XML has the following structure:
<?xml version="1.0" encoding="ISO-8859-1"?>
<?eclipse version="3.0"?>
<ROOT>
<row>
<field name="dtcreated"></field>
<field name="headline"></field>
<subheadline/>
<field name="body"></field>
</row>
<row>
<field name="dtcreated"></field>
<field name="headline"></field>
<subheadline/>
<field name="body"></field>
</row>
</ROOT>
The plyr convenience functions didn't help, since the XML couldn't be validated. So I came up with the following code, using XPath queries:
library(XML)
# Parse the file (without fetching the DTD) and get the root node
adHocXml <- xmlTreeParse(adHocXmlPath, getDTD = FALSE)
adHocRoot <- xmlRoot(adHocXml)
# Extract one column per field name via XPath
creationDateColumn <- sapply(getNodeSet(adHocRoot, "//row//field[@name='dtcreated']"), xmlValue)
headlineColumn <- sapply(getNodeSet(adHocRoot, "//row//field[@name='headline']"), xmlValue)
bodyColumn <- sapply(getNodeSet(adHocRoot, "//row//field[@name='body']"), xmlValue)
# Combine the columns into a data.frame
adHocData <- data.frame(creationDate = creationDateColumn, headline = headlineColumn, body = bodyColumn)
The code does exactly what I expect for a short file. With a large file containing several thousand row elements, however, I get the following error after about 10 minutes:
Error: 1: internal error: Huge input lookup
2: Extra content at the end of the document
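In case it helps to reproduce the short-file case, here is a minimal self-contained sketch. The field values, the `sampleXml` string and the `getField` helper are made up for illustration. It also parses from a string with `xmlParse(asText = TRUE)` and leaves out the XML prolog, so it is not identical to the code I run on the real file.

```r
library(XML)

# Made-up sample in the same layout as my file (values are illustrative only)
sampleXml <- '<ROOT>
<row>
<field name="dtcreated">2011-01-01</field>
<field name="headline">First headline</field>
<subheadline/>
<field name="body">First body</field>
</row>
<row>
<field name="dtcreated">2011-01-02</field>
<field name="headline">Second headline</field>
<subheadline/>
<field name="body">Second body</field>
</row>
</ROOT>'

# Parse the string into an internal document so XPath can be applied directly
doc <- xmlParse(sampleXml, asText = TRUE)

# Hypothetical helper: text of every <field> with the given name attribute
getField <- function(name) {
  xpathSApply(doc, sprintf("//row/field[@name='%s']", name), xmlValue)
}

sampleData <- data.frame(
  creationDate = getField("dtcreated"),
  headline = getField("headline"),
  body = getField("body"),
  stringsAsFactors = FALSE
)
print(sampleData)
```

On this small input I get one data.frame row per row element, which is the result I want for the large file as well.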
Can anyone help me?