Great answers above! For future readers, anytime you face a complex XML needing R import, consider re-structuring the XML document using XSLT (a special-purpose declarative programming language that manipulates XML content into various end-use needs). Then simply use R's xmlToDataFrame()
function from XML package.
Unfortunately, R does not have a dedicated XSLT package available on CRAN-R across all operating systems. The listed SXLT seems to be a Linux package and not able to be used on Windows. See unanswered SO questions here and here. I understand @hrbrmstr (above) maintains a GitHub XSLT project. Nonetheless, nearly all general-purpose languages maintain XSLT processors including Java, C#, Python, PHP, Perl, and VB.
Below is the open-source Python route and because the XML document is pretty nuanced, two XSLTs are being used (of course XSLT gurus can combine them into one but tried as I might couldn't get it to work.
FIRST XSLT (using a recursive template)
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<!-- Identity Transform -->
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="record/text()" name="tokenize">
<xsl:param name="text" select="."/>
<xsl:param name="separator" select="' '"/>
<xsl:choose>
<xsl:when test="not(contains($text, $separator))">
<data>
<xsl:value-of select="normalize-space($text)"/>
</data>
</xsl:when>
<xsl:otherwise>
<data>
<xsl:value-of select="normalize-space(substring-before($text, $separator))"/>
</data>
<xsl:call-template name="tokenize">
<xsl:with-param name="text" select="substring-after($text, $separator)"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="description|variables|categoricalvariable|realvariable">
</xsl:template>
SECOND XSLT
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- Identity Transform -->
<xsl:template match="records">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="record">
<record>
<area_name><xsl:value-of select="@label"/></area_name>
<area><xsl:value-of select="data[1]"/></area>
<region><xsl:value-of select="data[2]"/></region>
<palmitic><xsl:value-of select="data[3]"/></palmitic>
<palmitoleic><xsl:value-of select="data[4]"/></palmitoleic>
<stearic><xsl:value-of select="data[5]"/></stearic>
<oleic><xsl:value-of select="data[6]"/></oleic>
<linoleic><xsl:value-of select="data[7]"/></linoleic>
<linolenic><xsl:value-of select="data[8]"/></linolenic>
<arachidic><xsl:value-of select="data[9]"/></arachidic>
<eicosenoic><xsl:value-of select="data[10]"/></eicosenoic>
</record>
</xsl:template>
</xsl:stylesheet>
Python (using lxml module)
import lxml.etree as ET
cd = os.path.dirname(os.path.abspath(__file__))
# FIRST TRANSFORMATION
dom = ET.parse('http://www.ggobi.org/book/data/olive.xml')
xslt = ET.parse(os.path.join(cd, 'Olive.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'Olive_py.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
# SECOND TRANSFORMATION
dom = ET.parse(os.path.join(cd, 'Olive_py.xml'))
xslt = ET.parse(os.path.join(cd, 'Olive2.xsl'))
transform = ET.XSLT(xslt)
newdom = transform(dom)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)
xmlfile = open(os.path.join(cd, 'Olive_py.xml'),'wb')
xmlfile.write(tree_out)
xmlfile.close()
R
library(XML)
# LOADING TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("Olive_py.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"))
View(xmldf)
Output
area_name area region palmitic palmitoleic stearic oleic linoleic linolenic arachidic eicosenoic
North-Apulia 1 1 1075 75 226 7823 672 na 60
North-Apulia 1 1 1088 73 224 7709 781 31 61 29
North-Apulia 1 1 911 54 246 8113 549 31 63 29
North-Apulia 1 1 966 57 240 7952 619 50 78 35
North-Apulia 1 1 1051 67 259 7771 672 50 80 46
...
(slight cleanup on very first record is needed as an extra space was added after "na" in xml doc, so arachidic
and eicosenoic
were shifted forward)