1

I am trying to parse the data contained in an XML file using R. I would like the data to be read into a dataframe. The data is located at the following URL:

http://www.usda.gov/oce/commodity/wasde/report_format/latest-July-2015-New-Format.xml

I have reviewed some similar posts on stackoverflow such as How to convert xml data to data frame in R but I still cannot seem to get them to work. One issue I ran into is that some of the values are empty and then appear like this:

<Cell Textbox73="filler" /> 

The general key value pair is structured as follows:

<m1_attribute_group>
   <s3 attribute1="Trade 2/">
       <s4>
           <Cell cell_value1="372.02"/>
       </s4>
   </s3>
</m1_attribute_group>

Is there any way to parse this file quickly with xmlToDataFrame() or xmlAttrsToDataFrame() or do I have to build custom functions to apply over the list of nodes?

Community
  • 1
  • 1
klib
  • 697
  • 2
  • 11
  • 27

1 Answers1

4

First, you need to know exactly what you want extracted. So, dissection is required. This USDA XML output is quite involved across commodities, year/months, and regions with changing nodes for attributes and text/cell values.

Consider an XSLT transformation. As information, XSLT is a special-purpose declarative language that transforms XML files into any nuances for diverse purposes. While other programming languages like R can parse various elements, values, and attributes from XML files, XSLT is specially designed to re-format whole documents to specific needs. Often considered, the forgotten language in the web data age, it can be very useful to transform complex xml data.

For your purposes here is a simple XSL stylesheet to extract Output, Total Supply, Trade, Total Use, and Ending Stocks from various commodities (two value-of declarations are made due to the changing xpaths of the USDA data):

XSLT STYLESHEET

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>

<xsl:template match="/">
  <xsl:element name="data">
     <xsl:for-each select="//m1_commodity_group/m1_year_group_Collection/m1_year_group
                            /m1_month_group_Collection/m1_month_group/m1_attribute_group_Collection">
      <xsl:element name="m1_attribute_group">

      <xsl:element name="Output">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Output']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group_Collection/m1_attribute_group[@attribute1='Output']/Textbox12/Cell/@cell_value1"/>  
      </xsl:element>      

      <xsl:element name="TotalSupply">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Total Supply']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group[@attribute1='Total&#13;&#10;Supply']/Textbox12/Cell/@cell_value1"/>
      </xsl:element>      

      <xsl:element name="Trade2">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Trade 2/']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group[@attribute1='Trade 2/']/Textbox12/Cell/@cell_value1"/>      
      </xsl:element>      

      <xsl:element name="TotalUse3">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Total&#13;&#10;Use 3/']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group[@attribute1='Total&#13;&#10;Use 3/']/Textbox12/Cell/@cell_value1"/>
      </xsl:element>

      <xsl:element name="EndingStocks">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Ending&#13;&#10;Stocks']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group[@attribute1='Ending&#13;&#10;Stocks']/Textbox12/Cell/@cell_value1"/> 
      </xsl:element>

      </xsl:element>
    </xsl:for-each>
  </xsl:element>
</xsl:template>

</xsl:stylesheet>

OUTPUT

<data>
  <m1_attribute_group>
    <Commodity>Total Grains 4/</Commodity>
    <MarketYear>2013/14</MarketYear>
    <Output>2474.27</Output>
    <TotalSupply>2930.14</TotalSupply>
    <Trade2>372.02</Trade2>
    <TotalUse3>2417.87</TotalUse3>
    <EndingStocks>512.27</EndingStocks>
  </m1_attribute_group>
  ...
</data>

Nearly all programming languages (i.e., Java, C#, Python, PHP) have XSLT transformation libraries, including R with the Sxlst package. However, below is an R script interfacing with VBA calling the Windows MSXML COM object. It requires the RDCOMClient library, an easy installation with RStudio. This is only a Windows PC solution.

R SCRIPT

library(XML)
library(RDCOMClient)

# CREATE INSTANCES OF MICROSOFT XML OBJECTS 
xmlfile = COMCreate("MSXML2.DOMDocument")
xslfile = COMCreate("MSXML2.DOMDocument")
newxmlfile = COMCreate("MSXML2.DOMDocument")

xmlstr = "http://www.usda.gov/oce/commodity/wasde/report_format/latest-July-2015-New-Format.xml"
xslstr = "C:/Path/To/XSLFile.xsl"
newxmlstr = "C:/Path/To/TransformedXMLFile.xsl"

# LOAD XML & XSLT FILES
xmlfile.async = FALSE
xmlfile$Load(xmlstr)

xslfile.async = FALSE
xslfile$Load(xslstr)

# TRANSFORM XML FILE USING XLST INTO NEW FILE
xmlfile$transformNodeToObject(xslfile, newxmlfile)
newxmlfile$Save(newxmlstr)

# CONVERT TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("C:/Path/To/TransformedXMLFile.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//m1_attribute_group"))

# UNINITIALIZE MSXML OBJECTS 
xmlfile <- NULL 
xslfile <- NULL
newxmlfile <- NULL

Above stylesheet renders the following data frame:

DATA FRAME

    Commodity   MarketYear  Output  TotalSupply Trade2  TotalUse3   EndingStocks
1   Total Grains 4/ 2013/14 2474.27 2930.14 372.02  2417.87 512.27
2   Total Grains 4/ 2014/15 (Est.)  2492.63 3004.90 375.54  2464.63 540.27
3   Total Grains 4/ 2015/16 (Proj.)     2479.30 3010.96 360.65  2489.78 521.18
4   Total Grains 4/ 2015/16 (Proj.)     2477.26 3017.53 362.61  2485.14 532.39
5   Wheat   2013/14 715.11  892.24  165.92  698.72  193.52
6   Wheat   2014/15 (Est.)  725.92  919.44  163.94  707.38  212.06
7   Wheat   2015/16 (Proj.)     721.55  921.96  158.41  719.56  202.40
8   Wheat   2015/16 (Proj.)     721.96  934.01  158.07  714.20  219.81
9   Coarse Grains 5/    2013/14 1280.98 1449.07 164.39  1237.67 211.40
10  Coarse Grains 5/    2014/15 (Est.)  1290.43 1501.83 168.58  1272.56 229.27
11  Coarse Grains 5/    2015/16 (Proj.)     1276.00 1508.58 159.74  1281.23 227.35
12  Coarse Grains 5/    2015/16 (Proj.)     1274.97 1504.24 162.30  1282.17 222.07
13  Rice, milled    2013/14 478.18  588.83  41.72   481.48  107.36
14  Rice, milled    2014/15 (Est.)  476.28  583.64  43.02   484.69  98.95
15  Rice, milled    2015/16 (Proj.)     481.74  580.43  42.51   488.99  91.44
16  Rice, milled    2015/16 (Proj.)     480.34  579.28  42.24   488.78  90.51
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • 1
    Wow I haven't heard of it but i'll give it a shot, thanks – klib Jul 31 '15 at 11:40
  • unfortunately, I have to do this in windows and no one seems to have figured out how to install the Sxlst package in windows...back to the drawing board – klib Jul 31 '15 at 13:01
  • Do you have MS Excel or MS Access installed? VBA can transform XML with [MSXML object](https://msdn.microsoft.com/en-us/library/ms763701(v=vs.85).aspx)? Alternatively, open-source [PHP](http://stackoverflow.com/questions/13742972/transform-xml-with-xslt-in-php) and [Python](http://stackoverflow.com/questions/16698935/how-to-transform-an-xml-file-using-xslt-in-python) can run xslt files. For above, I used Python. – Parfait Jul 31 '15 at 13:50
  • 1
    See my edits above. I now include a pure R script that calls the MSXML object using the RDCOMClient library. This should allow you to transform XML files with XSLT stylesheets. – Parfait Aug 04 '15 at 13:58