I am fairly new to using R and very new to using the XML package and xpath. I need to pull four elements from an xml file that looks like this (except that I have trimmed off a lot of other xmlnodes to simplify it here):
<?xml version="1.0" encoding="utf-8"?>
<iati-activities version="1.03" generated-datetime="2015-07-07T16:49:09+00:00">
<iati-activity last-updated-datetime="2014-08-11T14:36:59+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100530</iati-identifier>
<title>Improvement of basic health care</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<participating-org role="Funding" ref="EU" type="15">EU</participating-org>
<participating-org role="Funding" type="21">Cordaid Memisa</participating-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="21">CORDAID RCA</participating-org>
<recipient-country percentage="100" code="CF">CENTRAL AFRICAN REPUBLIC</recipient-country>
<budget type="1">
<period-start iso-date="2010-01-01"></period-start>
<period-end iso-date="2013-02-28"></period-end>
</budget>
</iati-activity>
<iati-activity last-updated-datetime="2013-07-19T14:12:14+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100625</iati-identifier>
<title>Pigs for Pencils</title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Funding" type="60">Stichting Kapatiran</participating-org>
<participating-org role="Implementing" type="22">PREDA Foundation Inc.</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
<period-start iso-date="2010-04-20"></period-start>
<period-end iso-date="2012-10-02"></period-end>
<value value-date="2010-04-20">12500</value>
</budget>
</iati-activity>
<iati-activity last-updated-datetime="2015-04-08T03:01:58+00:00" xml:lang="en" default-currency="EUR">
<iati-identifier>NL-KVK-41160054-100815</iati-identifier>
<title>Job and housing opportunities for women </title>
<reporting-org ref="NL-KVK-41160054" type="21">Stichting Cordaid</reporting-org>
<participating-org role="Funding" ref="NL-1" type="10">Dutch Ministry of Foreign Affairs</participating-org>
<participating-org role="Implementing" type="22">WISE</participating-org>
<participating-org role="Accountable" ref="NL-KVK-41160054" type="21">Cordaid</participating-org>
<budget type="2">
<period-start iso-date="2010-10-01"></period-start>
<period-end iso-date="2011-12-31"></period-end>
<value value-date="2010-10-01">227000</value>
</budget>
</iati-activity>
</iati-activities>
Also this is my first question ever on StackOverflow, so apologies if I'm not doing it correctly (and bc that xml is not perfectly aligned). The elements I need, and what I'm assigning them to are:
UniqueID <- "//iati-activity/iati-identifier"
GrantTitle <- "//iati-activity/title"
GrantAmount <- "//iati-activity/budget/value"
Recipient <- "//iati-activity/participatingorg[@role='Implementing']"
So far (after much trial and tribulation) I have come up with this code, that goes through the current node (x), pulling the 4 variables, and cbinding them into a row, then using xpathApply to loop through iati-activity nodes calling the function and rbinding the resulting rows together.
This code works when all four elements exist in each activity. However, note the absence of the budget/value node from the xml sample. This is because I removed it in order to solve this problem of missing nodes, which occur frequently in the full file for almost all the elements I need.
Also note the [1] at the end of my xpath expressions- I've included these because there are also multiple titles, multiple participating-orgs of all types, etc.
Given the multiples of some elements and the nonexistence of others, it makes it impossible to simple pull all the same elements into a vector and pop it into a data frame. Thus the need to loop through each activity pulling the elements with it. My code currently doesn't work to account for missing elements (the missing budget/value in the first iati-activity) because cbinding (and rbinding) ignore null vectors.
xmltestNA = xmlInternalTreeParse("XMLtoDF_TestNA.xml", useInternalNodes=TRUE)
bodyToDF <- function(x){
UniqueID <- xpathSApply(x, "./iati-identifier", xmlValue)
GrantTitle <- xpathSApply(x, "./title[1]", xmlValue)
GrantAmount <- xpathSApply(x, "./budget/value[1]", xmlValue)
Recipient <- xpathSApply(x, "./participating-org[@role='Implementing'][1]", xmlValue)
cbind(UniqueID=UniqueID, GrantTitle=GrantTitle, GrantAmount=GrantAmount, Recipient=Recipient)
}
res <-xpathApply(xmltestNA, '//iati-activity', fun=bodyToDF)
IatiNA <-do.call(rbind, res)
IatiNA
How can I keep the null values/missing nodes in order to turn it into a matrix or dataframe that looks like this:
UniqueID GrantTitle GrantAmount Recipient
1 NL-KVK-41160054-100530 Improvement of basic health care NA CORDAID RCA
2 NL-KVK-41160054-100625 Pigs for Pencils 12500 PREDA Foundation Inc.
3 NL-KVK-41160054-100815 Job and housing opportunities for women 227000 WISE
Because I'm still new, the simpler the code, the better. Thanks in advance!