0

I'm trying to read this file in R. I tried to use the XML package, but I have no idea what is in the data set, and I haven't used the package before.

I'd appreciate any help from you guys.

Thanks.

Sebastián.

Iguananaut
user3311904

2 Answers

4

There's no way around it - you need to understand XML and XPath to use it in R. Assuming you do, view the document in a browser to get an idea of its structure. Then, this should get you started using the XML package.

library(XML)
# Download and parse the document into an internal XML tree
xml <- xmlParse("http://data.mcc.gov/raw/xml/MCC_HN.xml")

# One XPath query per field; xmlValue extracts the text of each matching node
org        <- xpathApply(xml,"//iati-activity/reporting-org",xmlValue)
id         <- xpathApply(xml,"//iati-activity/iati-identifier",xmlValue)
title      <- xpathApply(xml,"//iati-activity/title",xmlValue)
# Tags that repeat within an activity are told apart by their type= attribute
desc.1     <- xpathApply(xml,"//iati-activity/description[@type='1']",xmlValue)
desc.2     <- xpathApply(xml,"//iati-activity/description[@type='2']",xmlValue)
status     <- xpathApply(xml,"//iati-activity/activity-status",xmlValue)
start.planned <- xpathApply(xml,"//iati-activity/activity-date[@type='start-planned']",xmlValue)
start.actual  <- xpathApply(xml,"//iati-activity/activity-date[@type='start-actual']",xmlValue)
end.planned   <- xpathApply(xml,"//iati-activity/activity-date[@type='end-planned']",xmlValue)
end.actual    <- xpathApply(xml,"//iati-activity/activity-date[@type='end-actual']",xmlValue)

# Combine the fields column-wise (assumes each query returns one match per activity)
df <- data.frame(cbind(org,id, title, status, 
                       start.planned, start.actual, end.planned, end.actual,
                       desc.1, desc.2))

Read the documentation on the functions I've used above, e.g. xmlParse(...), xpathApply(...), and xmlValue(...) to figure out what the code is doing.

One note: there is a function xmlToDataFrame(...) in the XML package. The problem with your document is that you have multiple elements with the same tag name (examples: description and activity-date), which are disambiguated using the type= attribute. xmlToDataFrame(...) doesn't know how to deal with that, so you need to do it the hard way...
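
To make the type= point concrete, here is a minimal sketch on a made-up document. The inline XML is hypothetical and only loosely modelled on the real file; it also swaps in xpathSApply(...), the simplifying variant of xpathApply(...), so the results come back as plain character vectors. It assumes every activity has exactly one element of each kind, otherwise the columns would not line up.

```r
library(XML)

# Hypothetical miniature document (not the real MCC data)
txt <- '<iati-activities>
  <iati-activity>
    <title>Road project</title>
    <description type="1">General</description>
    <description type="2">Objectives</description>
  </iati-activity>
  <iati-activity>
    <title>Water project</title>
    <description type="1">Overview</description>
    <description type="2">Targets</description>
  </iati-activity>
</iati-activities>'
doc <- xmlParse(txt, asText = TRUE)

# The [@type='...'] predicate selects one of the two same-named tags
title  <- xpathSApply(doc, "//iati-activity/title", xmlValue)
desc.1 <- xpathSApply(doc, "//iati-activity/description[@type='1']", xmlValue)
desc.2 <- xpathSApply(doc, "//iati-activity/description[@type='2']", xmlValue)

# Character vectors of equal length go straight into a data frame
df <- data.frame(title, desc.1, desc.2, stringsAsFactors = FALSE)
print(df)
```

Each predicate yields one value per activity here, so the two descriptions end up in separate columns instead of colliding under a single "description" name.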

BenMorel
jlhoward
  • How did you know which variables are inside the XML data? Your help was very handy. Thanks! – user3311904 Feb 17 '14 at 14:36
  • You need to view the file in a browser, basically. There are programmatic ways to view a list of tag names, but viewing in a browser is easiest. – jlhoward Feb 17 '14 at 17:16
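
One programmatic way to list the tag names, sketched on a made-up document (the inline XML is hypothetical; for the real file you would pass the URL to xmlParse(...) instead). The idea is to select every element with the XPath "//*" and tabulate the names.

```r
library(XML)

# Hypothetical toy document standing in for the real file
doc <- xmlParse('<root><a><b>1</b><b>2</b></a><c type="x">3</c></root>',
                asText = TRUE)

# "//*" matches every element; xmlName returns each element's tag name
tags <- xpathSApply(doc, "//*", xmlName)

# Count how often each tag occurs
tag.counts <- table(tags)
print(tag.counts)
```

This only shows which tags exist and how often, not how they nest, so looking at the file in a browser is still the quickest way to see the hierarchy.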
1

It's not really clear what you want to do with the data, but here is how to load it:

library(XML)
xml <- xmlParse("http://data.mcc.gov/raw/xml/MCC_HN.xml")

Then query the result for all "transaction" records and make them into a data frame

df <- xmlToDataFrame(xml["//transaction"])

with

> dim(df)
[1] 730  11
> head(df, 2)
  aid-type
1         
2         
                                                                description
1   Commitment: Honduras-614G Fund-Not Applicable-Not Applicable-2011-04-01
2 Disbursement: Honduras-614G Fund-Not Applicable-Not Applicable-2011-04-01
  disbursement-channel finance-code flow-type                     provider-org
1                                             Millennium Challenge Corporation
2                                             Millennium Challenge Corporation
  receiver-org tied-status transaction-date transaction-type     value
1     Honduras                   2011-04-01       COMMITMENT 274380.75
2     Honduras                   2011-04-01     DISBURSEMENT      0.00
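
For anyone who wants to try the same pattern without downloading the file, here is a small sketch on made-up data. The inline XML is hypothetical, and it uses the nodes= argument of xmlToDataFrame(...) together with getNodeSet(...), which should be equivalent to the xml["//transaction"] shorthand above.

```r
library(XML)

# Hypothetical records; each <transaction> has the same child tags
txt <- '<activity>
  <transaction><transaction-type>COMMITMENT</transaction-type><value>10</value></transaction>
  <transaction><transaction-type>DISBURSEMENT</transaction-type><value>20</value></transaction>
</activity>'
doc <- xmlParse(txt, asText = TRUE)

# One row per matched node, one column per child element
df <- xmlToDataFrame(nodes = getNodeSet(doc, "//transaction"),
                     stringsAsFactors = FALSE)
print(df)
```

The columns come back as text, so numeric fields such as value still need as.numeric(...) afterwards.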

Maybe you'd like to extract the attribute associated with 'aid-type' and add it to the data frame; use XPath to do so

df$`aid-type-code` <- as.character(xml["//aid-type/@code"])
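
The one-liner above relies on every transaction having exactly one aid-type with a code attribute; if some lacked it, the vector would be shorter than the data frame. A more defensive sketch, on made-up data (the inline XML and the codes in it are hypothetical), visits each transaction and returns NA where the attribute is missing:

```r
library(XML)

# Hypothetical records; the second transaction has no <aid-type>
txt <- '<transactions>
  <transaction><aid-type code="C01"/><value>10</value></transaction>
  <transaction><value>20</value></transaction>
  <transaction><aid-type code="D02"/><value>30</value></transaction>
</transactions>'
doc <- xmlParse(txt, asText = TRUE)

# Visit each transaction so the result has one entry per record
codes <- xpathSApply(doc, "//transaction", function(node) {
  kids <- getNodeSet(node, "./aid-type")
  if (length(kids) == 0) {
    NA_character_                                  # no <aid-type> child
  } else {
    xmlGetAttr(kids[[1]], "code", NA_character_)   # NA if attribute absent
  }
})
print(codes)
```

Because the result always has one element per transaction, it can be assigned to a column of the data frame without the rows shifting.
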
Martin Morgan