6

I was wondering if anyone has managed to read SDMX-XML files into a dataframe. The file I’d like to read is https://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml (1 MB). I saved the file as “pension_funds.xml” in the working directory and tried to use the XML package to read it:

library("XML")

fileName <- "pension_funds.xml"  # saved in the working directory
parsed <- xmlTreeParse(fileName, getDTD=FALSE)
r <- xmlRoot(parsed)
tmp <- xmlSApply(r, function(x) xmlSApply(x, xmlValue))

The few lines above basically follow the example at http://www.omegahat.org/RSXML/gettingStarted.html, but I think I would first need to somehow skip the header (I have pasted the start of the file I’m trying to read below). So the above might work, but it starts from the wrong node for my purposes. I would like to grab the OBS_VALUEs, indexed by their TIME_PERIOD and REF_AREA.

The first thing would be to find the right node and start there. However, I suspect I might be on a fool’s errand, since I have limited knowledge of data formats and I’m not sure the XML package can be used for SDMX-XML files. Smarter people appear to have tried to do this: http://opensdmxdevelopers.wikispaces.com/RSDMX. I can’t find that package for download on its homepage at https://r-forge.r-project.org/projects/rsdmx/ (I can’t see any link/download section, but maybe I’m blind) and it seems to be at an early stage. The existence of rsdmx suggests that using the XML package to read SDMX might not be easy, so I’m ready to give up at this stage unless anyone has had success with this. Actually, I’m mainly interested in reading this file: http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml. But that is a 10 MB file, so I was starting smaller.

edit3: attempting sgibb's answer on the large file, using the changes from Mischa's comment:

library("XML")

url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"

sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp values to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA

  ## handler subroutine for Obs tag
  Obs <- function(name, attr) {
    ## found an Obs tag and now fill data.frame
    data[i, "refArea"] <<- refArea
    data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"])
    data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"])
    data[i, "bsItem"] <<- bsItem
    data[i, "bsCountSector"] <<- bsCountSector
    i <<- i + 1
  }

  ## handler subroutine for Series tag
  Series <- function(name, attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- as.character(attr["BS_ITEM"])
    bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
  }
  return(list(getData=function() {return(data)},
              Obs=Obs, Series=Series))
}

## run parser
df <- xmlEventParse(file(url), handlers=sdmxHandler())$getData()

This fails with:

Error: 1: Specification mandate value for attribute OBS_VALUE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 15108
4: Premature end of data in tag Series line 15041
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2
In addition: There were 50 or more warnings (use warnings() to see the first 50)

edit2: the answer from sgibb looks ideal and works perfectly on the smaller file. I tried to run it on

url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"

(the 10mb file, original link corrected), with the only modification being the addition of two lines:

data[i, "bsItem"] <<- as.character(attr["BS_ITEM"])

data[i, "bsCountSector"] <<- as.numeric(attr["BS_COUNT_SECTOR"])

(these are additional id variables which are needed to identify a row in this larger dataset). It ran for a few minutes then finished with this error:

Error: 1: Specification mandate value for attribute TIME_PE
2: attributes construct error
3: Couldn't find end of Start Tag Obs line 20743
4: Premature end of data in tag Series line 20689
5: Premature end of data in tag DataSet line 91
6: Premature end of data in tag CompactData line 2

In addition: There were 50 or more warnings (use warnings() to see the first 50)

The format of the data seems very similar, so I thought this would work. The basic structure of the 10 MB file is as below:

    <Series FREQ="M" REF_AREA="AT" ADJUSTMENT="N" BS_REP_SECTOR="A" BS_ITEM="A20" MATURITY_ORIG="A" DATA_TYPE="1" COUNT_AREA="U2" BS_COUNT_SECTOR="0000" CURRENCY_TRANS="Z01" BS_SUFFIX="E" TIME_FORMAT="P1M" COLLECTION="E">
        <Obs TIME_PERIOD="1997-09" OBS_VALUE="275.3" OBS_STATUS="A" OBS_CONF="F"/>
        <Obs TIME_PERIOD="1997-10" OBS_VALUE="275.9" OBS_STATUS="A" OBS_CONF="F"/>
        <Obs TIME_PERIOD="1997-11" OBS_VALUE="276.6" OBS_STATUS="A" OBS_CONF="F"/>

edit1:

desired data format:

Ref_area    time_period obs_value

At  2006    118    
At  2007    119    
…    
Be  2006    101    
…

Here’s the first bit of the data.

    </Header>
    <DataSet xsi:schemaLocation="https://www.ecb.europa.eu/vocabulary/stats/icpf/1 https://www.ecb.europa.eu/stats/sdmx/icpf/1/structure/2011-08-11/sdmx-compact.xsd" xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1">
<Group DECIMALS="0" TITLE_COMPL="Austria, reporting institutional sector Insurance corporations and pension funds - Closing balance sheet - All financial assets and liabilities - counterpart area World (all entities), counterpart institutional sector Total economy including Rest of the World (all sectors) - Credit (resources/liabilities) - Non-consolidated, Current prices - Euro, Neither seasonally nor working day adjusted - ESA95 TP table Not applicable" UNIT_MULT="9" UNIT="EUR" ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT"/><Series ESA95TP_SUFFIX="Z" ESA95TP_DENOM="E" ESA95TP_CONS="N" ESA95TP_DC_AL="2" ESA95TP_CPSECTOR="S" ESA95TP_CPAREA="A1" ESA95TP_SECTOR="S125" ESA95TP_ASSET="F" ESA95TP_TRANS="LE" ESA95TP_PRICE="V" ADJUSTMENT="N" REF_AREA="AT" COLLECTION="E" TIME_FORMAT="P1Y" FREQ="A"><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="112" TIME_PERIOD="2008"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="119" TIME_PERIOD="2009"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="125" TIME_PERIOD="2010"/><Obs OBS_CONF="F" OBS_STATUS="E" OBS_VALUE="127" TIME_PERIOD="2011"/></Series><Group D
Aidan
  • forgot to paste the initial lines of the file. first there's a long header i don't need, starts like this: – Aidan Aug 13 '12 at 10:17
  • What data do you need? Do you want the obs data per series or per group? Could you give an example of how you want the data? And how you'd like to preserve the relations? – Mischa Vreeburg Aug 13 '12 at 11:15
  • These three items would identify one row: REF_AREA="AT" TIME_PERIOD="2008" OBS_VALUE="119" So a format like the below would be ideal. Ref_area time_period obs_value At 2006 118 At 2007 119 … Be 2006 101 … The xml file has the all data for a given ref_area then moves on to the next ref_area. This file has only one variable. The second link I included at the end has several variables. In that case I would need an extra column with a variable id, then the above format repeated within each variable. – Aidan Aug 13 '12 at 11:50
  • sorry, that formatting wasn't clear so I edited the original post. – Aidan Aug 13 '12 at 11:54

3 Answers

5

RSDMX seems to be in an early state of development. IMHO there is no package available yet, but you could easily implement this on your own using the XML package. I would suggest using xmlEventParse (see ?xmlEventParse for details):

EDIT: adapt example to changed requirements of outstanding_amounts.xml
EDIT2: add download.file

library("XML")

#url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml"
url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml"

## download xml file to avoid download errors disturbing xmlEventParse
tmp <- tempfile()
download.file(url, tmp) 

sdmxHandler <- function() {
  ## data.frame which stores results
  data <- data.frame(stringsAsFactors=FALSE)
  ## counter to store current row
  i <- 1
  ## temp value to store current REF_AREA, BS_ITEM and BS_COUNT_SECTOR
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA

  ## handler subroutine for Obs tag
  Obs <- function(name, attr) {
    ## found an Obs tag and now fill data.frame
    data[i, "refArea"] <<- refArea
    data[i, "bsItem"] <<- bsItem
    data[i, "bsCountSector"] <<- bsCountSector
    data[i, "timePeriod"] <<- as.Date(paste(attr["TIME_PERIOD"], "-01", sep=""), format="%Y-%m-%d")
    data[i, "obsValue"] <<- as.double(attr["OBS_VALUE"])
    ## update current row
    i <<- i + 1
  }

  ## handler subroutine for Series tag
  Series <- function(name, attr) {
    refArea <<- attr["REF_AREA"]
    bsItem <<- attr["BS_ITEM"]
    bsCountSector <<- as.numeric(attr["BS_COUNT_SECTOR"])
  }

  return(list(getData=function() {return(data)},
              Obs=Obs, Series=Series))
}

## run parser
df <- xmlEventParse(tmp, handlers=sdmxHandler())$getData()

head(df)
#  refArea bsItem bsCountSector timePeriod obsValue
#1      DE    A20          2210      12053     39.6
#2      DE    A20          2210      12084     46.1
#3      DE    A20          2210      12112     50.2
#4      DE    A20          2210      12143     52.0
#5      DE    A20          2210      12173     52.3
#6      DE    A20          2210      12204     47.3
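
A small note on the output above: assigning a Date into a data.frame cell with data[i, "timePeriod"] <<- ... drops the Date class along the way, so timePeriod is stored as the number of days since 1970-01-01 (hence values like 12053). If you want actual dates back, one line after parsing restores them:

df$timePeriod <- as.Date(df$timePeriod, origin = "1970-01-01")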
sgibb
  • this answer does exactly what I need on the smaller file. I tried to run it on a larger file (see edit2 in the post above) but it seems I may need to do a little more work to apply it to this file – Aidan Aug 13 '12 at 14:55
    Use these Changes ## temp value to store current REF_AREA refArea <- NA bsItem <- NA bsCountSector <- NA Series <- function(name, attr) { refArea <<- attr["REF_AREA"] bsItem <<- as.character(attr["BS_ITEM"]) bsCountSector <<- as.numeric(attr["BS_ITEM"]) } Obs <- function(name, attr) { ## found an Obs tag and now fill data.frame data[i, "refArea"] <<- refArea data[i, "timePeriod"] <<- as.numeric(attr["TIME_PERIOD"]) data[i, "obsValue"] <<- as.numeric(attr["OBS_VALUE"]) data[i, "bsItem"] <<- bsItem data[i, "bsCountSector"] <<- bsCountSector – Mischa Vreeburg Aug 13 '12 at 15:22
  • thanks Mischa. I edited the post with your changes, however I still run into an error. Maybe it is because the file is too big? – Aidan Aug 13 '12 at 16:00
  • @Aidan: I fixed my example code. Please note: it took very long time. I used a small file with only 5000 lines out of outstanding_amounts.xml for testing. – sgibb Aug 13 '12 at 17:22
  • Thanks @sgibb for expanding your code for me. Trying it out on the full file I still get an error like the one in Edit3 of the post above, although slightly different every time. I get the error after processing for around two minutes. It says ... Specification mandate value for attribute O attributes construct error Couldn't find end of Start Tag Obs line 14137... or after running again, not quite the same message: ... Specification mandate value for attribute OBS_ attributes construct error Couldn't find end of Start Tag Obs line 15549. – Aidan Aug 14 '12 at 07:38
  • @Aidan: Works for me. Maybe your file isn't downloaded correctly/completely. That's why I add the `download.file` part to ensure we get the whole file. – sgibb Aug 14 '12 at 09:12
  • Thanks @sgibb, that seems to have fixed it. it is still running after about 150 minutes. The sdmx standard seems pretty bad from the user's point of view, but it's great to have your code since many organisations are using sdmx. I'll update when it finishes. – Aidan Aug 14 '12 at 12:12
  • Just to say I ran @sgibb's code on the whole file and it worked perfectly. It took 5hours (18970seconds) using 4gb of RAM, Win7 32bit, and R 2.14. I have to wonder about the long term future of sdmx given the time needed, but for now this code will allow me to read a lot of different sdmx files so thanks for all the help! – Aidan Aug 15 '12 at 07:31
  • Isn't the main reason this is slow that it is iteratively building up the "data" object one row at a time, instead of allocating a large enough object first off and then filling it in? – Peter Ellis Oct 15 '13 at 03:34
  • @PeterEllis: Sure, but do you know how to get the number of rows before reading the complete file? – sgibb Oct 15 '13 at 06:50
  • Hello all, the package rsdmx is no longer at early stage, and you can now access it from CRAN. http://cran.r-project.org/web/packages/rsdmx/index.html See below post. – eblondel Jan 19 '15 at 17:27
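
Regarding Peter Ellis's point a few comments up about growing data one row at a time: you don't need to know the row count in advance. A sketch of one alternative (my own variation on the handler above, not part of the original answer) is to collect each Obs as a one-row data.frame in a list and only rbind at the end; rbind-ing thousands of small data.frames has its own cost, but it avoids repeatedly copying the ever-growing data object:

sdmxHandler2 <- function() {
  rows <- list()  ## one small data.frame per Obs, combined only at the end
  refArea <- NA
  bsItem <- NA
  bsCountSector <- NA

  Series <- function(name, attr) {
    refArea <<- as.character(attr["REF_AREA"])
    bsItem <<- as.character(attr["BS_ITEM"])
    bsCountSector <<- as.character(attr["BS_COUNT_SECTOR"])
  }

  Obs <- function(name, attr) {
    rows[[length(rows) + 1]] <<- data.frame(
      refArea=refArea, bsItem=bsItem, bsCountSector=bsCountSector,
      timePeriod=as.character(attr["TIME_PERIOD"]),
      obsValue=as.double(attr["OBS_VALUE"]),
      stringsAsFactors=FALSE)
  }

  return(list(getData=function() {do.call(rbind, rows)},
              Obs=Obs, Series=Series))
}

## same downloaded file (tmp) as in the answer above
df <- xmlEventParse(tmp, handlers=sdmxHandler2())$getData()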
3

The package rsdmx allows you to read SDMX-ML files and coerce them into a data.frame. It is now hosted on GitHub and is currently available on CRAN, but you can also easily install it from GitHub with the following:

require("devtools")
install_github("rsdmx", "opensdmx")
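
Since it is on CRAN, the usual install works as well:

install.packages("rsdmx")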

Applying to your data, you can do the following:

sdmx <- readSDMX("http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml")
df <- as.data.frame(sdmx)
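
The same two lines should also work for the smaller pension_funds.xml file from the question, just with that URL instead:

sdmx <- readSDMX("http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml")
df <- as.data.frame(sdmx)
head(df)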

More examples are given in the rsdmx wiki

Note that it currently loads the XML object into R, as a slot of the SDMX R objects instantiated by rsdmx. In the future, we would like to investigate how rsdmx could use xmlEventParse (as suggested above by @sgibb) to read very large datasets.

eblondel
0
library(XML)

url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml"
xmlparsed <- xmlParse(url)

## obtain the Series nodes
series_data <- getNodeSet(xmlparsed, "//Series")

if (length(series_data) == 0) {
  datasetnode <- xmlChildren(xmlChildren(xmlparsed)[[1]])[[2]]
  series_data <- xmlChildren(datasetnode)[names(xmlChildren(datasetnode)) == "Series"]
}

## prepare dataset
dataset.frame <- data.frame(matrix(ncol=3))
colnames(dataset.frame) <- c('REF_AREA', 'TIME_PERIOD', 'OBS_VALUE')

## loop over data
counter <- 1
for (i in 1:length(series_data)) {
  if ('Obs' %in% names(xmlChildren(series_data[[i]]))) { ## ignore empty //Series nodes
    for (j in 1:length(xmlChildren(series_data[[i]]))) {
      dataset.frame[counter, 1] <- xmlAttrs(series_data[[i]])['REF_AREA']
      dataset.frame[counter, 2] <- xmlAttrs(series_data[[i]][[j]])['TIME_PERIOD']
      dataset.frame[counter, 3] <- xmlAttrs(series_data[[i]][[j]])['OBS_VALUE']
      counter <- counter + 1
    }
  }
}

head(dataset.frame, 5)
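
One remark on the if (length(series_data) == 0) fallback: the DataSet element declares a default namespace (xmlns="https://www.ecb.europa.eu/vocabulary/stats/icpf/1"), so the plain "//Series" XPath can return nothing and the code then has to walk the children by hand. A namespace-agnostic XPath (my variation, not in the original code) avoids the fallback entirely:

series_data <- getNodeSet(xmlparsed, "//*[local-name()='Series']")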
Mischa Vreeburg
  • Thanks Mischa! Forgive my ignorance but how do I pass the url argument to file in the second line? It doesn’t seem to want url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml" or url <- "http://www.ecb.europa.eu/stats/sdmx/bsi/1/data/outstanding_amounts.xml" (Error in file.exists(file) : invalid 'file' argument) – Aidan Aug 13 '12 at 15:10
  • Sorry forgot line url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml" – Mischa Vreeburg Aug 13 '12 at 15:11
  • Add these 2 lines befor counter=counter+1 dataset.frame[counter,4] <- xmlAttrs(series_data[[i]])['BS_ITEM'] dataset.frame[counter,5] <- xmlAttrs(series_data[[i]])['BS_ITEM'] and change the ncol=3 to 5, then your file should work. – Mischa Vreeburg Aug 13 '12 at 15:19
  • I'm probably missing something very obvious and basic but I still get an error after either url <- "ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml" or url <- "http://www.ecb.europa.eu/stats/sdmx/icpf/1/data/pension_funds.xml" and xmlparsed <- xmlParse(file(url)) Error in file.exists(file) : invalid 'file' argument – Aidan Aug 13 '12 at 15:52