1

I´m trying to analize meteorological data, importing directly to R from it´s native structure in XML. But it seems to be a very complicated XML format not corresponding to the commonly used standard of "one observation per row". The provider of the data has grouped the variables by every ten minutes intervals registered.

Here is a piece of the XML code:

<?xml version= "1.0" encoding="ISO-8859-1" ?>
<mes xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="C069_2018_1.xsd">
    <dia Dia="2018-1-01">
        <hora Hora="00:00">
            <Meteoros>
                <Dir.Med._a_1800cm>250.5</Dir.Med._a_1800cm>
                <Humedad._a_170cm>43.94</Humedad._a_170cm>
                <Irradia.._a_273cm>0.0</Irradia.._a_273cm>
                <Precip.._a_144cm>0.0</Precip.._a_144cm>
                <Sig.Dir._a_1800cm>17.82</Sig.Dir._a_1800cm>
                <Sig.Vel._a_1800cm>2.78</Sig.Vel._a_1800cm>
                <Tem.Aire._a_170cm>12.57</Tem.Aire._a_170cm>
                <Vel.Max._a_1800cm>15.48</Vel.Max._a_1800cm>
                <Vel.Med._a_1800cm>8.6</Vel.Med._a_1800cm>
            </Meteoros>
        </hora>
        <hora Hora="00:10">
            <Meteoros>
                <Dir.Med._a_1800cm>249.3</Dir.Med._a_1800cm>
                <Humedad._a_170cm>44.65</Humedad._a_170cm>
                <Irradia.._a_273cm>0.0</Irradia.._a_273cm>
                <Precip.._a_144cm>0.0</Precip.._a_144cm>
                <Sig.Dir._a_1800cm>20.21</Sig.Dir._a_1800cm>
                <Sig.Vel._a_1800cm>2.32</Sig.Vel._a_1800cm>
                <Tem.Aire._a_170cm>12.55</Tem.Aire._a_170cm>
                <Vel.Max._a_1800cm>14.5</Vel.Max._a_1800cm>
                <Vel.Med._a_1800cm>7.8</Vel.Med._a_1800cm>
            </Meteoros>
        </hora>
        <hora Hora="00:20">
            <Meteoros>
                <Dir.Med._a_1800cm>250.3</Dir.Med._a_1800cm>
                <Humedad._a_170cm>46.17</Humedad._a_170cm>
                <Irradia.._a_273cm>0.0</Irradia.._a_273cm>
                <Precip.._a_144cm>0.0</Precip.._a_144cm>
                <Sig.Dir._a_1800cm>23.02</Sig.Dir._a_1800cm>
                <Sig.Vel._a_1800cm>2.25</Sig.Vel._a_1800cm>
                <Tem.Aire._a_170cm>12.45</Tem.Aire._a_170cm>
                <Vel.Max._a_1800cm>13.72</Vel.Max._a_1800cm>
                <Vel.Med._a_1800cm>5.55</Vel.Med._a_1800cm>
            </Meteoros>
        </hora>
...

And here is the full XML for the data of january 2019 (>60 mb):

http://opendata.euskadi.eus/contenidos/ds_meteorologicos/met_stations_ds_2018/opendata/2018/C069/C069_2018_1.xml

When I used the function "xmlTreeParse" I got the error:

"Error: XML content does not seem to be XML"

It´s my first attempt with XML data structure, but I´ve been trying the recomendations of similar questions on this site as:

Transforming data from xml into R dataframe

xml to dataframe in r

R XML to Dataframe

But those seem to be simple XML structures that works just fine parsing directly or even converting directly to dataframes with the libraries "XML" and "methods"

I need to obtain a dataframe with similar structure to this:

dia hora    Dir.Med._a_1800cm   Humedad._a_170cm    Irradia.._a_273cm   Precip.._a_144cm    Sig.Dir._a_1800cm   Sig.Vel._a_1800cm   Tem.Aire._a_170cm   Vel.Max._a_1800cm   Vel.Med._a_1800cm
01/01/2018  0:00    250.5   43.94   0.0 0.0 17.82   2.78    12.57   15.48   8.6
01/01/2018  0:10    249.3   44.65   0.0 0.0 20.21   2.32    12.55   14.5    7.8
01/01/2018  0:20    250.3   46.17   0.0 0.0 23.02   2.25    12.45   13.72   5.55
Mendeku
  • 25
  • 4
  • 1
    Complicated XML generally can't be parsed to data frame directly. Read the XML document to a list or to a XML document object in R first. The use XPath Syntax to extract nodes you need. I recommend you to try`xml2` package. – yusuzech Sep 10 '19 at 17:30

2 Answers2

0

Consider XSLT, special-purpose language designed to transform XML files and sibling to XPath, to migrate the top level dia and hora nodes into Meteoros level regardless of number of nodes:

XSLT (save as .xsl, a special .xml file)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="yes" />
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANFORM -->
    <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- REPEAT dia and hora NODES -->
    <xsl:template match="Meteoros">
        <xsl:copy>
            <dia><xsl:value-of select="ancestor::dia/@Dia"/></dia>
            <hora><xsl:value-of select="ancestor::hora/@Hora"/></hora>    
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Online demo

R (no loops or mapping needed)

library(XML)
library(xslt)

doc <- read_xml("Import.xml", package = "xslt")
style <- read_xml("Script.xsl", package = "xslt")

new_xml <- xml_xslt(doc, style)

new_doc <- XML::xmlParse(new_xml)    
meteoros_df <- XML::xmlToDataFrame(nodes=getNodeSet(new_doc, "//Meteoros"))
Parfait
  • 104,375
  • 17
  • 94
  • 125
0

It's quite some work but not impossible. This solution will also work with different number of observations per day.


dia <- xmlstr %>% read_xml() %>%  xml_find_all("//dia")
dia.dat <- dia %>% map(xml_attrs) %>% map(~t(.) %>% as_tibble)

hora <- dia %>% map(xml_children) 
hora.dat <- hora %>% map(xml_attrs) %>% map(~map_df(., ~t(.) %>% as_tibble))

hora.details <- hora %>% 
  map(~map(.,xml_children) %>% 
        map(xml_children) %>% 
        map(~setNames(xml_text(.), xml_name(.)) %>% t() %>% as_tibble)) %>% map(.,~do.call(rbind,.) %>% as_tibble)

pmap_df(list(dia.dat, hora.dat, hora.details),cbind)

I added some data to your xml example to test. (1 extra day and 2nd day an extra hour).

Result:


        Dia  Hora Dir.Med._a_1800cm Humedad._a_170cm Irradia.._a_273cm Precip.._a_144cm Sig.Dir._a_1800cm
1 2018-1-01 00:00             250.5            43.94               0.0              0.0             17.82
2 2018-1-01 00:10             249.3            44.65               0.0              0.0             20.21
3 2018-1-01 00:20             250.3            46.17               0.0              0.0             23.02
4 2018-1-02 00:00             250.5            43.94               0.0              0.0             17.82
5 2018-1-02 00:10             249.3            44.65               0.0              0.0             20.21
6 2018-1-02 00:20             250.3            46.17               0.0              0.0             23.02
7 2018-1-02 00:30             250.3            46.17               0.0              0.0             23.02
  Sig.Vel._a_1800cm Tem.Aire._a_170cm Vel.Max._a_1800cm Vel.Med._a_1800cm
1              2.78             12.57             15.48               8.6
2              2.32             12.55              14.5               7.8
3              2.25             12.45             13.72              5.55
4              2.78             12.57             15.48               8.6
5              2.32             12.55              14.5               7.8
6              2.25             12.45             13.72              5.55
7              2.25             12.45             13.72              5.55

credits to answers of:

converting XML nodes to a dataframe

Getting all the children nodes of XML file to data.frame or data.table

Wietze314
  • 5,942
  • 2
  • 21
  • 40
  • Thank you very much. It works perfectly. I´ve tested with the full size xml file and it works. – Mendeku Sep 11 '19 at 16:01
  • @Wietze314 I posted a question that could use your answer. But failing for some mistake that I haven't identified yet. - https://stackoverflow.com/questions/59307289/transform-xml-with-many-roots-into-dataframe – Bruno Avila Dec 12 '19 at 18:40