4

I'm trying to convert an xml file to a dataframe, but the format seems to be off. I've looked at different tutorials and, while I've been moderately succesful at getting the information I need using a for loop and navigating the parsed file, I've been told that this solution is not very efficient.

I tried this code then:

require(XML)
parsed<-xmlParse("SEWL.xml")
xmlToDataFrame(parsed)

But it gives an error: Error in [<-.data.frame(*tmp*, i, names(nodes[[i]]), value = c("\"LL18179\"\"2016/08\"0.32485.43896.59801.2131\"OK\"", : duplicate subscripts for columns

This other code works, but the formatting is not what I need:

require(XML)
require(plyr)
pldf<-ldply(xmlToList("SEWL.xml"),data.frame)

The resulting dataframe is as follows:

          .id              X..i.. text  .attrs test.code test.validuntil test.meas.text test.meas..attrs test.meas.text.1
1  technician              "John" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
2    location                "CO" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
3        temp                <NA> 21.3 celsius      <NA>            <NA>           <NA>             <NA>             <NA>
4     runtype           "routine" <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
5      sample                <NA> <NA>    2323 "LL18179"       "2016/08"         0.3248         baseline           5.4389
6      sample                <NA> <NA>    2323 "LL18179"       "2016/08"         0.3248         baseline           5.4389
7      sample                <NA> <NA> 8979237 "AA09453"       "2016/03"         0.0117         baseline           5.6012
8      sample                <NA> <NA> 8979237 "AA09453"       "2016/03"         0.0117         baseline           5.6012
9      .attrs 2015_07_31_11_33_22 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
10     .attrs            20150731 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
11     .attrs              113322 <NA>    <NA>      <NA>            <NA>           <NA>             <NA>             <NA>
   test.meas..attrs.1 test.meas.text.2 test.meas..attrs.2 test.calc test.result test..attrs test.code.1 test.validuntil.1
1                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
2                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
3                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
4                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
5                 std           6.5980               data    1.2131        "OK"      laslum "ATR150607"         "2017/05"
6                 std           6.5980               data    1.2131        "OK"           3 "ATR150607"         "2017/05"
7                 std           1.1431               data    0.2041      "FAIL"       absat        <NA>              <NA>
8                 std           1.1431               data    0.2041      "FAIL"           2        <NA>              <NA>
9                <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
10               <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
11               <NA>             <NA>               <NA>      <NA>        <NA>        <NA>        <NA>              <NA>
   test.meas.text.3 test.meas..attrs.3 test.meas.text.4 test.meas..attrs.4 test.meas.text.5 test.meas..attrs.5
1              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
2              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
3              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
4              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
5            0.0673           baseline           4.9721                std          10.3851               data
6            0.0673           baseline           4.9721                std          10.3851               data
7              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
8              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
9              <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
10             <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
11             <NA>               <NA>             <NA>               <NA>             <NA>               <NA>
   test.calc.1 test.result.1 test..attrs.1
1         <NA>          <NA>          <NA>
2         <NA>          <NA>          <NA>
3         <NA>          <NA>          <NA>
4         <NA>          <NA>          <NA>
5       2.0886     "Warning"           atr
6       2.0886     "Warning"             1
7         <NA>          <NA>          <NA>
8         <NA>          <NA>          <NA>
9         <NA>          <NA>          <NA>
10        <NA>          <NA>          <NA>
11        <NA>          <NA>          <NA>

This is the example XML file that I'm using:

<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
    <technician>"John"</technician>
    <location>"CO"</location>
    <temp scale="celsius">21.3</temp>
    <runtype>"routine"</runtype>
    <sample id="2323">
        <test name="laslum" order="3">
            <code>"LL18179"</code>
            <validuntil>"2016/08"</validuntil>
            <meas name="baseline">0.3248</meas>
            <meas name="std">5.4389</meas>
            <meas name="data">6.5980</meas>
            <calc>1.2131</calc>
            <result>"OK"</result>
        </test>
        <test name="atr" order="1">
            <code>"ATR150607"</code>
            <validuntil>"2017/05"</validuntil>
            <meas name="baseline">0.0673</meas>
            <meas name="std">4.9721</meas>
            <meas name="data">10.3851</meas>
            <calc>2.0886</calc>
            <result>"Warning"</result>
        </test>
    </sample>
    <sample id="8979237">
        <test name="absat" order="2">
            <code>"AA09453"</code>
            <validuntil>"2016/03"</validuntil>
            <meas name="baseline">0.0117</meas>
            <meas name="std">5.6012</meas>
            <meas name="data">1.1431</meas>
            <calc>0.2041</calc>
            <result>"FAIL"</result>
        </test>
    </sample>
</experiment>

And the dataframe that I'm hoping to get:

  experiment technician location temp runtype  sample   test order      code validuntil baseline    std    data   calc  result     date   time
1     abc123       John       CO 21.3 routine    2323 laslum     3   LL18179    2016/08   0.3248 5.4389  6.5980 1.2131      OK 20150731 113322
2     abc123       John       CO 21.3 routine    2323    atr     1 ATR150607    2017/05   0.0673 4.9721 10.3851 2.0886 Warning 20150731 113322
3     abc123       John       CO 21.3 routine 8979237  absat     2   AA09453    2016/03   0.0117 5.6012  1.1431 0.2041    FAIL 20150731 113322

I don't need the exact same format, just something close enough so I can transform it into the example.

Uwe
  • 41,420
  • 11
  • 90
  • 134
Variax
  • 1,057
  • 2
  • 11
  • 16

1 Answers1

6

We provide two approaches to parsing the XML. The first (performing a triple iteration over experiment/sample/test) would likely run faster but the second (using a single loop over the test nodes and at each test node reaching back up through the tree to grab its ancestors) has simpler code.

1) Using Lines in the Note at the end we implement a triple xpathApply/xpathSApply iteration over experiment/sample/test nodes. e, s and t represent the current such node, respectively.

library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//experiment", function(e) {
  data.frame(experiment = xmlAttrs(e)[["name"]],
       technician = xmlValue(e[["technician"]]),
       location = xmlValue(e[["location"]]),
       temp = xmlValue(e[["temp"]]),
       runtype = xmlValue(e[["runtype"]]),
       t(do.call(cbind, xpathApply(e, "sample", function(s) {
            sample <- xmlAttrs(s)[["id"]]
            xpathSApply(s, "test", function(t) {
                   c(sample = sample,
                        test = xmlAttrs(t)[["name"]],
                        order = xmlAttrs(t)[["order"]],
                        code = xmlValue(t[["code"]]),
                        validuntil = xmlValue(t[["validuntil"]]),
                        baseline = xmlValue(t["meas"][[1]]),
                        std = xmlValue(t["meas"][[2]]),
                        data = xmlValue(t["meas"][[3]]),
                        calc = xmlValue(t[["calc"]]),
                        result = xmlValue(t[["result"]])
             )})}))),
       date = xmlAttrs(e)[["date"]],
       time = xmlAttrs(e)[["time"]]
)}))

giving:

  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322

2) This is an alternate approach in which we loop only over the test nodes and then reach upward into the parent and grandparent to get the corresponding sample and experiement info.

library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

do.call("rbind", xpathApply(doc, "//test", function(t) { # t is test node
        s <- xmlParent(t) # s is sample node
        e <- xmlParent(s) # e is experiment node
        data.frame(experiment = xmlAttrs(e)[["name"]],
          technician = xmlValue(e[["technician"]]),
          location = xmlValue(e[["location"]]),
          temp = xmlValue(e[["temp"]]),
          runtype = xmlValue(e[["runtype"]]),
          sample = xmlAttrs(s)[["id"]],
          test = xmlAttrs(t)[["name"]],
          order = xmlAttrs(t)[["order"]],
          code = xmlValue(t[["code"]]),
          validuntil = xmlValue(t[["validuntil"]]),
          baseline = xmlValue(t["meas"][[1]]),
          std = xmlValue(t["meas"][[2]]),
          data = xmlValue(t["meas"][[3]]),
          calc = xmlValue(t[["calc"]]),
          result = xmlValue(t[["result"]]),
          date = xmlAttrs(e)[["date"]],
          time = xmlAttrs(e)[["time"]]
       )
}))

giving:

  experiment technician location temp   runtype  sample   test order
1     abc123     "John"     "CO" 21.3 "routine"    2323 laslum     3
2     abc123     "John"     "CO" 21.3 "routine"    2323    atr     1
3     abc123     "John"     "CO" 21.3 "routine" 8979237  absat     2
         code validuntil baseline    std    data   calc    result     date
1   "LL18179"  "2016/08"   0.3248 5.4389  6.5980 1.2131      "OK" 20150731
2 "ATR150607"  "2017/05"   0.0673 4.9721 10.3851 2.0886 "Warning" 20150731
3   "AA09453"  "2016/03"   0.0117 5.6012  1.1431 0.2041    "FAIL" 20150731
    time
1 113322
2 113322
3 113322

Note 1:

As an aside if you read the input XML file, SEWL.xml, into Excel it will do a reasonable job of putting it into a tabular format although some further processing would be needed to get it into precisely into the form in the question.

Note 2:

The input Lines as an R object is:

Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<experiment name="abc123" date="20150731" time="113322">
    <technician>"John"</technician>
    <location>"CO"</location>
    <temp scale="celsius">21.3</temp>
    <runtype>"routine"</runtype>
    <sample id="2323">
        <test name="laslum" order="3">
            <code>"LL18179"</code>
            <validuntil>"2016/08"</validuntil>
            <meas name="baseline">0.3248</meas>
            <meas name="std">5.4389</meas>
            <meas name="data">6.5980</meas>
            <calc>1.2131</calc>
            <result>"OK"</result>
        </test>
        <test name="atr" order="1">
            <code>"ATR150607"</code>
            <validuntil>"2017/05"</validuntil>
            <meas name="baseline">0.0673</meas>
            <meas name="std">4.9721</meas>
            <meas name="data">10.3851</meas>
            <calc>2.0886</calc>
            <result>"Warning"</result>
        </test>
    </sample>
    <sample id="8979237">
        <test name="absat" order="2">
            <code>"AA09453"</code>
            <validuntil>"2016/03"</validuntil>
            <meas name="baseline">0.0117</meas>
            <meas name="std">5.6012</meas>
            <meas name="data">1.1431</meas>
            <calc>0.2041</calc>
            <result>"FAIL"</result>
        </test>
    </sample>
</experiment>'
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • This seems to be in the right direction. How can I replace the Lines object by a call to the actual XML file? – Variax Nov 24 '16 at 16:05
  • Remove `asText=TRUE` and use the filename in place of `Lines`. For showing on SO we use the character string input to keep the presentation self-contained. – G. Grothendieck Nov 24 '16 at 16:09
  • Have added a second approach. First would likely be faster but second has simpler code. – G. Grothendieck Nov 24 '16 at 23:41