I have a case very similar to this one (Load XML to Dataframe in R with parent node attributes): I’m trying to convert XML to a data frame, but I can’t handle the `div1` nodes in which the “sp” and “l” nodes are missing. (I don’t care about node “m”.) Suppose my XML looks like this:

<text>
<body>
<div1 type="scene1” n="1">
<sp who="fau">
    <l c="30" a="Settle thy studies"/>
    <m x="40" b="To sound the depth of that thou wilt profess"/>
</sp>
<sp who="eang">
        <m x="105" b="Go forward, Faustus, in that famous art"/>
</sp>
</div1>
<div1 type="scene2” n="2">
<sp who="fau">
    <l c="31" a="Settle thy"/>
    <m x="50" b="To sound the depth of"/>
</sp>
<sp who="fau">
    <l c="32" a="Settle"/>
    <m x="60" b="To sound the"/>
</sp>
<sp who="fau">
    <l c="33" a="Settle thy studies, Faustus"/>
    <m x="40" b="To sound the depth of that thou wilt"/>
</sp>
</div1>
<div1 type="scene3” n="3">
</div1>
<div1 type="scene4” n="4">
</div1>
<div1 type="scene5” n="5">
</div1>
</body>
</text>

This is what I would like to obtain:

n   type      lc     la
1   scene1    30     Settle thy studies
2   scene2    31     Settle thy
2   scene2    32     Settle
2   scene2    33     Settle thy studies, Faustus
3   scene3    NA     NA      
4   scene4    NA     NA
5   scene5    NA     NA

I’ve tried this:

doc = xmlTreeParse("play.xml", useInternal = TRUE)

bodyToDF <- function(x){
n <- xmlGetAttr(x, "n")
type <- xmlGetAttr(x, "type")
sp <- xpathApply(x, 'sp', function(sp) {
if(is.null(sp)) {
    lc <- NA
    la <- NA
}
lc <- xpathSApply(sp, 'l', function(l) { xmlGetAttr(l,"c")})
la = xpathSApply(sp, 'l', function(l) { xmlValue(l,"a")})
data.frame(n, type, lc, la)
})
do.call(rbind, sp)  
}


res <- xpathApply(doc, '//div1', bodyToDF)

but it doesn’t work:

Error in data.frame(n, type, lc, la) : 
arguments imply differing number of rows: 1, 0

and also this:

div1 = sapply(c("n","type"), function(x) xpathSApply(doc, "//div1", xmlGetAttr, x), simplify=FALSE)

l = sapply(c("c","a"), function(x) xpathSApply(doc, "//l", xmlGetAttr, x), simplify=FALSE)

df <- data.frame(div1,l)

but I can’t seem to get the correct match between the nodes and df rows:

Error in data.frame(div1, l) : 
arguments imply differing number of rows: 5, 4

Any ideas? Thank you.

cmvdi01
  • Flick's solution may help http://stackoverflow.com/questions/25346430/dealing-with-empty-xml-nodes-in-r – Hack-R Sep 19 '16 at 09:10
  • @Hack-R Thanks for the pointer, but it doesn’t seem to work either:
    `do.call(rbind, lapply(xmlChildren(xmlRoot(doc)), function(x) { data.frame( n=xmlGetNodeAttr(x, "./div1","n",NA), type=xmlGetNodeAttr(x, "./div1","type",NA), lc=xmlGetNodeAttr(x, "./sp/l","c",NA), la=xmlGetNodeAttr(x, "./sp/l","a",NA) ) }))`
    which returns:
           n   type lc la
    body.1 1 scene1 NA NA
    body.2 2 scene2 NA NA
    body.3 3 scene3 NA NA
    body.4 4 scene4 NA NA
    body.5 5 scene5 NA NA
    – cmvdi01 Sep 19 '16 at 11:03

1 Answer

Your pasted XML has problems (some of the double quotes are typographic quotes rather than plain ASCII ones), so here is a corrected version for others to use:

txt <- '<text>
    <body>
        <div1 type="scene1" n="1">
            <sp who="fau">
                <l c="30" a="Settle thy studies"/>
                <m x="40" b="To sound the depth of that thou wilt profess"/>
            </sp>
            <sp who="eang">
                <m x="105" b="Go forward, Faustus, in that famous art"/>
            </sp>
        </div1>
        <div1 type="scene2" n="2">
            <sp who="fau">
                <l c="31" a="Settle thy"/>
                <m x="50" b="To sound the depth of"/>
            </sp>
            <sp who="fau">
                <l c="32" a="Settle"/>
                <m x="60" b="To sound the"/>
            </sp>
            <sp who="fau">
                <l c="33" a="Settle thy studies, Faustus"/>
                <m x="40" b="To sound the depth of that thou wilt"/>
            </sp>
        </div1>
        <div1 type="scene3" n="3"></div1>
        <div1 type="scene4" n="4"></div1>
        <div1 type="scene5" n="5"></div1>
    </body>
</text>'

The following uses xml2 and can be translated back to the XML package's syntax if truly necessary. The idea is the same as in other answers: inspect each "scene" node in turn and handle the missing-value case when a scene has no `l` nodes:

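library(xml2)
library(purrr)
library(dplyr)

doc <- read_xml(txt)

# Visit every scene node; each call returns a tibble and map_df() row-binds them
xml_find_all(doc, ".//*[contains(@type, 'scene')]") %>% 
  map_df(function(x) {

    scene <- xml_attr(x, "type")
    num <- xml_attr(x, "n")

    # All <l> descendants of this scene (possibly none)
    lines <- xml_find_all(x, ".//l")

    if (length(lines) == 0) {
      # Empty scene: emit one placeholder row with character NAs
      tibble(n=num, scene=scene, lc=NA_character_, la=NA_character_)
    } else {
      map_df(lines, function(y) {
        # xml_attr() already returns NA when the attribute is absent
        lc <- xml_attr(y, "c")
        la <- xml_attr(y, "a")
        tibble(n=num, scene=scene, lc=lc, la=la)
      })
    }

  })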

That gives the desired output (the `type` attribute ends up in a column named `scene`):

## # A tibble: 7 × 4
##       n  scene    lc                          la
##   <chr>  <chr> <chr>                       <chr>
## 1     1 scene1    30          Settle thy studies
## 2     2 scene2    31                  Settle thy
## 3     2 scene2    32                      Settle
## 4     2 scene2    33 Settle thy studies, Faustus
## 5     3 scene3  <NA>                        <NA>
## 6     4 scene4  <NA>                        <NA>
## 7     5 scene5  <NA>                        <NA>
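
For anyone who prefers to stay with the XML package used in the question, the same per-scene idea could be sketched roughly as below. This is only an illustration under stated assumptions, not part of the original answer: `xml_txt`, `scene_rows` and `res` are made-up names, and the sample string is a shortened stand-in for the play. It relies on `getNodeSet()` accepting a node as its starting point and on the `default` argument of `xmlGetAttr()`.

library(XML)

# Shortened sample input (hypothetical stand-in for play.xml)
xml_txt <- '<text><body>
  <div1 type="scene1" n="1">
    <sp who="fau"><l c="30" a="Settle thy studies"/><m x="40" b="To sound the depth"/></sp>
    <sp who="eang"><m x="105" b="Go forward, Faustus"/></sp>
  </div1>
  <div1 type="scene2" n="2">
    <sp who="fau"><l c="31" a="Settle thy"/></sp>
    <sp who="fau"><l c="32" a="Settle"/></sp>
  </div1>
  <div1 type="scene3" n="3"></div1>
</body></text>'

doc <- xmlParse(xml_txt, asText = TRUE)

# Build one data frame per <div1>, with a placeholder row when it has no <l>
scene_rows <- lapply(getNodeSet(doc, "//div1"), function(scene) {
  n    <- xmlGetAttr(scene, "n")
  type <- xmlGetAttr(scene, "type")
  lines <- getNodeSet(scene, ".//l")   # <l> descendants of this scene only

  if (length(lines) == 0) {
    data.frame(n = n, type = type, lc = NA_character_, la = NA_character_,
               stringsAsFactors = FALSE)
  } else {
    data.frame(n = n, type = type,
               lc = sapply(lines, xmlGetAttr, "c", NA_character_),
               la = sapply(lines, xmlGetAttr, "a", NA_character_),
               stringsAsFactors = FALSE)
  }
})

res <- do.call(rbind, scene_rows)
print(res)

Checking `length(lines)` before calling `data.frame()` is what avoids the "differing number of rows: 1, 0" error from the question: an empty scene contributes an explicit row of NAs instead of zero-length columns.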
hrbrmstr