Parse XML based on attributes and text values of related nodes

Question

I have used the XML package to parse both HTML and XML before, and have a rudimentary grasp of xPath. However I've been asked to consider XML data where the important bits are determined by a combination of text and attributes of the elements themselves, as well as those in related nodes. I've never done that. For example

[updated example, slightly more expansive]

<Catalogue>
<Bookstore id="ID910705541">
  <location>foo bar</location>
  <books>
    <book category="A" id="1">
        <title>Alpha</title>
        <author ref="1">Matthew</author>
        <author>Mark</author>
        <author>Luke</author>
        <author ref="2">John</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Beta</title>
        <author ref="1">Huey</author>
        <author>Duey</author>
        <author>Louie</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Gamma</title>
        <author ref="1">Tweedle Dee</author>
        <author ref="2">Tweedle Dum</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
  </Bookstore> 
<Bookstore id="ID910700051">
  <location>foo</location>
  <books>
    <book category="A" id="1">
        <title>Happy</title>
        <author>Dopey</author>
        <author>Bashful</author>
        <author>Doc</author>
        <author ref="1">Grumpy</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Ni</title>
        <author ref="1">John</author>
        <author ref="2">Paul</author>
        <author ref="3">George</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>San</title>
        <author ref="1">Ringo</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
<Bookstore id="ID910715717">
    <location>bar</location>
  <books>
    <book category="A" id="1">
        <title>Un</title>
        <author ref="1">Winkin</author>
        <author>Blinkin</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Deux</title>
        <author>Nod</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Trois</title>
        <author>Manny</author>
        <author>Moe</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
</Catalogue>

I would like to extract all author names where: 1) the location element has a text value that contains "NY" 2) the author element does NOT contain a "ref" attribute; that is where ref is not present in the author tag

I will ultimately need to concatenate the extracted authors together within a given bookstore, so that my resulting data frame is one row per store. I'd like to preserve the bookstore id as an additional field in my data frame so that I can uniqely reference each store. Since only the first bokstore is in NY, results from this simple example would look something like:

1 Jane Smith John Doe Karl Pearson William Gosset

If another bookstore contained "NY" in its location, it would comprise the second row, and so forth.

Am I asking too much of R to parser under these convoluted conditions?

user1609452 · Accepted Answer · 2013-04-26T19:39:34.163

require(XML)

xdata <- xmlParse(apptext)
xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')
#[[1]]
#<author>Jane Smith</author> 

#[[2]]
#<author>John Doe</author> 

#[[3]]
#<author>Karl Pearson</author> 

#[[4]]
#<author>William Gosset</author>

Breakdown:

Get all locations containing 'NY'

//*/location[text()[contains(.,"NY")]]

Get the books sibling of these nodes

/following-sibling::books

from these notes get all authors without a ref attribute

/.//author[not(@ref)]

Use xmlValue if you want the text:

> xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
[1] "Jane Smith"     "John Doe"       "Karl Pearson"   "William Gosset"

UPDATE:

child.nodes <- xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')

ans.func<-function(x){
    xpathSApply(x,'.//ancestor::bookstore[@id]/@id')
}

sapply(child.nodes,ans.func)
# id  id  id  id 
#"1" "1" "1" "1"

UPDATE 2:

With your changed data

xdata <- '<Catalogue>
<Bookstore id="ID910705541">
  <location>foo bar</location>
  <books>
    <book category="A" id="1">
        <title>Alpha</title>
        <author ref="1">Matthew</author>
        <author>Mark</author>
        <author>Luke</author>
        <author ref="2">John</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Beta</title>
        <author ref="1">Huey</author>
        <author>Duey</author>
        <author>Louie</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Gamma</title>
        <author ref="1">Tweedle Dee</author>
        <author ref="2">Tweedle Dum</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
  </Bookstore> 
<Bookstore id="ID910700051">
  <location>foo</location>
  <books>
    <book category="A" id="1">
        <title>Happy</title>
        <author>Dopey</author>
        <author>Bashful</author>
        <author>Doc</author>
        <author ref="1">Grumpy</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Ni</title>
        <author ref="1">John</author>
        <author ref="2">Paul</author>
        <author ref="3">George</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>San</title>
        <author ref="1">Ringo</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
<Bookstore id="ID910715717">
    <location>bar</location>
  <books>
    <book category="A" id="1">
        <title>Un</title>
        <author ref="1">Winkin</author>
        <author>Blinkin</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="B" id="10">
        <title>Deux</title>
        <author>Nod</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="D" id="100">
        <title>Trois</title>
        <author>Manny</author>
        <author>Moe</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
  </books>
 </Bookstore> 
</Catalogue>'

Note previously you had bookstore now Bookstore. NY is gone so I have used foo

require(XML)
xdata <- xmlParse(xdata)
child.nodes <- getNodeSet(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]')

ans.func<-function(x){
  xpathSApply(x,'.//ancestor::Bookstore[@id]/@id')
}

sapply(child.nodes,ans.func)
#           id            id            id            id            id 
#"ID910705541" "ID910705541" "ID910705541" "ID910705541" "ID910700051" 
#           id            id 
#"ID910700051" "ID910700051"

xpathSApply(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
# [1] "Mark"    "Luke"    "Duey"    "Louie"   "Dopey"   "Bashful" "Doc"

Oh man, that's great and I've adapted it to my data. I get returned a vector of author names, which I can then concatenate. Problem is, I need to concatenate only those that are associated with their respective bookstore ID. I thought I could just walk the xPath back up to get the ancestor::bookstore and then return @id to another vector. But doing so returns one ID per bookstore, NOT one ID per bookstore per qualifying book. I had expected the latter, which woudl return a vector of the same length as my vector containing author names. Any advice? — Amw 5G, Apr 26 '13 at 15:21
Its a little hard to comment without a concrete example. You could maybe take each node satisfying your conditions and walk back to its ancestor (bookstore) and retrieve the id. I have given an example maybe it will work with your full data. — user1609452, Apr 26 '13 at 15:38
Updated with more expansive data; the question is probably a little out of synch with the new data, so I hope that's not a sticking point. — Amw 5G, Apr 26 '13 at 19:41
Fantastic! And dealing with the discrepancies I introduced was the icing on the cake. Much obliged! — Amw 5G, Apr 26 '13 at 19:48

Parse XML based on attributes and text values of related nodes

1 Answers1

Linked