I have the following XML:

parsed <- 
<div class="Matches">
<div class="Match">
<div class="MatchType">Singles Match</div>
<div class="MatchResults">
<a href="?id=2&amp;nr=11408&amp;name=Jason+Jordan">Jason Jordan</a> (w/<a href="?id=2&amp;nr=2250&amp;name=Seth+Rollins">Seth Rollins</a>) defeats <a href="?id=2&amp;nr=257&amp;name=Cesaro">Cesaro</a> (w/<a href="?id=2&amp;nr=2641&amp;name=Sheamus">Sheamus</a>) (13:15)</div>
</div>
<div class="Match">
<div class="MatchRecommended">[<span class="TextHighlight"><a href="?id=111&amp;nr=9099">Recommended, Meltzer: ***3/4, CAGEMATCH users: <span class=" Rating Color7">7.17</span></a></span>]</div>
<div class="MatchType">
<a href="?id=5&amp;nr=16">WWE Intercontinental Title</a> Match</div>
<div class="MatchResults">
<a href="?id=2&amp;nr=9967&amp;name=Roman+Reigns">Roman Reigns</a> (c) defeats <a href="?id=2&amp;nr=676&amp;name=Samoa+Joe">Samoa Joe</a> (24:50)            </div>

I am trying to pull out the section for class "MatchRecommended" and have it list "NA" for those children that do not have class "MatchRecommended".

I think I have to use xpathSApply along with xmlChildren to extract the relevant data but with my code below, I only get NAs:

xpathSApply(parsed, "//*[(@class = 'Match')]", function(x) ifelse(is.null(xmlChildren(x)$a), NA, xmlAttrs(xmlChildren(x)$a, 'href')))
[1] NA NA NA NA NA NA NA

Ideally, the result would look like:

[1] NA "Recommended, Meltzer: ***3/4, CAGEMATCH users: 7.17"

Any thoughts on how to do this?

1 Answer

I would get the Match nodes first and then query each node separately, using a leading "." in the XPath so that it is relative to the current node.

library(XML)
parsed <- xmlParse('<div...rest of your XML plus two missing div tags')
nodes <- getNodeSet(parsed, "//div[(@class = 'Match')]")
x <- lapply(nodes, xpathSApply, ".//div[(@class = 'MatchRecommended')]", xmlValue, trim=TRUE)
x

[[1]]
list()

[[2]]
[1] "[Recommended, Meltzer: ***3/4, CAGEMATCH users: 7.17]"

There are a few ways to replace that empty list with NA.

sapply(x, function(y) ifelse(length(y)==0, NA, y))
[1] NA  "[Recommended, Meltzer: ***3/4, CAGEMATCH users: 7.17]"
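As one more option, here is a minimal base R sketch that uses vapply so the result is always a character vector. The list x is rebuilt by hand to mirror the output shown above, so this runs without parsing anything; the name out is only illustrative.

```r
# Hand-built copy of the list returned by lapply() above:
# an empty list for the match without a recommendation, a string for the other.
x <- list(
  list(),
  "[Recommended, Meltzer: ***3/4, CAGEMATCH users: 7.17]"
)

# vapply() enforces one character value per element;
# empty results become NA_character_.
out <- vapply(
  x,
  function(y) if (length(y) == 0) NA_character_ else as.character(y[[1]]),
  character(1)
)
out
```

Unlike sapply with ifelse, this fails loudly if an element is ever not a single string, which may or may not be what you want.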

You can also use the xml2 package: xml_find_first returns a missing node when nothing matches, and xml_text turns that into NA rather than an empty list.

library(xml2)
parsed <- read_xml('<div...')
nodes <-  xml_find_all(parsed, "//div[(@class = 'Match')]")  
sapply(nodes, function(x) xml_text( xml_find_first(x, ".//div[(@class = 'MatchRecommended')]"), trim=TRUE)) 
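Since xml_find_first also accepts a whole node set and returns one result per input node, the sapply can probably be dropped. The sketch below assumes a trimmed-down stand-in for the question's markup (the sample string and the names doc, matches, rec and result are illustrative, not from the original). It also targets the a element inside MatchRecommended, which should leave out the surrounding square brackets and so come closer to the output the question asks for.

```r
library(xml2)

# Trimmed-down, hypothetical stand-in for the question's XML:
# the first Match has no MatchRecommended child, the second one does.
doc <- read_xml('<div class="Matches">
  <div class="Match">
    <div class="MatchType">Singles Match</div>
  </div>
  <div class="Match">
    <div class="MatchRecommended">[<span class="TextHighlight"><a href="?id=111&amp;nr=9099">Recommended, Meltzer: ***3/4, CAGEMATCH users: <span class=" Rating Color7">7.17</span></a></span>]</div>
    <div class="MatchType">Title Match</div>
  </div>
</div>')

# One node per match.
matches <- xml_find_all(doc, "//div[@class = 'Match']")

# Relative XPath (leading "."), applied to every match at once;
# matches without a hit yield a missing node.
rec <- xml_find_first(matches, ".//div[@class = 'MatchRecommended']//a")

# Missing nodes become NA; the link text excludes the "[" and "]".
result <- xml_text(rec, trim = TRUE)
result
```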
Chris S.