I'm trying to run some scraping where the action I take on a node is conditional on the contents of the node.

This should be a minimal example:

XML =
'<td class="id-tag">
    <span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>'

page = read_html(XML)

Basically, I want to extract html_attr(x, "title") if <span> exists, otherwise just get html_text(x).

Code to do the first is:

page %>% html_nodes(xpath = '//td[@class="id-tag"]/span') %>% html_attr("title")
# [1] "Really Long Text"

Code to do the second is:

page %>% html_nodes(xpath = '//td[@class="id-tag"]') %>% html_text
# [1] "\n    Really L...\n" "Short"  

The real problem is that the html_attr approach doesn't give me an NA or anything similar for the nodes that don't match (even if I let the xpath just be '//td[@class="id-tag"]' first, to be sure I've narrowed down to only the relevant nodes). This destroys the order -- I can't tell automatically whether the original structure had "Really Long Text" at the first or the second node.

(I thought of doing a join, but the mapping between the abbreviated text and the full text is not one-to-one/invertible).

This seems to be on the right path -- an if/else construction within the xpath -- but doesn't work.

Ideally I'd get the output:

# [1] "Really Long Text" "Short" 
MichaelChirico

2 Answers


Based on "R Conditional evaluation when using the pipe operator %>%", you can do something like

page %>% 
   html_nodes(xpath='//td[@class="id-tag"]') %>% 
   {ifelse(is.na(html_node(.,xpath="span")), 
           html_text(.),
           {html_node(.,xpath="span") %>% html_attr("title")}
   )}

I think it is possibly simpler to discard the pipe and save some of the objects created along the way

nodes <- html_nodes(page, xpath='//td[@class="id-tag"]')
text <- html_text(nodes)
title <- html_attr(html_node(nodes,xpath='span'),"title")
value <- ifelse(is.na(html_node(nodes, xpath="span")), text, title)
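
As a possible simplification (a sketch, not tested beyond the toy input from the question): html_attr() already returns NA for the cells where html_node() found no span, so the NA check can be made on the extracted title itself. One assumption to be aware of: a span that exists but has no title attribute would also fall back to the cell text here.

```r
library(rvest)

# Same toy input as in the question
page <- read_html('<td class="id-tag">
    <span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>')

nodes <- html_nodes(page, xpath = '//td[@class="id-tag"]')

# html_node() keeps one result per input node, so a missing <span> yields NA here
title <- html_attr(html_node(nodes, xpath = "span"), "title")

# Fall back to the cell text wherever no title was found
value <- ifelse(is.na(title), html_text(nodes), title)
value
```

Because every vector has one entry per td, the original order of the cells is preserved.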

An XPath-only approach might be

page %>% 
 html_nodes(xpath='//td[@class="id-tag"]/span/@title|//td[@class="id-tag"][not(.//span)]') %>%
 html_text()
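
To see what each half of that union contributes, here is a small sketch that runs the two branches separately on the toy input from the question. The variable names long and short are only for illustration.

```r
library(rvest)

# Same toy input as in the question
page <- read_html('<td class="id-tag">
    <span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>')

# Branch 1: the title attribute of each <span> directly under a matching <td>
long <- html_nodes(page, xpath = '//td[@class="id-tag"]/span/@title')
html_text(long)

# Branch 2: the predicate [not(.//span)] keeps only <td> nodes
# that have no <span> descendant
short <- html_nodes(page, xpath = '//td[@class="id-tag"][not(.//span)]')
html_text(short)
```

The | operator merges both node sets, and the combined result comes back in document order, which is why the positions line up with the original cells.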
mnel
  • I think the fact that `html_node` returns `NA` for branches that don't have any match is what I was most in need of here, thanks! – MichaelChirico Jan 18 '17 at 01:12
  • And that joint `xpath` is a beast! Could you elaborate on what the `[not(.//span)]` bit is doing? I mainly don't understand the surrounding `[]`, I think? – MichaelChirico Jan 18 '17 at 01:28
  • @MichaelChirico, it was based on http://stackoverflow.com/questions/862239/xpath-get-node-with-no-child-of-specific-type. Without the wrapping `[]` you get a syntax error: the brackets form an XPath predicate, so this part looks for `td` nodes with class = id-tag that do not contain a "span" node. – mnel Jan 18 '17 at 01:33

An alternate approach:

library(tidyverse)
library(rvest)
library(xml2)  # provides xml_find_first(); recent rvest versions may not attach xml2

XML <- '
<td class="id-tag">
    <span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>
'

pg <- read_html(XML)

html_nodes(pg, "td[class='id-tag']") %>%
  map_chr(function(x) {
    if (xml_find_first(x, "boolean(.//span)")) {
      x <- html_nodes(x, xpath=".//span/@title")
    }
    html_text(x)
  })

## [1] "Really Long Text" "Short"
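
A variant of the same idea without the tidyverse dependency, offered as a sketch: it assumes xml2's xml_find_lgl() for the boolean XPath test and base vapply() in place of map_chr(). The names cells and out are only for illustration.

```r
library(rvest)
library(xml2)

# Same toy input as above
pg <- read_html('
<td class="id-tag">
    <span title="Really Long Text">Really L...</span>
</td>
<td class="id-tag">Short</td>
')

cells <- html_nodes(pg, "td[class='id-tag']")

# One string per cell: the span's title if a span exists, else the cell text
out <- vapply(cells, function(cell) {
  if (xml_find_lgl(cell, "boolean(.//span)")) {
    xml_attr(xml_find_first(cell, ".//span"), "title")
  } else {
    xml_text(cell)
  }
}, character(1))
out
```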
hrbrmstr