I'm having difficulty extracting a specific piece of text from the source code of a website. I can extract the entire list, but I only need one country: Argentina in this case.

The source code is:

<div class="article-content">
                                    <div class="RichTextElement">
                                        <div><h3 style="background-color: transparent; color: rgb(51, 51, 51);"><span style="font-weight: normal; font-family: Verdana;">Afghanistan - </span><span style="background-color: transparent; font-weight: normal; font-family: Verdana;"><a title="Tax Authority in Afganistan" href="http://mof.gov.af/en" style="background-color: transparent; color: rgb(51, 51, 51);">Ministry of Finance</a><br />Argentina - <a title="Tax Authority in Argentina" href="http://www.afip.gob.ar/english/" style="background-color: transparent; color: rgb(51, 51, 51);">Federal Administration of Public Revenues</a><br />

I only need "Federal Administration of Public Revenues" and "http://www.afip.gob.ar/english/"

So far I have:

argurl <- readLines("http://oceantax.co.uk/links/tax-authorities-worldwide.html")

strong <-as.matrix(grep("<br//>",argurl))
strong1starts <- grep("<br //>Argentina",argurl)
rowst1st <- which(grepl(strong1starts, strong))
strong1ends <- strong[rowst1st + 1 ,]-1
data1 <- as.matrix(argurl[strong1starts:strong1ends])
David Robinson
    [Don't use regular expressions to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags): instead, take a look at the [rvest](https://github.com/hadley/rvest) package for parsing HTML in R – David Robinson Feb 24 '15 at 18:34

1 Answer

library(rvest)

url <- "http://oceantax.co.uk/links/tax-authorities-worldwide.html"
pg <- read_html(url)  # read_html() supersedes the older html(), which current rvest no longer provides

# get the country node

# XPath version
country <- pg %>% html_nodes(xpath="//a[contains(@title, 'Argentina')]")

# CSS Selector version
country <- pg %>% html_nodes("a[title~=Argentina]")

# use one of the above then:

country %>% html_text()       # get the text of the anchor
country %>% html_attr("href") # get the URL of the anchor
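
The same selectors can be tried offline against the markup quoted in the question, which is handy if the live page is unavailable or has changed. The following is only a minimal sketch: it assumes a recent rvest is installed, `snippet` is a trimmed copy of the question's HTML with the inline styles dropped, and `snippet` and `result` are illustrative names that do not appear in the original post.

```r
library(rvest)

# Trimmed copy of the markup quoted in the question (inline styles dropped).
# This is an assumption for illustration; the live page may differ.
snippet <- '<div class="article-content"><div class="RichTextElement"><div><h3>
<span>Afghanistan - </span><span>
<a title="Tax Authority in Afganistan" href="http://mof.gov.af/en">Ministry of Finance</a><br />
Argentina - <a title="Tax Authority in Argentina" href="http://www.afip.gob.ar/english/">Federal Administration of Public Revenues</a><br />
</span></h3></div></div></div>'

# read_html() accepts a literal HTML string as well as a URL or file path
pg <- read_html(snippet)

# Select the anchor whose title attribute mentions Argentina
country <- pg %>% html_nodes(xpath = "//a[contains(@title, 'Argentina')]")

# Collect the link text and the target URL side by side
result <- data.frame(
  name = html_text(country),
  url  = html_attr(country, "href"),
  stringsAsFactors = FALSE
)
print(result)
```

Because the XPath filters on the `title` attribute and not on line position, it should keep working if other countries are added before or after the Argentina entry.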
hrbrmstr