Web scraping in R: a strange span class associated with an email

Question

I am trying to scrape email addresses from the following HTML line:

<p><span>E-mail address:</span><a title="&#xA; Link to email address&#xA;  
  "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>

Now, I don't know why this span class is proving so difficult for me.

I am using R but am not really sure how to go about solving this issue.

email <- html_text(
    html_nodes(doc, ?????))

Here is the scraper I am currently using:

library(rvest)

scrape <- function(x) {
  doc <- read_html(x)

  # Pull the text of each field by CSS class or XPath
  author      <- html_text(html_nodes(doc, '.art_authors'))
  year        <- html_text(html_nodes(doc, '.year'))
  journalName <- html_text(html_nodes(doc, '.journalName'))
  art_title   <- html_text(html_nodes(doc, '.art_title'))
  volume      <- html_text(html_nodes(doc, '.volume'))
  page        <- html_text(html_nodes(doc, '.page'))
  email       <- html_text(html_nodes(doc, xpath = "//a[@class = 'email']"))
  email2      <- html_text(html_nodes(doc, xpath = "//a[@class = 'ext-link']"))

  # Missing fields become NA; otherwise the first match is kept
  Author       <- ifelse(length(author) == 0, NA, author)
  Year         <- ifelse(length(year) == 0, NA, year)
  Journal_Name <- ifelse(length(journalName) == 0, NA, journalName)
  Art_Title    <- ifelse(length(art_title) == 0, NA, art_title)
  Volume       <- ifelse(length(volume) == 0, NA, volume)
  Page         <- ifelse(length(page) == 0, NA, page)

  # Multiple addresses are joined with " ; "
  Email  <- ifelse(length(email) == 0, NA, paste(email, collapse = " ; "))
  Email2 <- ifelse(length(email2) == 0, NA, paste(email2, collapse = " ; "))

  # Return a one-row matrix for this page
  cbind(Author, Year, Journal_Name, Art_Title, Volume, Page, Email, Email2)
}

Please make this [a reproducible question](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). What have you tried? What's going wrong? What's strange about the `span` class, since the `span` element in this HTML doesn't have a class? — camille, Aug 09 '18 at 18:19
Yes, that's my problem: the span does not have a class, so I really don't know how to pull it. — JWH2006, Aug 09 '18 at 18:20
The way you get information from an HTML element is going to depend on the code you've written—such as whether you're using XPath vs CSS selectors. If you don't include your code, we don't know how you're trying to do this. — camille, Aug 09 '18 at 18:22

score 3 · Answer 1 · answered Aug 09 '18 at 19:51

You can also select the 'a' tag and use rvest::html_attr() to extract its href attribute, then strip the "mailto:" prefix with stringr::str_remove():

library(rvest)
library(stringr)  # provides str_remove()
doc <- read_html('<p><span>E-mail address:</span><a title="&#xA; Link to email 
address&#xA; "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')

doc %>% html_node('a') %>% html_attr('href') %>% str_remove('mailto:')

## > doc %>% html_node('a') %>% html_attr('href') %>% str_remove('mailto:')
## [1] "joeschmoe123@goodtimes.com"
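
If a page carries several links, the same idea extends from html_node() to html_nodes(). The following is only a sketch under stated assumptions: the snippet, the second address, and its ?subject= query are made up for illustration, and base R's sub() is used instead of stringr so that nothing beyond rvest is needed.

```r
library(rvest)

# Hypothetical snippet with two mailto links and one ordinary link; the
# second address and its ?subject= query are invented for illustration.
html <- '<p><a href="mailto:joeschmoe123@goodtimes.com">Joe</a>
<a href="mailto:jane@example.com?subject=Hello">Jane</a>
<a href="https://example.com">Not an email</a></p>'
doc <- read_html(html)

# Collect every href, then keep only those that start with "mailto:".
hrefs <- html_attr(html_nodes(doc, "a"), "href")
mailto <- hrefs[!is.na(hrefs) & startsWith(hrefs, "mailto:")]

# Drop the "mailto:" prefix and anything after a "?" (subject, cc, ...).
emails <- sub("\\?.*$", "", sub("^mailto:", "", mailto))
emails
```

Reading the href rather than the link text helps when the visible text is a name or a label such as "Contact" instead of the address itself.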

score 1 · Answer 2 · answered Aug 09 '18 at 19:21

I guess the easiest way would be to choose <a> tags whose href attribute starts with "mailto:". Here's how you would do that:

library(xml2)
library(rvest)
doc <- read_html('<p><span>E-mail address:</span><a title="&#xA; Link to email address&#xA;  
                   "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')

html_nodes(doc, xpath='//a[starts-with(@href,"mailto:")]') %>% html_text()
# [1] "joeschmoe123@goodtimes.com"
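
The same selection can probably be written as a CSS selector and wrapped so that it follows the conventions of the scraper in the question (NA when nothing is found, several addresses joined with " ; "). This is a minimal sketch: extract_emails() is a hypothetical helper name, not an rvest function, and the second test page is made up.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns NA when the page has no
# mailto links, otherwise the link texts joined with " ; ".
extract_emails <- function(doc) {
  # CSS equivalent of the XPath above: <a> whose href starts with "mailto:".
  found <- html_text(html_nodes(doc, 'a[href^="mailto:"]'), trim = TRUE)
  if (length(found) == 0) NA_character_ else paste(found, collapse = " ; ")
}

with_email <- read_html(
  '<p><span>E-mail address:</span><a href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>'
)
# Made-up page without any mailto link.
without_email <- read_html('<p><span>E-mail address:</span> none listed</p>')

extract_emails(with_email)
extract_emails(without_email)
```

Because the selector keys on the href rather than on a class, it does not matter that neither the span nor the link in the question has a class attribute.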
