0

I am trying to scrape email addresses from the following HTML:

<p><span>E-mail address:</span><a title="&#xA; Link to email address&#xA;  
  "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>

Now, I don't know why this span class is proving so difficult for me.

I am using R but am not really sure how to go about solving this issue.

email <- html_text(
    html_nodes(doc, ?????))

Here is the scraper I am currently using:

library(rvest)

scrape <- function(x) {
  doc <- read_html(x)

  # Pull the text of every node matching each selector
  author <- html_text(html_nodes(doc, '.art_authors'))
  year <- html_text(html_nodes(doc, '.year'))
  journalName <- html_text(html_nodes(doc, '.journalName'))
  art_title <- html_text(html_nodes(doc, '.art_title'))
  volume <- html_text(html_nodes(doc, '.volume'))
  page <- html_text(html_nodes(doc, '.page'))
  email <- html_text(html_nodes(doc, xpath = "//a[@class = 'email']"))
  email2 <- html_text(html_nodes(doc, xpath = "//a[@class = 'ext-link']"))

  # Use NA when nothing matched; otherwise keep the first match
  Author <- ifelse(length(author) == 0, NA, author)
  Year <- ifelse(length(year) == 0, NA, year)
  Journal_Name <- ifelse(length(journalName) == 0, NA, journalName)
  Art_Title <- ifelse(length(art_title) == 0, NA, art_title)
  Volume <- ifelse(length(volume) == 0, NA, volume)
  Page <- ifelse(length(page) == 0, NA, page)

  # Collapse multiple addresses into a single " ; "-separated string
  Email <- if (length(email) == 0) NA else paste(email, collapse = " ; ")
  Email2 <- if (length(email2) == 0) NA else paste(email2, collapse = " ; ")

  # Return a one-row matrix
  row <- cbind(Author, Year, Journal_Name, Art_Title, Volume, Page, Email, Email2)
  row
}
Joel
JWH2006
  • Please make this [a reproducible question](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). What have you tried? What's going wrong? What's strange about the `span` class, since the `span` element in this HTML doesn't have a class? – camille Aug 09 '18 at 18:19
  • Yes, that's my problem: the span does not have a class, so I really don't know how to pull it. – JWH2006 Aug 09 '18 at 18:20
  • The way you get information from an HTML element is going to depend on the code you've written—such as whether you're using XPath vs CSS selectors. If you don't include your code, we don't know how you're trying to do this. – camille Aug 09 '18 at 18:22
  • Thanks, I went ahead and updated my post with my code. – JWH2006 Aug 09 '18 at 18:26

2 Answers

3

You can also use rvest::html_attr() to read the href attribute of the 'a' tag, then strip the "mailto:" prefix with stringr::str_remove():

library(rvest)
library(stringr)  # provides str_remove()
doc <- read_html('<p><span>E-mail address:</span><a title="&#xA; Link to email 
address&#xA; "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')

doc %>% html_node('a') %>% html_attr('href') %>% str_remove('mailto:')

## [1] "joeschmoe123@goodtimes.com"
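
On a real page, html_node('a') would return only the first link, which may not be an email link at all. As a rough sketch of one way around that, the CSS attribute selector a[href^="mailto:"] can restrict the match to mailto links, and base R's sub() can stand in for str_remove(). The HTML fragment below is made up for illustration (the second address and the example.org link are assumptions, not part of the question):

```r
library(rvest)

# Hypothetical page fragment: two mailto links and one ordinary link.
html <- '<div>
  <p><span>E-mail address:</span><a href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>
  <p><a href="mailto:janedoe@example.org">Contact Jane</a></p>
  <p><a href="https://example.org">Home page</a></p>
</div>'
doc <- read_html(html)

# Select every <a> whose href begins with "mailto:".
links <- html_nodes(doc, 'a[href^="mailto:"]')

# Take the address from the attribute rather than the link text,
# then drop the "mailto:" prefix.
emails <- sub("^mailto:", "", html_attr(links, "href"))
emails
```

Reading the href rather than the link text matters when the visible text is something like "Contact Jane" instead of the address itself.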
ryanhnkim
1

I guess the easiest way would be to select the <a> tags whose href attribute starts with "mailto:". Here's how you would do that:

library(xml2)
library(rvest)
doc <- read_html('<p><span>E-mail address:</span><a title="&#xA; Link to email address&#xA;  
                   "href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')

html_nodes(doc, xpath='//a[starts-with(@href,"mailto:")]') %>% html_text()
# [1] "joeschmoe123@goodtimes.com"
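
To fit this into the scraper from the question, one option is a small helper that returns NA when a page has no mailto links and a single " ; "-separated string otherwise, mirroring how Email is built there. This is only a sketch: extract_emails and its sep argument are hypothetical names, and the two test documents are invented for illustration.

```r
library(rvest)

# Hypothetical helper: one string of addresses, or NA if none are found.
extract_emails <- function(doc, sep = " ; ") {
  links <- html_nodes(doc, xpath = '//a[starts-with(@href, "mailto:")]')
  if (length(links) == 0) return(NA_character_)
  # Strip the "mailto:" prefix and any stray whitespace from each href.
  emails <- trimws(sub("^mailto:", "", html_attr(links, "href")))
  # Drop repeated addresses before collapsing into one string.
  paste(unique(emails), collapse = sep)
}

with_email <- read_html('<p><span>E-mail address:</span><a href="mailto:joeschmoe123@goodtimes.com">joeschmoe123@goodtimes.com</a></p>')
without_email <- read_html('<p>No contact details here.</p>')

extract_emails(with_email)
extract_emails(without_email)
```

Inside scrape(), the Email column could then be filled with Email <- extract_emails(doc). The original //a[@class = 'email'] selector presumably finds nothing on this page because the <a> tag shown in the question has no class attribute.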
MrFlick