I am trying to scrape titles and contents from a list of URLs using R. I am able to extract the title and content for each article individually. However, I need to loop through this list of URLs to get the title and content of each page.

These are the URLs, which are stored in a CSV file:

http://well.blogs.nytimes.com/2016/08/29/edible-sunscreens-all-the-rage-but-no-proof-they-work/?smid=fb-nytwell&smtyp=cur

http://www.nytimes.com/2016/08/30/well/live/how-12-epipens-saved-my-life.html?smid=fb-nytwell&smtyp=cur

http://www.nytimes.com/2016/08/29/opinion/why-we-never-die.html?smid=fb-nytwell&smtyp=cur

http://www.nytimes.com/2016/08/31/health/how-to-ride-downhill-on-a-bicycle.html?smid=fb-nytwell&smtyp=cur

http://www.cbssports.com/college-football/news/one-sweet-gesture-by-fsus-travis-rudolph-makes-mom-of-an-autistic-boy-cry/

http://www.nytimes.com/2016/08/31/well/family/what-kids-wish-their-teachers-knew.html?smid=fb-nytwell&smtyp=cur

This is the code I used to extract each article individually. Note that each paragraph of the content is a separate node, so when I extract these nodes each one appears in a new row, while I need them all to be in the first row only.

install.packages('xml2')    
library(xml2)    
library(rvest)

url <- "http://well.blogs.nytimes.com/2016/08/29/edible-sunscreens-all-the-rage-but-no-proof-they-work/?smid=fb-nytwell&smtyp=cur"

article <- read_html(url)    
title <- article %>% html_node(".entry-title") %>% html_text()    
content <- article %>% html_nodes(".story-body-text") %>% html_text()    
article_table <- data.frame(title, content)

article_table
user5249203

1 Answer

You need to collapse the paragraphs into a single string so that each article's content ends up in one row:

content <-
    article %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = "")
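To see why collapsing matters, here is a small offline sketch. The HTML snippet is made up for illustration and is not the real page markup; only the .story-body-text class comes from the question. I use a space as the separator here so the paragraphs do not run together, but collapse = "" as above works the same way.

```r
library(rvest)

# Made-up page with two paragraphs, standing in for a real article
page <- read_html('<html><body>
  <p class="story-body-text">First paragraph.</p>
  <p class="story-body-text">Second paragraph.</p>
</body></html>')

# html_text() returns one element per matched node
paragraphs <- page %>% html_nodes(".story-body-text") %>% html_text()
length(paragraphs)

# paste(collapse = ...) joins them into a single string,
# so data.frame(title, content) gets one row per article
content <- paste(paragraphs, collapse = " ")
length(content)
```

The first length() is the number of paragraphs, which is why data.frame() produced one row per paragraph; the second is always 1.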

For multiple URLs, this has already been answered here.

I adapted it to your case. Please note that the .entry-title selector does not work for all of the URLs; you need to use the title element instead.

library(rvest)
library(purrr)
# listofurls: character vector holding the URLs (e.g. read from your CSV file)
article <- listofurls %>% map(read_html)
title <-
    article %>% map_chr(. %>% html_node("title") %>% html_text())
content <-
    article %>% map_chr(. %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = ""))
article_table <- data.frame("Title" = title, "Content" = content)
dim(article_table)
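listofurls is not defined in the code above. Below is a minimal sketch of one way to build it and to keep the loop running when a page fails to load. It assumes the CSV has a single column and no header row, and the file name urls.csv is hypothetical; read_url_list, safe_read_html and scrape_articles are helper names I made up for this example.

```r
library(rvest)
library(purrr)

# Read the URLs from a one-column CSV file without a header row
# (the file layout is an assumption; adjust header/column to your file)
read_url_list <- function(path) {
  urls <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)[[1]]
  trimws(urls)
}

# Return NULL instead of stopping when a page cannot be fetched or parsed
safe_read_html <- possibly(read_html, otherwise = NULL)

# Build one row per successfully loaded page
scrape_articles <- function(urls) {
  pages <- map(urls, safe_read_html)
  ok <- !map_lgl(pages, is.null)
  data.frame(
    Url = urls[ok],
    Title = map_chr(pages[ok], ~ html_text(html_node(.x, "title"))),
    Content = map_chr(
      pages[ok],
      ~ paste(html_text(html_nodes(.x, ".story-body-text")), collapse = " ")
    ),
    stringsAsFactors = FALSE
  )
}

# Usage (not run here; "urls.csv" is a hypothetical file name):
# listofurls <- read_url_list("urls.csv")
# article_table <- scrape_articles(listofurls)
```

Pages that fail to load are dropped, so the Url column tells you which rows survived. Also keep in mind that .story-body-text is specific to the NYT pages; a page from another site, such as the cbssports.com one, will most likely come back with an empty Content string unless you use a selector that fits it.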
user5249203