Scraping text from a list of URLs using R

Question

I am trying to scrape titles and contents from a list of URLs using R. I am able to extract the title and content for each article individually. However, I need to loop through this list of URLs to get the title and content from each page.

These are the URLs, which are stored in a CSV file:

http://well.blogs.nytimes.com/2016/08/29/edible-sunscreens-all-the-rage-but-no-proof-they-work/?smid=fb-nytwell&smtyp=cur

http://www.nytimes.com/2016/08/30/well/live/how-12-epipens-saved-my-life.html?smid=fb-nytwell&smtyp=cur

http://www.nytimes.com/2016/08/29/opinion/why-we-never-die.html?smid=fb-nytwell&smtyp=cur

http://www.nytimes.com/2016/08/31/health/how-to-ride-downhill-on-a-bicycle.html?smid=fb-nytwell&smtyp=cur

http://www.cbssports.com/college-football/news/one-sweet-gesture-by-fsus-travis-rudolph-makes-mom-of-an-autistic-boy-cry/

http://www.nytimes.com/2016/08/31/well/family/what-kids-wish-their-teachers-knew.html?smid=fb-nytwell&smtyp=cur

This is the code I used to extract each article individually. Note that each paragraph of the content is a separate node, so when I extract these nodes each one appears in a new row, while I need them all to be in the first row.

install.packages(c('xml2', 'rvest'))  # one-time install of both packages
library(xml2)
library(rvest)

url <- "http://well.blogs.nytimes.com/2016/08/29/edible-sunscreens-all-the-rage-but-no-proof-they-work/?smid=fb-nytwell&smtyp=cur"

article <- read_html(url)    
title <- article %>% html_node(".entry-title") %>% html_text()    
content <- article %>% html_nodes(".story-body-text") %>% html_text()    
article_table <- data.frame(title, content)

article_table

user5249203 · Answer 1 · 2018-04-11T19:02:00.690


You need to collapse the output into a single string so that each article's content fits in one row:

content <-
    article %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = "")
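
To see why the collapse is needed, here is a minimal sketch in base R. The values of `title` and `paragraphs` are made-up stand-ins for what `html_text()` returns, not text from the article. `data.frame()` recycles the single title across every paragraph, so the table has one row per paragraph until the paragraphs are pasted together. The sketch uses a space as the separator so that adjacent paragraphs do not run together; that is a choice of this example, not part of the code above.

```r
# Hypothetical stand-ins for the scraped title and paragraphs of one page
title <- "Example title"
paragraphs <- c("First paragraph.", "Second paragraph.", "Third paragraph.")

# Without collapsing: the title is recycled, giving one row per paragraph
uncollapsed <- data.frame(title, content = paragraphs)
nrow(uncollapsed)

# With collapsing: the paragraphs become one string, giving a single row
content <- paste(paragraphs, collapse = " ")
collapsed <- data.frame(title, content)
nrow(collapsed)
```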

For multiple URLs, this has already been answered in another question.

I adapted it to your case. Please note that the .entry-title selector does not work for all URLs; you need to use the title element instead.

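library(rvest)
library(purrr)
# listofurls: a character vector holding the URLs to scrape
article <- listofurls %>% map(read_html)
# Use the <title> element, since .entry-title is missing on some pages
title <-
    article %>% map_chr(. %>% html_node("title") %>% html_text())
# Collapse each page's paragraphs into a single string
content <-
    article %>% map_chr(. %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = ""))
article_table <- data.frame("Title" = title, "Content" = content)
dim(article_table)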
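
Since the URLs are stored in a CSV file and not every site uses the same markup, a more defensive variant might look like the sketch below. `scrape_page()` is an illustrative helper, not an rvest function, and the CSV column name `url` in the comments is an assumption because the question does not show the file's layout. A page that fails to load, or that has no `.story-body-text` nodes, yields `NA` instead of an empty string or an error that stops the whole run. To keep the sketch runnable offline, it is demonstrated on two inline HTML strings rather than on the real URLs.

```r
library(rvest)
library(purrr)

# Illustrative helper (not part of rvest): returns a one-row data frame
scrape_page <- function(src, content_css = ".story-body-text") {
  # If the page cannot be read, record NA instead of aborting the loop
  page <- tryCatch(read_html(src), error = function(e) NULL)
  if (is.null(page)) {
    return(data.frame(Title = NA_character_, Content = NA_character_,
                      stringsAsFactors = FALSE))
  }
  title <- page %>% html_node("title") %>% html_text()
  paragraphs <- page %>% html_nodes(content_css) %>% html_text()
  # No matching nodes usually means the site uses different markup
  content <- if (length(paragraphs) == 0) {
    NA_character_
  } else {
    paste(paragraphs, collapse = " ")
  }
  data.frame(Title = title, Content = content, stringsAsFactors = FALSE)
}

# Offline stand-ins for two pages; the second lacks the expected class
pages <- c(
  "<html><head><title>A</title></head><body><p class='story-body-text'>One.</p><p class='story-body-text'>Two.</p></body></html>",
  "<html><head><title>B</title></head><body><p class='other'>Text.</p></body></html>"
)
article_table <- do.call(rbind, map(pages, scrape_page))
article_table

# With real data (file name and column name are assumptions):
# urls <- read.csv("urls.csv", stringsAsFactors = FALSE)$url
# article_table <- do.call(rbind, map(urls, scrape_page))
```

Rows where Content is NA point to pages whose body text sits under a different selector. For those sites, inspect the page source and pass a suitable `content_css` value to `scrape_page()`.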

edited Apr 11 '18 at 19:02

answered Apr 11 '18 at 17:49


Great. That works fine. Any idea about looping through these urls to extract the contents? – Majed Alghamdi Apr 11 '18 at 18:25
I think that has been already answered. If my solution worked, please accept the answer. Thank you – user5249203 Apr 11 '18 at 18:58
Thank you user5249203 for your answer. I am still looking for an answer for the second half of the question (how to loop through a list of urls). I will wait for others to help. – Majed Alghamdi Apr 11 '18 at 19:17
Awesome! That's helpful. But for some URLs the content doesn't appear. – Majed Alghamdi Apr 11 '18 at 19:46
For example, the content for 3rd url and 5th url does not show. – Majed Alghamdi Apr 11 '18 at 19:48
