Scrape multiple URLs from a CSV file with R

Question

I have a CSV file that contains information about a set of articles; the 9th column contains the URLs. I have successfully scraped the title and abstract from a single URL with the following code:

library('rvest')
url <- 'https://link.springer.com/article/10.1007/s10734-019-00404-5'
webpage <- read_html(url)

title_data_html <- html_nodes(webpage,'.u-h1')
title_data <- html_text(title_data_html)
head(title_data)

abstract_data_html <- html_nodes(webpage,'#Abs1-content p')
abstract_data <- html_text(abstract_data_html)
head(abstract_data)

myTable = data.frame(Title = title_data, Abstract = abstract_data)
View(myTable)

Now I want to use R to scrape the title and abstract of each article. My question is how to import the URLs contained in the CSV file and how to write a for loop to scrape the data I need. I'm quite new to R, so thanks in advance for your help.

Hey, welcome! Since we don't know what your CSV looks like, can you provide the name of the CSV and the name of the column that contains the URLs? — Yach, Apr 20 '20 at 13:17
It seems this can easily be done by applying a function over the column containing your urls. Have a look at [`lapply`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/lapply). As @Yach mentioned, it'd be good if you could provide a few lines of the .csv through a [reproducible example](https://stackoverflow.com/a/5963610/9046275). — anddt, Apr 20 '20 at 13:25

Mohanasundaram · Accepted Answer · 2020-04-21T04:09:43.803

Try this:

library(rvest)

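# Read the CSV; the URLs are assumed to be in the first column
# (use URLs[[9]] instead if they are in the 9th column, as in the question)
URLs <- read.csv("urls.csv", stringsAsFactors = FALSE)
URLs2 <- as.character(URLs[[1]])
n <- length(URLs2)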

df <- data.frame(Row = integer(), Title = character(), Abstract = character(), stringsAsFactors = FALSE)

for (i in seq_len(n)) {
  # Return NULL instead of stopping when a page cannot be read
  webpage <- tryCatch(read_html(URLs2[i]), error = function(e) NULL)
  if (!is.null(webpage)) {
    title_data <- html_text(html_nodes(webpage, '.u-h1'))
    abstract_data <- html_text(html_nodes(webpage, '#Abs1-content p'))
    # Keep the page only if both a title and an abstract were found
    if (length(title_data) > 0 && length(abstract_data) > 0) {
      # Row is the entry's row number in the CSV; abstract paragraphs are joined into one string
      temp <- data.frame(Row = i, Title = title_data[1], Abstract = paste(abstract_data, collapse = " "), stringsAsFactors = FALSE)
      df <- rbind(df, temp)
    }
  }
}

View(df)

Edit: The code now skips URLs that cannot be read instead of stopping with an error. Each output row is numbered with the row number of the corresponding entry in the CSV.
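
The `lapply` approach suggested in the comments can be sketched as follows. This is only a minimal sketch, not the code from the answer above: `scrape_article` and `scrape_all` are hypothetical helper names, the CSS selectors are the ones from the question, and the file name "urls.csv" and the 9th-column position of the URLs are assumptions taken from the question and the answer.

```r
library(rvest)

# Hypothetical helper: scrape one article page and return a one-row data frame,
# or NULL if the page cannot be read or has no title or abstract.
scrape_article <- function(url) {
  webpage <- tryCatch(read_html(url), error = function(e) NULL)
  if (is.null(webpage)) return(NULL)

  title_data <- html_text(html_nodes(webpage, ".u-h1"))
  abstract_data <- html_text(html_nodes(webpage, "#Abs1-content p"))
  if (length(title_data) == 0 || length(abstract_data) == 0) return(NULL)

  data.frame(
    Title = title_data[1],
    # Join multiple abstract paragraphs into a single string
    Abstract = paste(abstract_data, collapse = " "),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: apply scrape_article to every URL and stack the results.
# Pages that returned NULL are dropped; returns NULL if nothing was scraped.
scrape_all <- function(urls) {
  results <- lapply(urls, scrape_article)
  results <- Filter(Negate(is.null), results)
  do.call(rbind, results)
}

# Usage, assuming "urls.csv" holds the URLs in its 9th column as in the question:
# urls <- as.character(read.csv("urls.csv", stringsAsFactors = FALSE)[[9]])
# df <- scrape_all(urls)
# View(df)
```

Putting the per-page logic in its own function keeps the error handling in one place and makes it easy to try on a single URL before running it over the whole column.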
