I am new to web scraping. I have managed to write code that works for my task and requirements. Below is a reproducible example:
library(tidyverse)
library(rvest)
library(stringr)
library(dplyr)
library(xml2)
## scraping hyperlinks
page <- read_html("https://www.annualreports.com/Companies?exch=9")
raw_list <- page %>%
  html_nodes(".companyName a") %>%
  html_attr("href") %>%
  str_c("https://www.annualreports.com", .)
## the scraping task
for (i in raw_list) {
  print(i)
  i %>%
    read_html() %>%
    html_nodes(".download a") %>%
    html_attr("href") %>%
    url_escape() %>%
    {paste0("https://www.annualreports.com/", .)} %>%
    walk2(., basename(.), download.file, mode = "wb")
}
However, I am running into an issue where the scraping task stops because the URL being scraped is either invalid or unavailable. Specifically, I get the following error:
trying URL 'https://www.annualreports.com/%2FHostedData%2FAnnualReportArchive%2Fadm2009.pdf'
Error in .f(.x[[i]], .y[[i]], ...) :
  cannot open URL 'https://www.annualreports.com/%2FHostedData%2FAnnualReportArchive%2Fadm2009.pdf'
In addition: Warning message:
In .f(.x[[i]], .y[[i]], ...) :
  cannot open URL 'https://www.annualreports.com/%2FHostedData%2FAnnualReportArchive%2Fadm2009.pdf': HTTP status was '404 Not Found'
## location of the error is item number 9 in the list
raw_list[9] %>%
  read_html() %>%
  html_nodes(".download a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("https://www.annualreports.com/", .)} %>%
  walk2(., basename(.), download.file, mode = "wb")
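To confirm that the failure really is the URL itself and not something in my pipeline, I believe a quick header request with httr should show the 404 (a sketch only; httr is not loaded above, so this assumes it is available):

library(httr)

## check only the response status of the link reported in the error message
bad_url <- "https://www.annualreports.com/%2FHostedData%2FAnnualReportArchive%2Fadm2009.pdf"
status_code(HEAD(bad_url))  # I expect this to return 404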
Since I cannot control the error itself (it comes from the URL), I want to bypass the error and let R continue the scraping, i.e. move on to the next URL in the list instead of stopping when the error occurs.
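From what I have read, purrr::possibly() can wrap a function so that errors return a default value instead of stopping, which sounds like the behaviour I am after. Below is a minimal sketch of what I imagine that would look like for the full task (untested; safe_dl is just a name I made up):

## sketch: wrap download.file so an error returns NULL and the walk moves on
safe_dl <- possibly(download.file, otherwise = NULL)

for (i in raw_list) {
  print(i)
  i %>%
    read_html() %>%
    html_nodes(".download a") %>%
    html_attr("href") %>%
    url_escape() %>%
    {paste0("https://www.annualreports.com/", .)} %>%
    walk2(., basename(.), safe_dl, mode = "wb")
}

What I have actually tried so far, though, is tryCatch: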
tryCatch attempt - Failed
raw_list[9] %>%
  read_html() %>%
  html_nodes(".download a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("https://www.annualreports.com/", .)} %>%
  walk2(., basename(.), tryCatch(download.file, error = function(e) NULL), mode = "wb")
The above tryCatch attempt downloads the links that work, but it still stops after the HTTP 404 Not Found error instead of moving on to the next URL.
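My guess is that the tryCatch above wraps the download.file function object rather than the call to it, so the error is never actually caught. What I imagine is needed is a small wrapper along these lines (only a sketch; safe_download is a hypothetical helper name):

## sketch: catch the error around the actual download.file() call,
## report it, and return NULL so walk2() continues with the next link
safe_download <- function(url, destfile, ...) {
  tryCatch(
    download.file(url, destfile, ...),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

raw_list[9] %>%
  read_html() %>%
  html_nodes(".download a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("https://www.annualreports.com/", .)} %>%
  walk2(., basename(.), safe_download, mode = "wb")

Is wrapping the call like this the right way to keep the scraping going past dead links, or is there a cleaner approach?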