
I am new to web scraping. I have managed to write code that works for my task and requirements. Below is a reproducible example:

library(tidyverse)
library(rvest)
library(stringr)
library(dplyr)
library(xml2)

## scraping hyperlinks

page <- read_html("https://www.annualreports.com/Companies?exch=9")

raw_list <- page %>%
  html_nodes(".companyName a") %>%
  html_attr("href") %>% 
  str_c("https://www.annualreports.com", .)

## the scraping task

for(i in raw_list){print(i)} %>%
  read_html() %>%
  html_nodes(".download a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("https://www.annualreports.com/", .)} %>%
  walk2(., basename(.), download.file, mode = "wb")

However, I am facing an issue where the scraping task stops because the URL being scraped is either invalid or unavailable. Specifically, I get the following error:

trying URL 'https://www.annualreports.com/%2FHostedData%2FAnnualReportArchive%2Fadm2009.pdf'
Error in .f(.x[[i]], .y[[i]], ...) :
  cannot open URL 'https://www.annualreports.com/%2FHostedData%2FAnnualReportArchive%2Fadm2009.pdf'
In addition: Warning message:
In .f(.x[[i]], .y[[i]], ...) :
  cannot open URL 'https://www.annualreports.com/%2FHostedData%2FAnnualReportArchive%2Fadm2009.pdf': HTTP status was '404 Not Found'

## location of the error is item number 9 in the list

raw_list[9] %>%
  read_html() %>%
  html_nodes(".download a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("https://www.annualreports.com/", .)} %>%
  walk2(., basename(.), download.file, mode = "wb")

Since I cannot control the error itself (it comes from a problem with the URL), I want to bypass it by letting R continue the scraping process, i.e., move on to the next URL in the list instead of stopping when the error occurs.
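
To make the goal concrete, the behaviour I am after is roughly the following (an illustrative sketch only; the two URLs are made-up placeholders standing in for the links produced by the pipeline above):

# Illustration of the desired behaviour: attempt every URL, report and skip any failure
example_urls <- c("https://www.annualreports.com/example_good.pdf",    # placeholder
                  "https://www.annualreports.com/example_missing.pdf") # placeholder

for (pdf_url in example_urls) {
  tryCatch(
    download.file(pdf_url, destfile = basename(pdf_url), mode = "wb"),
    error = function(e) message("Skipping ", pdf_url, ": ", conditionMessage(e))
  )
}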

tryCatch attempt - Failed

raw_list[9] %>%
  read_html() %>%
  html_nodes(".download a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("https://www.annualreports.com/", .)} %>%
  walk2(., basename(.), tryCatch(download.file, error=function(e) NULL), mode = "wb")

The above tryCatch iteration downloads the active links but again fails to continue after the HTTP 404 Not Found error.

  • Does this help https://stackoverflow.com/q/12193779/5784831? – Christoph Jul 27 '21 at 13:58
  • @Christoph: I tried to wrap my `walk2` application in `try()` and `tryCatch()`. Wrapping in `try()` simply returns the values of the URLs, and wrapping in `tryCatch()` attempts to download the first URL, gets stuck there, and times out. – fsure Jul 28 '21 at 11:16

1 Answer


I wrote this code for web scraping a long list of downloadable URLs, but it has the tryCatch that skips errors for you. (It is unclear what type of file you are trying to download.)

DL_function <- function(x, df) {
  # x indexes a row of df: column 1 holds the URL, columns 3 and 4 build the output file name
  return(tryCatch(
    download.file(df[x, 1], paste0(df[x, 3], "_", df[x, 4], ".png"),
                  method = "auto", mode = "ab", cacheOK = TRUE),
    error = function(e) NULL))
}

It should work if modified like this:

DL_function <- function(x) {
  # x is a single URL; save it under the file name part of the URL (basename).
  # On any error (e.g. a 404), return NULL instead of stopping.
  return(tryCatch(
    download.file(x, destfile = basename(x), method = "auto", mode = "wb", cacheOK = TRUE),
    error = function(e) NULL))
}

for(i in raw_list){print(i)} %>%
  read_html() %>%
  html_nodes(".download a") %>%
  html_attr("href") %>%
  url_escape() %>%
  {paste0("https://www.annualreports.com/", .)} %>%
  DL_function()
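
One caveat: download.file is not guaranteed to accept a whole vector of URLs in a single call (it depends on the download method), so piping the full vector into DL_function() at once can itself error and just return NULL. Applying the wrapper to one link at a time should be safer; an untested sketch, assuming the `.download a` selector does return the PDF hrefs in your session (it did not in mine, see the comments), could look like this (walk() comes from purrr, which the tidyverse you already load provides):

for (i in raw_list) {
  read_html(i) %>%
    html_nodes(".download a") %>%
    html_attr("href") %>%
    url_escape() %>%
    {paste0("https://www.annualreports.com/", .)} %>%
    walk(DL_function)   # a failed download returns NULL and the loop simply moves on
}

This keeps your selectors and URL construction untouched; only the download step is wrapped in the error handler.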
  • Hi. The code returns "NULL" as a response only. It doesn't even download the working URLs before it reaches the file that actually triggers the error. I substituted your `download.file` function with my `walk2` function; same result with both of them. Also, I am downloading .pdf files. Thanks for your response though, appreciate it. – fsure Jul 28 '21 at 04:44
  • It appears the html_nodes + html_attr() are not scraping the pdf links. At least in my session that is the case. – d3hero23 Jul 28 '21 at 15:03