
I am looping through a .csv file of URLs to scrape a website (one that permits scraping).

I was using tryCatch to keep errors from breaking my for loop, but I noticed that it still stops on some URLs (when calling download.file).

So I am now using an "is this a valid URL?" function taken from this post: Scrape with a loop and avoid 404 error

    url_works <- function(url){
        tryCatch(
            identical(status_code(HEAD(url)), 200L),
            error = function(e){
                FALSE
            })
    }

But even with this function, and downloading only when it returns TRUE, my loop still breaks on some URLs and I get the following error:

> HTTP status was '500 Internal Server Error'

I would like to understand this error so that I can handle this case in the URL function and skip such URLs if they come up again.

Any thoughts? Thanks!

Gautam
ML_Enthousiast
  • [httr](http://httr.r-lib.org/) has some ways of dealing with this, or [`purrr::possibly`](https://purrr.tidyverse.org/reference/safely.html) – alistaire Sep 17 '18 at 15:40

1 Answer


Your tryCatch call needs restructuring. I also changed the error handler to print the error:

A generic tryCatch looks like:

    tryCatch({
        # operation you want to try
    }, error = function(e) {
        # what to do on error
    })

So for your code:

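    url_works <- function(url){
        # Capture the status code, or NA if the request itself fails,
        # so that s1 is always defined when it is compared below
        s1 <- tryCatch({
            status_code(HEAD(url))
        }, error = function(e) {
            print(paste0(url, " ", as.character(e)))
            NA_integer_
        })
        identical(s1, 200L)
    }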
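
Since the question mentions that the loop stopped inside download.file, the same tryCatch pattern can be applied to the download step itself. The following is only a minimal sketch: `safe_download`, its `downloader` argument, and the `urls` vector are hypothetical names introduced for illustration, not part of the original code.

```r
# Hypothetical helper: returns TRUE if the download succeeded, FALSE otherwise.
# `downloader` defaults to download.file but can be swapped out (e.g. for testing).
safe_download <- function(url, destfile, downloader = utils::download.file) {
  tryCatch({
    downloader(url, destfile, quiet = TRUE)
    TRUE
  }, error = function(e) {
    # Report the failure and let the caller move on to the next URL
    message("Skipping ", url, ": ", conditionMessage(e))
    FALSE
  })
}

# Usage inside a loop (`urls` is an assumed character vector of addresses):
# for (i in seq_along(urls)) {
#   ok <- safe_download(urls[i], paste0("page_", i, ".html"))
#   if (!ok) next
# }
```

Returning a logical instead of stopping lets the loop decide what to do with a failed URL, for example skipping it or recording it for a later retry.
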
Mako212
  • I tried it but it still gives the same error and stops. – ML_Enthousiast Sep 17 '18 at 17:59
  • Please provide an example of what a `url` is so we can test. – Mako212 Sep 17 '18 at 18:01
  • I actually solved the problem thanks to your first point about tryCatch! I switched to your version, and now download.file no longer stops the for loop even when there is an error. As for the url_works function, I am intrigued that the status code is always 200 when the URL works; do we know the reason for that? I will mark your response as correct! – ML_Enthousiast Sep 17 '18 at 19:19
  • `HTTP` has a standard set of response codes that reflect the server's response to the client request; 200 is the standard response for a successful request: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes – Mako212 Sep 17 '18 at 19:25
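
To make the status-code ranges mentioned above concrete, here is a minimal sketch that maps a code to its standard class. `status_class` is a hypothetical helper written for illustration; it does not come from httr or from the original answer.

```r
# Hypothetical helper: map an HTTP status code to its standard class.
status_class <- function(code) {
  if (is.na(code)) return("no response")
  if (code >= 100 && code < 200) return("informational")
  if (code >= 200 && code < 300) return("success")
  if (code >= 300 && code < 400) return("redirection")
  if (code >= 400 && code < 500) return("client error")
  if (code >= 500 && code < 600) return("server error")
  "unknown"
}

status_class(200L)  # "success"
status_class(404L)  # "client error"
status_class(500L)  # "server error"
```

Under this reading, the '500 Internal Server Error' from the question falls in the server-error range, so a loop could treat any such URL as one to skip rather than as a malformed address.
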