
This code attempts to download a page that does not exist:

url <- "https://en.wikipedia.org/asdfasdfasdf"
status_code <- download.file(url, destfile = "output.html", method = "libcurl")

This returns a 404 error:

trying URL 'https://en.wikipedia.org/asdfasdfasdf'
Error in download.file(url, destfile = "output.html", method = "libcurl") : 
  cannot open URL 'https://en.wikipedia.org/asdfasdfasdf'
In addition: Warning message:
In download.file(url, destfile = "output.html", method = "libcurl") :
  cannot open URL 'https://en.wikipedia.org/asdfasdfasdf': HTTP status was '404 Not Found'

but the status_code variable still contains 0, even though the documentation for download.file states that the returned value is:

An (invisible) integer code, 0 for success and non-zero for failure. For the "wget" and "curl" methods this is the status code returned by the external program. The "internal" method can return 1, but will in most cases throw an error.

The results are the same if I use curl or wget as the download method. What am I missing here? Is the only option to call warnings() and parse the output?

I've seen other questions about using download.file, but none (that I can find) that actually retrieve the HTTP status code.

Michael A
    I don't know R, nor the download.file wrapper, but the underlying libcurl way of getting the code is `long response_code; curl_easy_getinfo(ch, CURLINFO_RESPONSE_CODE, &response_code);` - check whether your download.file API exposes libcurl's curl_easy_getinfo() somehow – hanshenrik Dec 30 '18 at 00:32

2 Answers


Probably the best option is to use the cURL library directly rather than via the download.file wrapper, which does not expose cURL's full functionality. We can do this, for example, with the RCurl package (other packages such as httr, or system calls, can achieve the same thing). Using cURL directly gives you access to the cURL info, including the response code. For example:

library(RCurl)
curl = getCurlHandle()
x = getURL("https://en.wikipedia.org/asdfasdfasdf", curl = curl)
write(x, 'output.html')
getCurlInfo(curl)$response.code
# [1] 404
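
The "system calls" route mentioned above can be sketched by shelling out to the curl command-line tool (assumed to be installed and on the PATH) and asking it to print the HTTP status code:

```r
# Sketch: download the body with the curl CLI and capture the HTTP status.
# -s silences progress output, -o writes the body to a file,
# -w '%{http_code}' prints the status code to stdout after the transfer.
status <- system2(
  "curl",
  c("-s", "-o", "output.html", "-w", "%{http_code}",
    "https://en.wikipedia.org/asdfasdfasdf"),
  stdout = TRUE
)
status
# [1] "404"
```

Note that system2() returns the captured stdout as a character vector, so the status arrives as the string "404" rather than an integer.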

Although the first option above is much cleaner, if you really want to use download.file instead, one possible way is to capture the warning using withCallingHandlers:

try(withCallingHandlers( 
  download.file(url, destfile = "output.html", method = "libcurl"),
  warning = function(w) {
    # strip everything up to the HTTP status from the warning message
    my.warning <<- sub(".+HTTP status was ", "", conditionMessage(w))
    }),
  silent = TRUE)

cat(my.warning)
# '404 Not Found'
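
If you only want the numeric code, the captured message can be parsed with a small helper. The function name and the exact message format below are illustrative, not part of base R:

```r
# Hypothetical helper: pull the numeric HTTP status out of the warning
# text produced by download.file. Returns NA if no status is present.
extract_status <- function(msg) {
  hit <- regmatches(msg, regexpr("HTTP status was '\\d{3}", msg))
  if (length(hit) == 0) return(NA_integer_)
  as.integer(sub(".*'", "", hit))
}

extract_status("cannot open URL '...': HTTP status was '404 Not Found'")
# [1] 404
```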
dww

If you don't mind using a different method you can try GET from the httr package:

url_200 <- "https://en.wikipedia.org/wiki/R_(programming_language)"
url_404 <- "https://en.wikipedia.org/asdfasdfasdf"

# OK
raw_200 <- httr::GET(url_200)
raw_200$status_code
#> [1] 200

# Not found
raw_404 <- httr::GET(url_404)
raw_404$status_code
#> [1] 404

Created on 2019-01-02 by the reprex package (v0.2.1)
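
As a side note, httr also ships helpers around the status code, so you don't have to reach into the response object directly. A sketch, assuming httr is installed:

```r
# httr helpers: status_code() extracts the code, http_error() is TRUE
# for any 4xx/5xx response.
resp <- httr::GET("https://en.wikipedia.org/asdfasdfasdf")
httr::status_code(resp)
#> [1] 404
httr::http_error(resp)
#> [1] TRUE
```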

Birger