For a project, I need to regularly download data files from different websites and build an indicator from them.

As the update frequency of those files varies a lot, I am looking for an efficient way to detect whether a remote file was updated.

The answer linked below suggests using the -I option of curl. How does this translate to the curl package?

https://superuser.com/questions/619592/get-modification-time-of-remote-file-over-http-in-bash-script

Alternative solutions seem to parse the header for either the file size or the modification date:

Something similar to:

PHP: Remote file size without downloading file

My attempt below (with a small file), however, downloads the full file.

library(curl)


req <- curl_fetch_memory("http://www.pcr.uu.se/digitalAssets/124/124932_1ucdponesided2015.rdata")
str(req)
object.size(req)
parse_headers(req$headers)

Is it possible to either download just the header with the curl package or to specify an option that avoids redundant downloads?

Lod

2 Answers

You'll have to keep a history of the last-modified dates of the files (assuming the web server reports them consistently) and check against it with httr::HEAD() before downloading. That means you have some work to do storing the last-modified value somewhere, probably in a data frame alongside the URL:

library(httr)

URL <- "http://www.pcr.uu.se/digitalAssets/124/124932_1ucdponesided2015.rdata"

#' Download a file only if it has changed since \code{last_modified}
#' 
#' @param URL url of file
#' @param fil path to write file
#' @param last_modified \code{POSIXct}. Ideally, the output from the first 
#'        successful run of \code{get_file()}
#' @param overwrite overwrite the file if it exists?
#' @param .verbose output a message if the file was unchanged?
get_file <- function(URL, fil, last_modified=NULL, overwrite=TRUE, .verbose=TRUE) {

  if ((!file.exists(fil)) || is.null(last_modified)) {
    res <- GET(URL, write_disk(fil, overwrite))
    return(httr::parse_http_date(res$headers$`last-modified`))
  } else if (inherits(last_modified, "POSIXct")) {
    res <- HEAD(URL)
    cur_last_mod <- httr::parse_http_date(res$headers$`last-modified`)
    if (cur_last_mod != last_modified) {
      res <- GET(URL, write_disk(fil, overwrite))
      return(httr::parse_http_date(res$headers$`last-modified`))
    }
    if (.verbose) message(sprintf("'%s' unchanged since %s", URL, last_modified))
    return(last_modified)
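  } else {
    # avoid silently returning NULL for an unsupported argument type
    stop("`last_modified` must be NULL or a POSIXct value")
  }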

}

# first run == you don't know the last-modified date.
# you need to pair this with the URL in some data structure for later use.
last_mod <- get_file(URL, basename(URL))

class(last_mod)
## [1] "POSIXct" "POSIXt"

last_mod
## [1] "2015-11-16 17:34:06 GMT"

last_mod <- get_file(URL, basename(URL), last_mod)
#> 'http://www.pcr.uu.se/digitalAssets/124/124932_1ucdponesided2015.rdata' unchanged since 2015-11-16 17:34:06
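
Since the question asks about the curl package specifically, here is a minimal sketch of the same header-only check with curl instead of httr. It assumes the server answers HEAD-style requests and sends a Last-Modified header; `find_header()` and `remote_last_modified()` are hypothetical helper names, not part of the package. The libcurl option `nobody` asks for the headers only, which is what `curl -I` does on the command line.

```r
library(curl)

# Return the value of one header from a vector of "Name: value" lines,
# or NULL if the header is absent. Hypothetical helper, not part of curl.
find_header <- function(lines, name) {
  pattern <- paste0("^", name, ":")
  hit <- grep(pattern, lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NULL)
  # if the header occurs more than once, keep the last occurrence
  trimws(sub("^[^:]*:", "", hit[length(hit)]))
}

# Fetch only the headers of `url` and return Last-Modified as POSIXct,
# or NULL if the server does not report it. Hypothetical helper.
remote_last_modified <- function(url) {
  h <- new_handle(nobody = TRUE)              # headers only, no body
  res <- curl_fetch_memory(url, handle = h)
  value <- find_header(parse_headers(res$headers), "last-modified")
  if (is.null(value)) return(NULL)
  parse_date(value)                           # curl's HTTP date parser
}
```

Calling `remote_last_modified(URL)` and comparing the result with the stored `last_mod` should then tell you whether a fresh download is needed, without transferring the file itself.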
hrbrmstr

An alternative to the httr package is the base function base::curlGetHeaders(url), but you'll still need to parse the last modified date yourself!
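
A rough sketch of what that parsing could look like with base R only. It assumes the server sends the date in the usual HTTP form, such as "Mon, 16 Nov 2015 17:34:06 GMT"; `parse_last_modified()` is a hypothetical helper name. The month is looked up in `month.abb` rather than parsed with `%b`, so the result should not depend on the session locale.

```r
# Extract Last-Modified from the character vector returned by
# base::curlGetHeaders() and convert it to POSIXct (GMT).
# Returns NULL if the header is missing or not in the expected form.
# Hypothetical helper, not part of base R.
parse_last_modified <- function(header_lines) {
  hit <- grep("^last-modified:", header_lines, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NULL)
  # with redirects several responses are listed; keep the last one
  value <- trimws(sub("^[^:]*:", "", hit[length(hit)]))
  pattern <- paste0("^[A-Za-z]{3}, ([0-9]{2}) ([A-Za-z]{3}) ([0-9]{4}) ",
                    "([0-9]{2}:[0-9]{2}:[0-9]{2}) GMT$")
  parts <- regmatches(value, regexec(pattern, value))[[1]]
  if (length(parts) == 0) return(NULL)
  month <- match(parts[3], month.abb)         # "Nov" -> 11
  if (is.na(month)) return(NULL)
  as.POSIXct(sprintf("%s-%02d-%s %s", parts[4], month, parts[2], parts[5]),
             tz = "GMT")
}

# Offline example with a hand-written header vector; in real use the input
# would come from curlGetHeaders(url), which performs the network request.
example_headers <- c("HTTP/1.1 200 OK\r\n",
                     "Last-Modified: Mon, 16 Nov 2015 17:34:06 GMT\r\n",
                     "\r\n")
parse_last_modified(example_headers)
```

The returned POSIXct value can be stored next to the URL and compared on the next run, in the same way as in the httr answer.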

jsavn