Is there a reasonably straightforward way to determine the file size of a remote file without downloading the entire file? Existing Stack Overflow answers show how to do this with PHP and curl, so I imagine it's possible in R as well. If possible, I believe it would be better to avoid RCurl, since that requires an additional installation for non-Windows users.

On this survey analysis website, I write lots of scripts to automatically download large data files from government agencies (like the US Census Bureau and the CDC). I am trying to add a component that skips files that have already been downloaded by maintaining a "download cache" - but I am concerned that this cache might become stale or corrupted if: 1) the host website changes a file or 2) the user cancels a download midway through. Therefore, when deciding whether to download a file from the source HTTP or FTP site, I want to compare the local file size to the remote file size, and if they differ, download the file again.
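
To make the check concrete, here is a rough sketch of the cache logic I have in mind; get_remote_size() is only a placeholder for whatever remote-size lookup the answers suggest, not a real function:

# sketch only: get_remote_size() stands in for the remote-size lookup this question asks about
cached_download <- function(url, destfile) {
    remote.size <- get_remote_size(url)               # placeholder, not a real function
    if (file.exists(destfile) &&
        !is.na(remote.size) &&
        file.size(destfile) == remote.size) {
        return(invisible(destfile))                   # local copy matches, skip the download
    }
    download.file(url, destfile, mode = "wb")         # otherwise download (again)
    invisible(destfile)
}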

Anthony Damico

2 Answers


Nowadays a straightforward approach might be:

response = httr::HEAD(url)
httr::headers(response)[["Content-Length"]]
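
The header value comes back as a character string, so in practice it needs coercing before comparing against file.size(); here is a small usage example against the question page itself (assuming the server reports a Content-Length at all):

library(httr)
url <- "http://stackoverflow.com/questions/20921593/how-to-determine-the-file-size-of-a-remote-download-without-reading-the-entire-f"
response <- HEAD(url)
# coerce the character header value to a number for comparison with file.size()
as.numeric(headers(response)[["Content-Length"]])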

My original answer was: a more 'by hand' approach is to set the CURLOPT_NOBODY option (see man curl_easy_setopt on Linux; basically inspired by the answers to the linked question) and to tell getURL and friends to return the header along with the request:

library(RCurl)
url = "http://stackoverflow.com/questions/20921593/how-to-determine-the-file-size-of-a-remote-download-without-reading-the-entire-f"
xx = getURL(url, nobody=1L, header=1L)
strsplit(xx, "\r\n")

## [[1]]
##  [1] "HTTP/1.1 200 OK"                             
##  [2] "Cache-Control: public, max-age=60"           
##  [3] "Content-Length: 60848"                       
##  [4] "Content-Type: text/html; charset=utf-8"      
##  [5] "Expires: Sat, 04 Jan 2014 14:09:58 GMT"      
##  [6] "Last-Modified: Sat, 04 Jan 2014 14:08:58 GMT"
##  [7] "Vary: *"                                     
##  [8] "X-Frame-Options: SAMEORIGIN"                 
##  [9] "Date: Sat, 04 Jan 2014 14:08:57 GMT"         
## [10] ""                                            

A peek at url.exists suggests parseHTTPHeader(xx) for parsing HTTP headers. getURL also works with ftp URLs:

url = "ftp://ftp2.census.gov/AHS/AHS_2004/AHS_2004_Metro_PUF_Flat.zip"
getURL(url, nobody=1L, header=1L)
## [1] "Content-Length: 21288307\r\nAccept-ranges: bytes\r\n"
Martin Morgan

url <- "http://cdn.meclabs.com/training/misc/2013_Marketing_Analytics_BMR-StrongView.pdf"
library(RCurl)
res <- url.exists(url, .header=TRUE)
as.numeric(res['Content-Length']) 
# [1] 42413630
## bytes
lukeA
  • thanks! but this does not work on ftp sites? try with `url <- 'ftp://ftp2.census.gov/AHS/AHS_2004/AHS_2004_Metro_PUF_Flat.zip'` or with `url <- 'ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/NHIS/2012/personsx_layout.pdf'` – Anthony Damico Jan 04 '14 at 14:43
  • No, that won't work. As the help `?url.exists` says: _"This makes an HTTP request but with the nobody option set to FALSE"_. Although the examples of `?getURL` show that it should also work with ftp://. – lukeA Jan 04 '14 at 15:25
  • I don't know much about the ftp. Maybe there is no header. – lukeA Jan 04 '14 at 15:33