
I'm trying to download the following dataset with download.file, which only works when method = "wget":

# Doesn't work
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "auto")
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "curl")

# Works
download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "wget")

According to help(download.file),

If method = "auto" is chosen (the default), the internal method is chosen for file:// URLs, and for the others provided capabilities("http/ftp") is true (which it almost always is).

Looking at the source code, "internal method" refers to:

if (method == "internal") {
        status <- .External(C_download, url, destfile, quiet, 
            mode, cacheOK)
        if (!quiet) 
            flush.console()
    }

But I still don't know what .External(C_download) does, especially across platforms. It's important for me to know this rather than rely on wget, because I'm writing a package that should work cross-platform.

Heisenberg
  • I think it's referring to the configured method for your OS. It could be `curl`, for example. – Rich Scriven Jun 18 '15 at 04:31
  • Look at the function definition. Very clear what's going on when you call different `method` values. – Roman Luštrik Jun 18 '15 at 05:04
  • @RomanLuštrik Thanks for pointing to the source. It's clear that `method = "auto"` calls the `.External(C_download)`, but at this point I'm stumped again. How to know what this external function does? (It'd be much easier to find the C code for `.Internal()` and `.Primitive()`) – Heisenberg Jun 18 '15 at 16:26

2 Answers


The source code for this is in the R sources (download the current version from http://cran.r-project.org/sources.html). The relevant code (as of R 3.2.1) is in "./src/modules/internet/internet.c" and "./src/modules/internet/nanohttp.c".

According to the latter, the code for the minimalist HTTP GET functionality is based on libxml2-2.3.6.

The files are also available on the R svn site at https://svn.r-project.org/R/branches/R-3-2-branch/src/modules/internet/internet.c and https://svn.r-project.org/R/branches/R-3-2-branch/src/modules/internet/nanohttp.c if you'd prefer not to download the whole .tgz file and decompress it.

If you look at the code, most of it is consistent across platforms. However, on Windows, the wininet code seems to be used.

The code was identified by looking initially in the utils package, since that is where the R command download.file is found. I grepped for download in the c files in the "./src/library/utils/src" directory and found that the relevant code was in "sock.c". There was a comment high up in that file which read /* from src/main/internet.c */ and so I next went to "internet.c".
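A search like the one described above can also be sketched in R itself. This is only an illustration: `grep_c_sources` is a hypothetical helper name, and the demo directory below is a stand-in, not the real R source tree.

```r
# Hypothetical helper: scan every .c file under `src_dir` and return the
# file name, line number, and text of each line matching `pattern`.
grep_c_sources <- function(src_dir, pattern) {
  files <- list.files(src_dir, pattern = "\\.c$", recursive = TRUE,
                      full.names = TRUE)
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines)
    if (!length(idx)) return(NULL)
    data.frame(file = basename(f), line = idx, text = trimws(lines[idx]),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, hits)  # NULL when nothing matches
}

# Stand-in directory with one made-up C file (assumption: not real R source).
demo_dir <- tempfile("src_demo")
dir.create(demo_dir)
writeLines(c("/* stand-in file, not real R source */",
             "int unrelated(void) { return 0; }",
             "int download_stub(void) { return 1; }"),
           file.path(demo_dir, "demo.c"))

hits <- grep_c_sources(demo_dir, "download")
print(hits)
```

Pointed at an unpacked R source tree, a call such as grep_c_sources("src/library/utils/src", "download") should reproduce the grep step, assuming the directory layout described above.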

With respect to your specific file, the issue is that the link you have returns a 302 Found status code. On Windows and using wget, the download routine follows the Location field of the 302 response and gets the actual file. Using the curl method works but only if you supply the parameter extra="-L".

download.file('http://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta', tempfile(), method = "curl", extra="-L")
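To check whether a URL answers with a redirect, the response headers can be inspected. What follows is a minimal sketch, assuming headers in the form returned by base R's `curlGetHeaders()` (R >= 3.2.0 built with libcurl). `parse_redirect` is a hypothetical helper, and the example headers are canned, not real output from the Box URL.

```r
# Hypothetical helper: extract the status code and Location header from raw
# HTTP response header lines.
parse_redirect <- function(headers) {
  headers <- trimws(headers)
  # First status line, e.g. "HTTP/1.1 302 Found"
  status_line <- grep("^HTTP/", headers, value = TRUE)[1]
  status <- as.integer(sub("^HTTP/[0-9.]+ +([0-9]{3}).*$", "\\1", status_line))
  # Header names are case-insensitive
  loc <- grep("^location:", headers, ignore.case = TRUE, value = TRUE)
  location <- if (length(loc)) trimws(sub("^[^:]+:", "", loc[1])) else NA_character_
  list(status = status,
       location = location,
       is_redirect = status %in% c(301L, 302L, 303L, 307L, 308L))
}

# Canned example response (assumption: illustrative only).
example_headers <- c("HTTP/1.1 302 Found\r\n",
                     "Location: https://example.org/real-file.dta\r\n",
                     "\r\n")
info <- parse_redirect(example_headers)
str(info)

# With network access, real headers could be fetched without following
# redirects and passed to the same helper:
# parse_redirect(curlGetHeaders(url, redirect = FALSE))
```

If `is_redirect` is TRUE, a method that does not follow redirects will only save the redirect response, not the file at `location`.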

There's a package called downloader which claims to offer a good cross-platform solution for https. Given an http URL, it just passes the call onto download.file. Here's a version that works for http too. It also defaults to binary transfers, which seems generally to be a good idea.

my_download <- function(url, destfile, method, quiet = FALSE,
                        mode = "wb", cacheOK = TRUE, extra = getOption("download.file.extra")) {
  if (.Platform$OS.type == "windows" && (missing(method) || method %in% c("auto", "internal", "wininet"))) {
    seti2 <- utils::"setInternet2"
    internet2_start <- seti2(NA)
    on.exit(suppressWarnings(seti2(internet2_start)))
    suppressWarnings(seti2(TRUE))
  } else {
    if (missing(method)) {
      if (nzchar(Sys.which("wget")[1])) {
        method <- "wget"
      } else if (nzchar(Sys.which("curl")[1])) {
        method <- "curl"
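        # extra is NULL when the download.file.extra option is unset, and
        # grepl() on NULL returns logical(0), which would make if() fail
        if (is.null(extra) || !any(grepl("-L", extra, fixed = TRUE))) {
          extra <- paste(c("-L", extra), collapse = " ")
        }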
      } else if (nzchar(Sys.which("lynx")[1])) {
        method <- "lynx"
      } else {
        stop("no download method found")
      }
    }
  }
  download.file(url = url, destfile = destfile, method = method, quiet = quiet, mode = mode,
                cacheOK = cacheOK, extra = extra)
}
Nick Kennedy
  • Could you please explain how you manage to trace it back to this source (esp because this is a `.External()` call)? And given this is how the internal method works, why doesn't it work with the file in my question? – Heisenberg Jun 26 '15 at 22:40
  • @Heisenberg added this to the above, as well as a crossplatform function which should follow 302 redirects on Windows, Mac and Linux. Based on `downloader::download` – Nick Kennedy Jun 27 '15 at 07:39
  • Great detailed answer! So what I deduce from this is that 1) the "internal" method doesn't work in this case because the link changes to `https`, perhaps as a Box.com feature? And 2) the solution you suggest is to use the `downloader` package or the custom download function? – Heisenberg Jun 27 '15 at 16:49
  • @Heisenberg not quite. The internal method on Linux doesn't handle 302 redirects at all. Nor does curl without the -L argument. The `downloader` package does nothing unless there is an https URL. The function I've provided should (in theory) work for the provided URL on all three platforms, but I haven't tested it on Mac OS X. – Nick Kennedy Jun 27 '15 at 17:08
  • Sorry, I shouldn't have said "does nothing" for `downloader::download`; I meant it merely passes the request on unchanged to `download.file`. – Nick Kennedy Jun 27 '15 at 17:29
  • @Heisenberg did you have any further questions related to this code? Does the code above meet the requirements for your package? It's perhaps worth also mentioning the `httr` package if the intention is to download data directly into R rather than into an external file. – Nick Kennedy Jun 29 '15 at 15:12
  • Thanks very much! Your answer on how to grep the external source files is very educational, as is the part on the `302 Found` response. In terms of an actual solution, I've gone with the `rio` package's `import` function, which is able to handle the URL without a custom function. – Heisenberg Jun 29 '15 at 22:51

You can answer this yourself. Just type download.file at the console prompt and you should see this near the top of the function definition:

if (method == "auto") {   # this is actually the default from
                          # getOption("download.file.method", default = "auto")

        if (capabilities("http/ftp")) 
            method <- "internal"
        else if (length(grep("^file:", url))) {
            method <- "internal"
            url <- URLdecode(url)
        }
        else if (system("wget --help > /dev/null") == 0L) 
            method <- "wget"
        else if (system("curl --help > /dev/null") == 0L) 
            method <- "curl"
        else if (system("lynx -help > /dev/null") == 0L) 
            method <- "lynx"
        else stop("no download method found")
    }
    if (method == "internal") {
        status <- .External(C_download, url, destfile, quiet, 
            mode, cacheOK)
        if (!quiet) 
            flush.console()
    }
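To see which of these branches could apply on a given machine, the same checks can be run directly. This is a rough sketch rather than what `download.file` itself does: `available_methods` is a hypothetical helper, and it uses `Sys.which()` instead of the `system()` calls shown above.

```r
# Hypothetical helper: report which download back ends appear usable here.
available_methods <- function() {
  tool_names <- c("wget", "curl", "lynx")
  # Sys.which() returns "" for commands that are not on the PATH
  tools_found <- nzchar(unname(Sys.which(tool_names)))
  names(tools_found) <- tool_names
  c(internal = isTRUE(unname(capabilities("http/ftp"))),
    libcurl  = isTRUE(unname(capabilities("libcurl"))),
    tools_found)
}

methods_here <- available_methods()
print(methods_here)

# The configured default, which is "auto" unless the option has been set:
print(getOption("download.file.method", default = "auto"))
```

The result depends on the platform and the R build, so no particular output is assumed here.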
IRTFM
  • Thanks for pointing towards the source. However, since R is calling the external function `C_download` (instead of a `.Internal()` or `.Primitive()`), I couldn't find the source code for it. So, it's not very clear what calling `.External(C_download)` means across platforms. – Heisenberg Jun 18 '15 at 16:25
  • The C code is always available: https://stackoverflow.com/questions/19226816/how-can-i-view-the-source-code-for-a-function – IRTFM Dec 01 '22 at 02:11