RCurl defines a function and a class, CFILE, for working with C-level file handles. From the manual:

The intent is to be able to pass these to libcurl as options so that it can read or write from or to the file. We can also do this with R connections and specify callback functions that manipulate these connections. But using the C-level FILE handle is likely to be significantly faster for large files.

There are no examples related to downloads so I tried:

library(RCurl)
u = "http://cran.r-project.org/web/packages/RCurl/RCurl.pdf"
f = CFILE("RCurl.pdf", mode="wb")
ret= getURL(u,  write = getNativeSymbolInfo("R_curl_write_binary_data")$address,
                file  = f@ref)

I also tried replacing the file option with writedata = f@ref. The file is downloaded, but it is corrupted. Writing a custom callback for the write argument works only for non-binary data.

Is there a way to download a binary file straight to disk (without loading it into memory) with RCurl?

antonio

2 Answers

7

I think you want to use writedata, and remember to close the file:

library(RCurl)
filename <- tempfile()
f <- CFILE(filename, "wb")
url <- "http://cran.fhcrc.org/Rlogo.jpg"
curlPerform(url = url, writedata = f@ref)
close(f)

For more elaborate writing, I'm not sure if this is the best way, but Linux tells me, from

man curl_easy_setopt

that there's a curl option CURLOPT_WRITEFUNCTION that takes a pointer to a C function with prototype

size_t function(void *ptr, size_t  size, size_t nmemb, void *stream);

and in R at the end of ?curlPerform there's an example of calling a C function as the 'writefunction' option. So I created a file curl_writer.c

#include <stdio.h>

size_t
writer(void *buffer, size_t size, size_t nmemb, void *stream)
{
    fprintf(stderr, "<writer> size = %d, nmemb = %d\n",
            (int) size, (int) nmemb);
    return size * nmemb;
}

Compiled it

R CMD SHLIB curl_writer.c

which on Linux produces a file curl_writer.so, and then in R

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
curlPerform(URL=url, writefunction=writer)

and get on stderr

<writer> size = 1, nmemb = 2653
<writer> size = 1, nmemb = 520
OK 

These two ideas can be integrated, i.e., writing to an arbitrary file using an arbitrary function, by modifying the C function to use the FILE * we pass in, as

#include <stdio.h>

size_t
writer(void *buffer, size_t size, size_t nmemb, void *stream)
{
    FILE *fout = (FILE *) stream;
    fprintf(fout, "<writer> size = %d, nmemb = %d\n",
            (int) size, (int) nmemb);
    fflush(fout);
    return size * nmemb;
}

and then back in R after compiling

dyn.load("curl_writer.so")
writer <- getNativeSymbolInfo("writer", PACKAGE="curl_writer")$address
f <- CFILE(filename <- tempfile(), "wb")
curlPerform(URL=url, writedata=f@ref, writefunction=writer)
close(f)

getURL can be used here too, provided both writedata=f@ref and write=writer are supplied. I think the problem in the original question is that R_curl_write_binary_data is really an internal function that writes to a buffer managed by RCurl, not to a file handle like the one created by CFILE. Likewise, specifying writedata without write (which, judging from the source code of getURL, seems to be an alias for writefunction) sends a pointer to a file to a function expecting a pointer to something else. For getURL, both writedata and write need to be provided.
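
Note that the writer above only logs the chunk sizes to the file; it does not store the downloaded bytes. A minimal sketch of a callback that writes the received data itself might look like the following. The name file_writer is hypothetical, and the sketch assumes that the stream argument is the FILE * handed over through the writedata option (for example f@ref from CFILE).

```c
#include <stdio.h>

/* Hypothetical callback name; same prototype as the writer above.
 * Assumes `stream` is the FILE * passed through the writedata option. */
size_t
file_writer(void *buffer, size_t size, size_t nmemb, void *stream)
{
    FILE *fout = (FILE *) stream;

    /* fwrite copies raw bytes using the explicit length, so embedded
     * NUL bytes in binary data are preserved. */
    size_t written = fwrite(buffer, size, nmemb, fout);

    /* libcurl expects the number of bytes handled; returning fewer
     * than size * nmemb signals an error and stops the transfer. */
    return written * size;
}
```

Compiled with R CMD SHLIB and loaded with dyn.load in the same way as curl_writer.c, its address could then be supplied as writefunction together with writedata=f@ref.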

Martin Morgan
  • Thanks. As I wrote, I had tried `getURL(url = url, writedata = f@ref)`, which doesn't work. So it seems that only a subset of the parameters in `listCurlOptions()` can be actually passed to `getURL`. Some are accepted only by `curlPerform`. I don't think this is mentioned by the manual. – antonio Mar 17 '13 at 23:09
  • @antonio from looking at `getURL` and the RCurl source code, the default argument `write` is not appropriate for a custom file, and R_curl_write_binary_data is operating on an internal data structure not a file handle; providing both `write` and `writedata` arguments is enough, I think to use getURL. – Martin Morgan Mar 18 '13 at 02:02
  • As you said, one has to look at the source code. Some more hints in the manual could be helpful. – antonio Mar 18 '13 at 20:56
1

I am working on this problem as well and don't have an answer yet.

However, I did find this:

http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTWRITEDATA

Are you working on R under Windows? I am.

The documentation for the writedata option indicates that on Windows you must use writefunction along with writedata.

Reading http://www.omegahat.org/RCurl/RCurlJSS.pdf, I found that RCurl lets the writefunction be an R function, so we can implement that ourselves on Windows. It will be slower than using a C function to write the data, but I bet the speed of the network link will be the bottleneck.

getURI(url="sftp://hostname/home/me/onegeebee", curl=con, write=function(x) writeChar(x, f, eos=NULL))
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) : embedded nul in string: ' <`á\017_\021

(This is after creating a 1GB file on the server to test transfer speed)

I haven't yet found an answer that doesn't choke on NUL bytes in the data. It seems that somewhere in the bowels of the RCurl package, when it passes data up into R to execute the writefunction you supply, it tries to convert the data into a character string. Presumably it does not do that if you use a C function. Notably, using the recommended R_curl_write_binary_data callback along with CFILE kills rsession.exe on win32 every time for me.
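
To illustrate why a string conversion would choke on binary data while a C callback need not, here is a small sketch. It is not RCurl's code; the function names length_as_string and length_as_bytes are made up for illustration. It only contrasts treating a chunk as a NUL-terminated string with writing it using the explicit size and nmemb that a write callback receives.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: the length seen when a chunk is treated as a
 * NUL-terminated C string. Counting stops at the first NUL byte. */
size_t
length_as_string(const char *buffer)
{
    return strlen(buffer);
}

/* Hypothetical helper: the number of bytes written when the explicit
 * size * nmemb length is used, as a C write callback would do. */
size_t
length_as_bytes(const void *buffer, size_t size, size_t nmemb, FILE *out)
{
    return fwrite(buffer, size, nmemb, out) * size;
}
```

For a chunk containing a NUL byte in the middle, the first helper reports only the bytes before the NUL, while the second handles the whole chunk.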

Keith Twombley