Consider this simple RCurl function to report download progress:

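library(RCurl)
# Fetch a URL and print the (total, running) download sizes on each progress callback
curlDown <- function(url, follow = TRUE) {
    x <- getURL(url, followlocation = follow, noprogress = FALSE,
        progressfunction = function(down, up) cat(down, '\n'))
}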

Note that with followlocation=TRUE (the default in curlDown) we follow any redirect location that the server sends as part of the HTTP response header.

We get:

curlDown("http://www.example.com")
# 0 0 
# 1270 1079 
# 1270 1127 
# 1270 1270 
# 1270 1270 
# 1270 1270 

As you can see, the down variable passed to the callback by RCurl is a numeric vector, where the first element is the total download size in bytes and the second is the running download size. For space reasons I don't show it here, but on separate inspection I saw that the former equals the Content-Length field in the response header.

Not every server gives the Content-Length field in the response header:

curlDown("http://www.google.it")
# 0 0  
# 0 603
# ... blah blah
# 0 44848 
# 0 44848 

In this case RCurl sets the missing total value to zero (would NA have been better?).

The main Google domain, ".com", redirects to a country-specific domain, for example ".it" if you are querying from the country associated with that domain (Italy). If you are physically located in the ".it" country, you get:

curlDown("http://www.google.com")
# 0 0 
# 274 274 
# 274 274 
# 274 274 
# 274 0 
# 274 0 
# 274 603
# ... blah blah
# 274 44896 
# 274 44896 

These results are strange. If you compare the running download values with those of the previous curlDown("http://www.google.it"), you see that after the redirect the values are essentially the same, as expected; but the total is smaller than the running download!

To understand the problem, we disable redirect following:

curlDown("http://www.google.com", follow=FALSE)
# 0 0 
# 274 274 
# 274 274 
# 274 274 

The main domain server, .com, sends the Content-Length, 274 bytes, while the redirected server does not (see the zeros in curlDown("http://www.google.it")).

The problem is that, after redirection, RCurl does not update the value of the total download size (to zero, for the case of unknown size), so it remains stuck at the wrong value of 274 bytes.
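
A progress bar can guard against this stale total with a small consistency check. The following is a minimal C sketch, not part of RCurl or libcurl: the names total_is_usable and report are hypothetical, and the sketch assumes the same (total, running) pair that the progress callback receives. It treats the total as unknown when it is zero or when the running size has already exceeded it.

```c
#include <stdio.h>

/* Hypothetical helper (not a libcurl function): returns 1 when the reported
 * total can be trusted for a percentage, 0 when it should be treated as
 * unknown. A total of 0 means no Content-Length was reported; a running size
 * larger than the total means the total is stale (e.g. left over from the
 * response that issued a redirect). */
int total_is_usable(double dltotal, double dlnow)
{
    return dltotal > 0.0 && dlnow <= dltotal;
}

/* Prints one progress line to stderr, falling back to a plain byte count
 * when the total cannot be trusted. */
void report(double dltotal, double dlnow)
{
    if (total_is_usable(dltotal, dlnow))
        fprintf(stderr, "%.0f of %.0f bytes\n", dlnow, dltotal);
    else
        fprintf(stderr, "%.0f bytes (total unknown)\n", dlnow);
}
```

With the outputs above, the pair (274, 603) is rejected while (1270, 1079) is accepted. The check cannot detect a stale total while the running size is still below it, so the first callbacks after a redirect may still look plausible.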

Is this a BUG or am I missing something?

antonio

3 Answers

I think RCurl is faithfully forwarding the values from curl; e.g., as documented for curl_easy_setopt under CURLOPT_PROGRESSFUNCTION, unknown values are reported as 0. If there's a bug, then it's in curl. Here's a simple program (see here to get going):

#include <stdio.h>
#include <curl/curl.h>

/* Progress callback: libcurl expects an int return (non-zero aborts the transfer). */
static int progress(void *clientp, double dltotal, double dlnow,
                    double ultotal, double ulnow)
{
    fprintf(stderr, "PROGRESS: %.0f %.0f %.0f %.0f\n",
            dltotal, dlnow, ultotal, ulnow);
    return 0;
}

int main(int argc, char **argv)
{
    CURL *curl;
    CURLcode res;

    if (argc < 2) {
        fprintf(stderr, "usage: %s URL\n", argv[0]);
        return 1;
    }

    curl = curl_easy_init();
    if (!curl)
        return 1;
    curl_easy_setopt(curl, CURLOPT_URL, argv[1]);
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);  /* follow redirects */
    curl_easy_setopt(curl, CURLOPT_NOPROGRESS, 0L);      /* enable progress reporting */
    curl_easy_setopt(curl, CURLOPT_PROGRESSFUNCTION, progress);
    res = curl_easy_perform(curl);
    curl_easy_cleanup(curl);

    return res == CURLE_OK ? 0 : 1;
}

and its evaluation:

$ clang curl.c -lcurl && ./a.out http://google.com > /dev/null
PROGRESS: 0 0 0 0
PROGRESS: 0 0 0 0
PROGRESS: 219 219 0 0
PROGRESS: 219 219 0 0
PROGRESS: 219 219 0 0
PROGRESS: 219 219 0 0
PROGRESS: 219 0 0 0
PROGRESS: 219 2097 0 0
PROGRESS: 219 6441 0 0
PROGRESS: 219 12233 0 0
PROGRESS: 219 20921 0 0
PROGRESS: 219 32505 0 0
PROGRESS: 219 45360 0 0
PROGRESS: 219 45360 0 0
PROGRESS: 219 45360 0 0
Martin Morgan
  • so let's say at least a weird behaviour... (on behalf of Curl). I am designing a progress bar and it gets a problem when the total size is wrong. – antonio Feb 23 '14 at 22:54

There are various relevant answers here, for example:

In short, it's not possible to create a progress bar for a site that uses chunked transfer encoding (i.e., the situations where there is no "Content-Length" header).

You'll have to either skip the progress bar in those cases (see, as an example, my answer to your previous question) or set a very high initial overestimate for the file size, knowing that the bar will never actually reach 100%.
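
The first option can be sketched in C as a label formatter that shows a percentage only when a usable total is available. This is an illustrative sketch under stated assumptions, not code from curl or RCurl: the function name format_progress is hypothetical, and the total is assumed to be 0 when no Content-Length was sent.

```c
#include <stdio.h>

/* Hypothetical helper: writes a progress label into buf and returns the
 * value of snprintf. With a usable total (positive and not yet exceeded by
 * the running size) the label is a percentage; otherwise it is the plain
 * byte count, so no misleading bar is drawn. */
int format_progress(char *buf, size_t size, double dltotal, double dlnow)
{
    if (dltotal > 0.0 && dlnow <= dltotal)
        return snprintf(buf, size, "%3.0f%%", 100.0 * dlnow / dltotal);
    return snprintf(buf, size, "%.0f bytes", dlnow);
}
```

For example, the pair (1270, 1079) gives " 85%", while (0, 44848) gives "44848 bytes".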

Thomas

Based also on your feedback (i.e., nobody found a rationale for the described behaviour), I conclude there is an actual bug (in curl).

One way to work around it in RCurl is to requery the server manually whenever a Location redirect field is found in the server response.

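# Follow redirects manually, so that each hop reports its own total size
curlDown <- function(url, curl = NULL) {
    if (is.null(curl)) curl <- getCurlHandle()
    h <- basicHeaderGatherer()
    x <- getURL(url, curl = curl, noprogress = FALSE,
        headerfunction = h$update,
        progressfunction = function(down, up) cat(down, '\n'))
    loc <- h$value()["Location"]
    # Requery the redirect target if the server sent a Location header
    if (!is.na(loc)) curlDown(loc)
}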

Now we query a server with a redirect:

curlDown("http://www.google.com")
# 0 0 
# 258 258 
# 258 258 
# 258 258 
# 0 0 
# 0 603 
# 0 2003 
# ... blah blah
# 0 44824 
# 0 44824 
# 0 44824 

When the request is redirected from the main server to the country-specific server, the new server's response has no Content-Length, and this is reported as zero (in line with RCurl's general behaviour).

antonio