6

Question 1: When a URL is downloaded using libcurl, how can the original name of the downloaded file be preserved? libcurl asks the programmer to supply the filename. That is easy when the URL contains the name, e.g. for the URL below it is easy to figure out that the target name is vimqrc.pdf.

 http://tnerual.eriogerg.free.fr/vimqrc.pdf

But it is not obvious when the URL generates the target name dynamically. For example, the URL below downloads AdbeRdr1010_eu_ES.exe with wget (no arguments except the URL) and with curl (argument -O):

http://get.adobe.com/reader/download/?installer=Reader_10.1_Basque_for_Windows&standalone=1%22

How does curl (-O) or wget figure out the name of the file to save? The code I have so far:

//invoked as ./a.out <URL>

#include <stdio.h>
#include <curl/curl.h>

char *location = "/tmp/test/out";

size_t write_data(void *ptr, size_t size, size_t nmemb, FILE *stream) {
    size_t written = fwrite(ptr, size, nmemb, stream);
    return written;
}

int main(int argc, char *argv[])
{
    CURL        *curl;
    CURLcode    res;
    int         ret = -1;


    if (argc!= 2) {
        //invoked as ./a.out <URL>
        return -1;
    } 

    curl = curl_easy_init();
    if (!curl) {
        goto bail;
    }

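    FILE *fp = fopen(location, "wb");
    if (!fp) {
        /* cannot create the output file: release the handle and give up */
        perror(location);
        curl_easy_cleanup(curl);
        return ret;
    }
    curl_easy_setopt(curl, CURLOPT_URL, argv[1]); //invoked as ./a.out <URL>
    /* the URL may be redirected, so we tell libcurl to follow redirection */
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_data);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);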

    /* Perform the request, res will get the return code */
    res = curl_easy_perform(curl);
    /* Check for errors */
    if(res != CURLE_OK)
        fprintf(stderr, "curl_easy_perform() failed: %s\n",
                curl_easy_strerror(res));

    /* always cleanup */
    curl_easy_cleanup(curl);
    ret = 0;
    fclose(fp);

bail:
    return ret;
}
bladeWalker
  • possible duplicate of [Download file using libcurl in C/C++](http://stackoverflow.com/questions/1636333/download-file-using-libcurl-in-c-c) – WhozCraig Aug 29 '14 at 22:04
  • Maybe I was not clear, my need is to preserve the original name of the downloaded file and not use name specified explicitly. Second requirement is to download at specific location. Sorry, but I could not find answers for these in suggested duplicate. – bladeWalker Aug 29 '14 at 22:11
  • 1
    The latter is done by writing the file to whatever location *you* decide to write to. **You** open the `FILE*` to which you target the write. Regarding the first answer, libcurl pulls the file as a byte stream (if configured properly). If you want to "know" the name of the file you just requested you could either retain it in the WRITEDATA you provide (a struct including the name and a `FILE*` for writing would work), or a more elaborate HEADERFUNCTION/DATA could be used, though it would be considerably more complex. Is *that* what you're trying to do? And what do you have so far? Post it. – WhozCraig Aug 29 '14 at 22:16
  • @WhozCraig Added code I have so far. Thanks for the help. – bladeWalker Aug 30 '14 at 00:08
  • You look like you're setting up the request correctly. Btw did you know if you simply set `curl_easy_setopt(curl, CURLOPT_WRITEDATA, fp);` and do *not* set the write *function* option, libcurl will expect the writedata you set to be a file pointer and perform the write operation you're doing automatically? (note: this is *NOT* true for Windows libcurl as a DLL; you must provide a write-function on that platform). Anyway, is the problem you're trying to solve your write-function somehow being aware of what the *source* URL was? or what the *target* filename was? (or both) ? – WhozCraig Aug 30 '14 at 00:31
  • The latter, "what the target filename was?", and also a way to specify the directory for the "target", e.g. download "target" into /tmp/test. – bladeWalker Aug 30 '14 at 00:36
  • So when this is finished you want `/tmp/test/vimqrc.pdf` to be somehow derived from the source URL and some local path? So some function like `download(char const url[], char const target_fldr[]);` that assembles the target path from the source URL and target directory, requests the file, writes it, and is done? Is that close? It's a housekeeping thing on your part (some string parsing and some potential path-creating, etc.), but it should be doable. I don't think libcurl has a mechanism for specifying that action in its options. Pretty sure you have to do it yourself. – WhozCraig Aug 30 '14 at 00:39
  • Just like wget. e.g. "wget http://tnerual.eriogerg.free.fr/vimqrc.pdf" downloads "vimqrc.pdf" and "wget http://www.vim.org/ugrankar.pdf" downloads "ugrankar.pdf" AND "http://get.adobe.com/reader/download/?installer=Reader_10.1_Basque_for_Windows&standalone=1" downloads "AdbeRdr1010_eu_ES.exe", wget is using whatever name file was saved with on server. The same is preserved. Here the name is not generated programmatically. – bladeWalker Sep 02 '14 at 19:14
  • That would be correct. It simply fetches the bits. You (the *programmer*) decide where those bits are written programmatically. – WhozCraig Sep 02 '14 at 19:15
  • Yes, so the question is "is there a way to preserve the name same way using libcurl" ? Thanks for all the help. – bladeWalker Sep 02 '14 at 19:17
  • And again, *you're* the one asking for the file, and *you* already have the name. If you want to write it to a file of the same name then create a file with the same name. If you're crawling a site and wanting to replicate the directory structure or what-not, *you* have to write the code to do that. There is no functionality in libcurl that will reap the name from the download URL and create it for you, but since *you* are providing the URL, you already have it, and since *you* open the `FILE*` to write to, I don't see where the hang-up is. `curl` is not `libcurl`. `curl` *uses* `libcurl`. – WhozCraig Sep 02 '14 at 19:21
  • Take an example: "http://get.adobe.com/reader/download/?installer=Reader_10.1_Basque_for_Windows&standalone=1%22". Here the programmer does not know the name of the target file, which wget saves with the name "AdbeRdr1010_eu_ES.exe". – bladeWalker Sep 02 '14 at 19:24
  • 1
    Ok, so the core question then is how do `wget`, `curl`, etc. reap that file name when it is not specified as part of the URL, and how can *you* do the same? Is that the crux? The directory hierarchy you're going to have to manage yourself, but the unspecified filename is a different, and completely understandable, question (your last link is an excellent example). – WhozCraig Sep 02 '14 at 19:27
  • Yes, that is exactly the question. Sorry for confusion. I should have put it better. – bladeWalker Sep 02 '14 at 19:33
  • It's much clearer now. You can probably summarize much of the question text to that, but keep the code, as it's an excellent starting point for someone to help you. The side-by-side of a "this is easy, since I have the filename: , but how is *this* done*: . I'll poke around, but there are probably some strong libcurl/web guys that know a solution pretty well, so hopefully an answer surfaces (I'm genuinely curious myself now). – WhozCraig Sep 02 '14 at 19:37
  • 2
    A little checking on the URL you provided was enlightening. The amount of downloaded content is interesting because it isn't just a GET and a pull. Loaded in Chrome with debugging, the result is a slurry of 39 subsequent requests launched from the initially downloaded page, including a plethora of JavaScript, and it eventually finishes with a final GET that includes this: `http://ardownload.adobe.com/pub/adobe/reader/win/10.x/10.1.0/eu_ES/AdbeRdr1010_eu_ES.exe` as the `Request URL`. How `wget` and `curl` manage to shield you from all of this *and* work is impressive. – WhozCraig Sep 02 '14 at 20:15
  • @WhozCraig please check the solution I just added. This concludes our discussion. – bladeWalker Sep 16 '14 at 20:54

2 Answers

12

I found the answer in the curl source code. The "remote name" comes from the Content-Disposition response header: the curl tool parses the headers and looks for "filename=" in the Content-Disposition field. This parsing is done in the callback provided through the CURLOPT_HEADERFUNCTION option. Then, in the callback for writing data (provided through CURLOPT_WRITEFUNCTION), this remote name is used to create the output file.

If the file name is missing, it is simply derived from the URL itself. The code below is mostly copied from the curl source, with a few modifications of my own to make it simpler and match my requirement.

#define _GNU_SOURCE     /* for strcasestr() */
#include <stdio.h>
#include <curl/curl.h>
#include <stdint.h>     /* uint64_t */
#include <string.h>
#include <strings.h>    /* strncasecmp() */
#include <sys/stat.h>
#include <sys/types.h>
#include <stdlib.h>

typedef struct {
    char        dnld_remote_fname[4096];
    char        dnld_url[4096]; 
    FILE        *dnld_stream;
    FILE        *dbg_stream;
    uint64_t    dnld_file_sz;
} dnld_params_t;

static int get_oname_from_cd(char const*const cd, char *oname)
{
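    char    const*const cdtag   = "Content-disposition:";
    char    const*const key     = "filename=";
    int     ret                 = 0;
    char    *val                = NULL;
    size_t  n                   = 0;

    /* Example Content-Disposition: filename=name1367; charset=funny; option=strange */

    /* If filename is present */
    val = strcasestr(cd, key);
    if (!val) {
        printf("No key-value for \"%s\" in \"%s\"\n", key, cdtag);
        ret = -1;
        goto bail;
    }

    /* Move to value */
    val += strlen(key);

    /* Skip the opening quote of a quoted-string value, if any */
    if (*val == '"') {
        val++;
    }

    /* Copy value as oname: stop at ';', a closing quote or the CRLF that
     * ends the header line, and stay inside the 4096-byte name buffer */
    while (n + 1 < 4096 && *val != '\0' && *val != ';' && *val != '"' &&
           *val != '\r' && *val != '\n') {
        *oname++ = *val++;
        n++;
    }
    *oname = '\0';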

bail:
    return ret;
}

static int get_oname_from_url(char const* url, char *oname)
{
    int         ret = 0;
    char const  *u  = url;

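    /* Remove "http(s)://" if present */
    u = strstr(url, "://");
    u = u ? u + strlen("://") : url;

    /* Keep what follows the last '/'; without one, keep the whole remainder */
    char const *slash = strrchr(u, '/');
    if (slash) {
        u = slash + 1;
    }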

    /* Copy value as oname */
    while (*u != '\0') {
        //printf (".... %c\n", *u);
        *oname++ = *u++;
    }
    *oname = '\0';

    return ret;
}

size_t dnld_header_parse(void *hdr, size_t size, size_t nmemb, void *userdata)
{
    const   size_t  cb      = size * nmemb;
    dnld_params_t *dnld_params = (dnld_params_t*)userdata;
    char const*const cdtag = "Content-disposition:";
    char    hdr_str[4096];
    size_t  len = cb < sizeof(hdr_str) - 1 ? cb : sizeof(hdr_str) - 1;

    /* libcurl does not guarantee that a header line is NUL-terminated,
     * so work on a terminated (and possibly truncated) local copy */
    memcpy(hdr_str, hdr, len);
    hdr_str[len] = '\0';

    /* Example: 
     * ...
     * Content-Type: text/html
     * Content-Disposition: filename=name1367; charset=funny; option=strange
     */

    /* Header names are case-insensitive, hence strncasecmp */
    if (!strncasecmp(hdr_str, cdtag, strlen(cdtag))) {
        printf ("Found c-d: %s\n", hdr_str);
        int ret = get_oname_from_cd(hdr_str+strlen(cdtag), dnld_params->dnld_remote_fname);
        if (ret) {
            printf("ERR: bad remote name\n");
        }
    }

    return cb;
}

FILE* get_dnld_stream(char const*const fname)
{
    char const*const pre = "/tmp/";
    char out[4096];

    /* pre already ends with '/', so no extra separator is needed */
    snprintf(out, sizeof(out), "%s%s", pre, fname);

    FILE *fp = fopen(out, "wb");
    if (!fp) {
        printf ("Could not create file %s\n", out);
    }

    return fp;
}

size_t write_cb(void *buffer, size_t sz, size_t nmemb, void *userdata)
{
    size_t written = 0;
    dnld_params_t *dnld_params = (dnld_params_t*)userdata;

    /* No name from Content-Disposition: fall back to the URL */
    if (!dnld_params->dnld_remote_fname[0]) {
        get_oname_from_url(dnld_params->dnld_url, dnld_params->dnld_remote_fname);
    }

    if (!dnld_params->dnld_stream) {
        dnld_params->dnld_stream = get_dnld_stream(dnld_params->dnld_remote_fname);
        if (!dnld_params->dnld_stream) {
            /* returning fewer bytes than received makes libcurl abort */
            return 0;
        }
    }

    /* Write with element size 1 so the return value is a byte count */
    written = fwrite(buffer, 1, sz * nmemb, dnld_params->dnld_stream);
    dnld_params->dnld_file_sz += written;
    return written;
}


int download_url(char const*const url)
{
    CURL        *curl;
    int         ret = -1;
    CURLcode    cerr = CURLE_OK;
    dnld_params_t dnld_params;

    memset(&dnld_params, 0, sizeof(dnld_params));
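    /* Bound the copy by the destination size; the memset above keeps it NUL-terminated */
    strncpy(dnld_params.dnld_url, url, sizeof(dnld_params.dnld_url) - 1);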

    curl = curl_easy_init();
    if (!curl) {
        goto bail;
    }

    cerr = curl_easy_setopt(curl, CURLOPT_URL, url);
    if (cerr) { printf ("%s: failed with err %d\n", "URL", cerr); goto bail;}

    cerr = curl_easy_setopt(curl, CURLOPT_HEADERFUNCTION, dnld_header_parse);
    if (cerr) { printf ("%s: failed with err %d\n", "HEADER", cerr); goto bail;}

    cerr = curl_easy_setopt(curl, CURLOPT_HEADERDATA, &dnld_params);
    if (cerr) { printf ("%s: failed with err %d\n", "HEADER DATA", cerr); goto bail;}

    cerr = curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    if (cerr) { printf ("%s: failed with err %d\n", "WR CB", cerr); goto bail;}

    cerr = curl_easy_setopt(curl, CURLOPT_WRITEDATA, &dnld_params);
    if (cerr) { printf ("%s: failed with err %d\n", "WR Data", cerr); goto bail;}


    cerr = curl_easy_perform(curl);
    if(cerr != CURLE_OK) {
        fprintf(stderr, "curl_easy_perform() failed: %s\n", curl_easy_strerror(cerr));
        goto bail;
    }

    printf ("Remote name: %s\n", dnld_params.dnld_remote_fname);
    printf ("file size : %lu\n", (unsigned long)dnld_params.dnld_file_sz);
    ret = 0;

bail:
    /* always cleanup; the stream is still NULL if no body data ever arrived */
    if (dnld_params.dnld_stream) {
        fclose(dnld_params.dnld_stream);
    }
    if (curl) {
        curl_easy_cleanup(curl);
    }
    return ret;
}

int main(int argc, char *argv[])
{
    if (argc != 2) {
        printf ("Bad args\n");
        return -1;
    }
    return download_url(argv[1]);
}
bladeWalker
  • nice work ... but the code that gets the filename from the Content-disposition header should respect the quoted-string syntax as of RFC2616 - https://tools.ietf.org/html/rfc2616#section-2.2 – Sebastian Jul 01 '20 at 20:44
  • Regarding [`strcasestr`](https://stackoverflow.com/q/9935642/2436175) – Antonio Apr 24 '23 at 10:06
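
The two comments above (quoted-string syntax, and `strcasestr` being a non-standard extension) can be addressed together. What follows is only a minimal sketch, not code from curl: `cd_filename` is a hypothetical helper name, it handles the plain `filename=` parameter in either token or quoted-string form, and it deliberately ignores the extended `filename*=` form. It expects a NUL-terminated header value.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libcurl): copy the value of the
 * "filename=" parameter of a Content-Disposition header value into out.
 * Returns 0 on success, -1 if no usable filename is present. */
static int cd_filename(const char *value, char *out, size_t outsz)
{
    static const char key[] = "filename=";
    const size_t keylen = sizeof(key) - 1;
    const char *p = value;
    size_t n = 0;

    assert(value != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    /* Portable case-insensitive search for the key (no strcasestr). */
    for (; *p != '\0'; p++) {
        size_t i = 0;
        while (i < keylen && p[i] != '\0' &&
               tolower((unsigned char)p[i]) == key[i]) {
            i++;
        }
        if (i == keylen) {
            break;
        }
    }
    if (*p == '\0') {
        return -1;              /* key not found */
    }
    p += keylen;

    if (*p == '"') {
        /* quoted-string: read up to the closing quote, honoring \x escapes */
        p++;
        while (*p != '\0' && *p != '"' && *p != '\r' && *p != '\n') {
            if (*p == '\\' && p[1] != '\0') {
                p++;            /* take the escaped character literally */
            }
            if (n + 1 < outsz) {
                out[n++] = *p;
            }
            p++;
        }
    } else {
        /* token: read up to ';', whitespace or the end of the line */
        while (*p != '\0' && *p != ';' && !isspace((unsigned char)*p)) {
            if (n + 1 < outsz) {
                out[n++] = *p;
            }
            p++;
        }
    }
    out[n] = '\0';
    return n > 0 ? 0 : -1;
}
```

Called with the value part of a header such as `attachment; filename="my file.pdf"`, this keeps the space inside the quotes and drops the quotes themselves. A name received from a server should still be checked for path separators before it is used to create a file.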
-2

It's your program, not libcurl, that determines the filename. In your example, you could simply change char *location = "/tmp/test/out"; to char *location = "/tmp/test/vimqrc.pdf"; to get your desired effect.

If you want to derive the download file path programmatically from a URL and a parent directory, you could do something like the following:

int url_to_location(char* location, unsigned int location_length, const char* url, const char* parent_directory)
{
    //char location[MAX_PATH];
    //const char *url = "http://tnerual.eriogerg.free.fr/vimqrc.pdf";
    //const char *parent_directory = "/tmp/test/";

    int last_slash_index = -1;
    int current_index = (int)strlen(url);
    while (current_index >= 0)
    {
        if (url[current_index] == '/')
        {
            last_slash_index = current_index;
            break;
        }
        current_index--;
    }
    unsigned int parent_directory_length = (unsigned int)strlen(parent_directory);
    //fail if the directory alone (plus terminating NUL) does not fit
    if (parent_directory_length >= location_length)
        return -1;
    strcpy(location, parent_directory);
    if (last_slash_index == -1) //no slashes found, use relative url as filename
    {
        if (parent_directory_length + strlen(url) >= location_length)
           return -1;

        strcat(location, url);
    }
    else    //use the characters of the url following the last slash as filename
    {
        if (parent_directory_length + strlen(url + last_slash_index + 1) >= location_length)
           return -1;

        strcat(location, url + last_slash_index + 1);
    }
    return (int)strlen(location);
}
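
As a variation on the function above, here is a rough sketch that also cuts off a query string or fragment before taking the last path segment, and lets `snprintf` do the bounds checking. The name `url_to_path` and its return convention are assumptions made for illustration. It reports failure when the URL path has no final segment, which is exactly the case where the name has to come from somewhere else, such as the response headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "<dir>/<last path segment of url>" in out.
 * Returns 0 on success, -1 if the URL has no usable file segment or if
 * out is too small for the result. */
static int url_to_path(const char *url, const char *dir, char *out, size_t outsz)
{
    const char *start, *end, *slash, *p;
    size_t dirlen;
    int n;

    assert(url != NULL && dir != NULL && out != NULL && outsz > 0);

    /* Skip the scheme ("http://") if there is one. */
    start = strstr(url, "://");
    start = start ? start + 3 : url;

    /* The path part ends at the first '?' or '#'. */
    end = start + strcspn(start, "?#");

    /* Find the last '/' inside the path part. */
    slash = NULL;
    for (p = start; p < end; p++) {
        if (*p == '/') {
            slash = p;
        }
    }
    if (slash == NULL || slash + 1 == end) {
        return -1;              /* e.g. ".../download/?installer=..." */
    }

    /* Join directory and segment, adding a '/' only when dir lacks one. */
    dirlen = strlen(dir);
    n = snprintf(out, outsz, "%s%s%.*s", dir,
                 (dirlen > 0 && dir[dirlen - 1] == '/') ? "" : "/",
                 (int)(end - slash - 1), slash + 1);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

For `http://tnerual.eriogerg.free.fr/vimqrc.pdf` and the directory `/tmp/test` this yields `/tmp/test/vimqrc.pdf`. For the Adobe URL from the question it returns -1, because the path ends in `/` and everything after it is query string.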
notchahm
  • @notcham, the url in program is just for example. I am getting url as input to the program, which could be above or any other url (e.g. http://www.vim.org/ugrankar.pdf). This could also be dynamically generated URL. – bladeWalker Sep 02 '14 at 19:06
  • My example code was meant to handle those cases. To make it clearer, I've modified the code, removing the hard-coded values and formatting as a function that you can pass any url and parent directory to generate the location to write on the local system – notchahm Sep 03 '14 at 01:16
  • I updated the question with more examples. The url does not necessarily always have the target file name. It could be dynamically generated url. wget and curl utilities somehow figure it out. Also read comments from WhozCraig above. – bladeWalker Sep 03 '14 at 06:02