I'm trying to save a complete web page to a .txt file from C++ (Visual Studio 2013) using cURL. The download itself works, but the site relies heavily on JavaScript to generate the page. As a result, the file cURL saves has only ~170 lines, whereas saving the same page from Google Chrome (Ctrl+S) produces an .htm file with over 2000 lines. Is there any way to save the fully loaded page to a file? This is the code I'm using:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <curl/curl.h>

struct MemoryStruct {
    char *memory;
    size_t size;
};

static size_t
WriteMemoryCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    size_t realsize = size * nmemb;
    struct MemoryStruct *mem = (struct MemoryStruct *)userp;

    /* grow via a temporary pointer so the old block is not leaked if realloc fails */
    char *ptr = (char*)realloc(mem->memory, mem->size + realsize + 1);
    if (ptr == NULL) {
        /* out of memory! */
        printf("not enough memory (realloc returned NULL)\n");
        return 0;
    }
    mem->memory = ptr;

    memcpy(&(mem->memory[mem->size]), contents, realsize);
    mem->size += realsize;
    mem->memory[mem->size] = 0;

    return realsize;
}


int main(void)
{
    CURL *curl_handle;
    CURLcode res;

    struct MemoryStruct chunk;

    chunk.memory = (char*)malloc(1);  /* will be grown as needed by the realloc above */
    chunk.size = 0;    /* no data at this point */

    curl_global_init(CURL_GLOBAL_ALL);

    /* init the curl session */
    curl_handle = curl_easy_init();

    /* specify URL to get */
    curl_easy_setopt(curl_handle, CURLOPT_URL, "http://www.example.com/");

    /* send all data to this function  */
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, WriteMemoryCallback);

    /* we pass our 'chunk' struct to the callback function */
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, (void *)&chunk);

    /* some servers don't like requests that are made without a user-agent
    field, so we provide one */
    curl_easy_setopt(curl_handle, CURLOPT_USERAGENT, "libcurl-agent/1.0");

    /* get it! */
    res = curl_easy_perform(curl_handle);

    /* check for errors */
    if (res != CURLE_OK) {
        fprintf(stderr, "curl_easy_perform() failed: %s\n",
            curl_easy_strerror(res));
    }
    else {
        /*
        * Now, our chunk.memory points to a memory block that is chunk.size
        * bytes big and contains the remote file.
        *
        * Do something nice with it!
        */

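        printf("%lu bytes retrieved\n", (unsigned long)chunk.size);

        /* save only after a successful transfer; write() copies exactly
           chunk.size bytes, even if the body contains NUL characters */
        std::ofstream oplik("test.txt", std::ios::binary);
        oplik.write(chunk.memory, (std::streamsize)chunk.size);
    }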

    /* cleanup curl stuff */
    curl_easy_cleanup(curl_handle);

    if (chunk.memory)
        free(chunk.memory);

    /* we're done with libcurl, so clean it up */
    curl_global_cleanup();

    return 0;
}

Thanks for your help, and sorry for my bad English.

Mona

1 Answer


cURL can only save what is delivered by the web server.

cURL does not execute scripts. If you want to save anything beyond that, such as content generated by JavaScript, you must include a JavaScript interpreter that builds the web page the way a web browser does.
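As a quick way to confirm that the missing lines come from scripts rather than from a transfer problem, you could count the script tags in what cURL actually received. This is only a rough sketch: `count_script_tags` is a hypothetical helper, not part of libcurl, and a plain substring search is a heuristic rather than real HTML parsing.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not a libcurl function): counts occurrences of
// "<script", case-insensitively, in the HTML returned by the server.
// A high count next to a short body suggests the page is built client-side.
std::size_t count_script_tags(const std::string &html)
{
    // Lower-case a copy so that <SCRIPT> and <Script> are matched too.
    std::string lower(html);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string needle = "<script";
    std::size_t count = 0;
    for (std::size_t pos = lower.find(needle); pos != std::string::npos;
         pos = lower.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}
```

With the question's code, this could be called as `count_script_tags(std::string(chunk.memory, chunk.size))` after `curl_easy_perform` succeeds. It only diagnoses the situation; it does not run the scripts.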

Olaf Dietsche
  • I have no idea how to do this. Isn't there an easier way, such as opening the page the way Internet Explorer does and then reading the generated data? – Mona Feb 17 '14 at 05:59
  • I don't know either, because I'm not familiar with Windows or IE, but I can imagine there is some component that allows this. Otherwise, you could look into [embed V8](https://developers.google.com/v8/embed) or http://stackoverflow.com/q/93692/1741542 – Olaf Dietsche Feb 17 '14 at 08:02