extract text from a webpage file using C/C++

Question

How can I extract text from a specific area of a webpage (in Arabic, not English), given the URL, using C or C++?

For example: given the URL of this Wikipedia article, I want to extract the body of the article (highlighted in the image below) and throw away the other parts of the webpage, such as the heading and the menus on the right and left. I only need the body, parsed into a string.

example image

Expanding on the "use cURL" comment-and-link: of course no library or tool will work "out of the box". cURL allows easy downloading of your page. Cleaning the HTML of anything you don't need is something you need to write. — Jongware, Apr 09 '14 at 18:07
As a note, there is no language called C/C++. Modern C and modern C++ are very different languages and the idiomatic solution for one may not work in the other. Unless you are asking to compare/contrast in some way, tag only the language you are actually using/compiling. — crashmstr, Apr 09 '14 at 18:46
@crashmstr thanks but slash "/" means "or". I meant c or c++. — CSawy, Apr 12 '14 at 23:32

score 1 · Accepted Answer · answered Apr 09 '14 at 18:37

To get only the article text from a Wikipedia page, append ?action=render to the URL.
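
As a rough sketch of that step, the helper below builds the render URL from an article URL. The name `add_render_action` is made up for illustration, and it assumes the URL has no `#fragment` part. It uses `&` instead of `?` when the URL already carries a query string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes url plus "action=render" into out.
   Uses '&' if url already contains a query string, '?' otherwise.
   Assumes url has no "#fragment" part.
   Returns 0 on success, -1 if out is too small. */
int add_render_action(const char *url, char *out, size_t out_size)
{
    assert(url != NULL && out != NULL);

    const char *sep = strchr(url, '?') ? "&" : "?";
    int n = snprintf(out, out_size, "%s%saction=render", url, sep);

    /* snprintf reports the length it wanted to write; a value that
       does not fit in out means the result was truncated. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

The resulting string can then be passed to `CURLOPT_URL` in place of a hard-coded address.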

Then fetch it with e.g. libcurl. Search the web for curl C/C++ tutorials if you don't know how. You are looking for something like this (just to give you an idea):

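#include <stdio.h>
#include <curl/curl.h>

int main(void) {

    CURL* curl;
    CURLcode result;

    curl = curl_easy_init();
    if (curl == NULL) {
        fprintf(stderr, "curl_easy_init failed\n");
        return 1;
    }

    /* ?action=render asks the wiki for the article body only. */
    curl_easy_setopt(curl, CURLOPT_URL, "https://ar.wikipedia.org/wiki/%D8%B3%D9%8A_%D8%A5%D9%86_%D8%A5%D9%86_%D8%A7%D9%84%D8%B9%D8%B1%D8%A8%D9%8A%D8%A9?action=render");

    /* With no write callback set, libcurl prints the response body to stdout. */
    result = curl_easy_perform(curl);
    if (result != CURLE_OK) {
        fprintf(stderr, "curl_easy_perform failed: %s\n", curl_easy_strerror(result));
    }

    curl_easy_cleanup(curl);

    return result == CURLE_OK ? 0 : 1;
}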

answered Apr 09 '14 at 18:37 by leo

in the above program, which variable/object holds the content of the webpage? – Santhosh Kumar Nov 03 '16 at 08:23
I don't know a lot of C++, but I think you can provide a callback function like this: `curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, myWriteFunction);` – leo Nov 03 '16 at 13:50
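
To make the callback idea concrete, here is a minimal sketch of a write callback that collects the page into a string. The names `page_buffer` and `append_chunk` are invented for this example. The function signature is the one libcurl documents for `CURLOPT_WRITEFUNCTION`, so the block itself does not need the curl headers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer that receives the page body. */
struct page_buffer {
    char  *data;  /* NUL-terminated bytes received so far, or NULL */
    size_t len;   /* number of bytes, not counting the terminator */
};

/* Same shape as a CURLOPT_WRITEFUNCTION callback: libcurl calls it
   once per received chunk, and userdata is the pointer that was set
   with CURLOPT_WRITEDATA. Returning a count different from the bytes
   received makes libcurl abort the transfer. */
size_t append_chunk(char *chunk, size_t size, size_t nmemb, void *userdata)
{
    struct page_buffer *buf = userdata;
    size_t n = size * nmemb;
    assert(buf != NULL);

    char *grown = realloc(buf->data, buf->len + n + 1);
    if (grown == NULL)
        return 0;  /* out of memory: report failure to the caller */

    memcpy(grown + buf->len, chunk, n);  /* append the new bytes */
    buf->data = grown;
    buf->len += n;
    buf->data[buf->len] = '\0';          /* keep it usable as a C string */
    return n;
}
```

To use it, declare `struct page_buffer buf = {0};`, then call `curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, append_chunk);` and `curl_easy_setopt(curl, CURLOPT_WRITEDATA, &buf);` before `curl_easy_perform(curl)`. After a successful transfer, `buf.data` holds the page content (the Arabic text arrives as UTF-8 bytes), and it should be released with `free(buf.data)` when no longer needed.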
