33

I want to write a program in C/C++ that will dynamically read a web page and extract information from it. As an example, imagine you wanted to write an application to follow and log an eBay auction. Is there an easy way to grab the web page? A library which provides this functionality? And is there an easy way to parse the page to get the specific data?

Lightness Races in Orbit
Howard May
  • VERY difficult in C/C++. It's annoying enough even in languages that have extensive support for regular expressions, XML parsing, HTTP methods, etc. (e.g. Java). As for eBay, it has an API you should use. – cletus Dec 23 '08 at 15:03

7 Answers

44

Have a look at the cURL library:

 #include <stdio.h>
 #include <curl/curl.h>

 int main(void)
 {
   CURL *curl;
   CURLcode res;

   curl = curl_easy_init();
   if(curl) {
     curl_easy_setopt(curl, CURLOPT_URL, "curl.haxx.se");
     res = curl_easy_perform(curl);
     if(res != CURLE_OK)
       fprintf(stderr, "curl_easy_perform() failed: %s\n",
               curl_easy_strerror(res));
     /* always cleanup */
     curl_easy_cleanup(curl);
   }
   return 0;
 }

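By default curl_easy_perform() writes the body to stdout. To capture the page into memory for parsing (the auction-logging case), set libcurl's write callback via CURLOPT_WRITEFUNCTION/CURLOPT_WRITEDATA. A minimal sketch; the `buffer` struct and `write_cb` names are just illustrative:

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <curl/curl.h>

 struct buffer { char *data; size_t len; };

 /* libcurl calls this once per chunk of the response body */
 static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userp)
 {
   struct buffer *buf = (struct buffer *)userp;
   size_t total = size * nmemb;
   char *p = (char *)realloc(buf->data, buf->len + total + 1);
   if(!p)
     return 0;                  /* returning 0 aborts the transfer */
   buf->data = p;
   memcpy(buf->data + buf->len, ptr, total);
   buf->len += total;
   buf->data[buf->len] = '\0';
   return total;
 }

 int main(void)
 {
   struct buffer page = { NULL, 0 };
   CURL *curl = curl_easy_init();
   if(curl) {
     curl_easy_setopt(curl, CURLOPT_URL, "https://www.example.com/");
     curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
     curl_easy_setopt(curl, CURLOPT_WRITEDATA, &page);
     if(curl_easy_perform(curl) == CURLE_OK)
       printf("fetched %zu bytes\n", page.len);  /* page.data holds the HTML */
     curl_easy_cleanup(curl);
   }
   free(page.data);
   return 0;
 }
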
BTW, if C++ is not strictly required, I encourage you to try C# or Java. Both make this much easier and have built-in HTTP support.

philant
Gant
  • +1 for cURL - I've used cURL in one of my C++ applications and it works great, even with proxies and all other obstacles you might encounter. – BlaM Dec 23 '08 at 15:37
  • It would be better to return an error if curl is null (in above example). – Matthew Flaschen Dec 23 '08 at 23:27
  • Check out curlpp - C++ wrapper for cURL library – Piotr Dobrogost May 07 '09 at 11:08
  • Thumbs up for suggesting C# or Java. Python is even easier, particularly if you have the Beautiful Soup package installed to help with the parsing. – Mike Housky Dec 27 '18 at 22:17
  • Seems like `if (!curl) return 1;` would make more sense, but I guess that's a nit – monokrome Mar 23 '19 at 03:34
  • Why is this +1'd and chosen as the answer? Where's the actual document? What does the code do? Blatant copy and paste. – Chloe Dev Dec 04 '20 at 18:41
  • It should be noted that this library does not do any parsing of what you download, which is the "hard" part of this problem; it only lets you perform the download request. cURL will get you about as close to an eBay auction logger as Nicolaus Copernicus got NASA to the lunar landing. – Anne Quinn Aug 05 '21 at 19:13
16

Windows code:

#include <winsock2.h>
#include <windows.h>
#include <iostream>
#include <cstring>
#pragma comment(lib,"ws2_32.lib")
using namespace std;
int main (){
    WSADATA wsaData;
    if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0) {
        cout << "WSAStartup failed.\n";
        system("pause");
        return 1;
    }
    SOCKET Socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    struct hostent *host = gethostbyname("www.google.com");
    if (host == NULL) { // name resolution can fail; don't dereference a null pointer
        cout << "Could not resolve host.\n";
        system("pause");
        return 1;
    }
    SOCKADDR_IN SockAddr;
    SockAddr.sin_port = htons(80);
    SockAddr.sin_family = AF_INET;
    SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);
    cout << "Connecting...\n";
    if (connect(Socket, (SOCKADDR*)(&SockAddr), sizeof(SockAddr)) != 0) {
        cout << "Could not connect";
        system("pause");
        return 1;
    }
    cout << "Connected.\n";
    const char* request = "GET / HTTP/1.1\r\nHost: www.google.com\r\nConnection: close\r\n\r\n";
    send(Socket, request, (int)strlen(request), 0);
    char buffer[10000];
    int nDataLength;
    while ((nDataLength = recv(Socket, buffer, sizeof(buffer), 0)) > 0) {
        // write only the bytes recv() actually returned; scanning the buffer
        // for a terminator could read past the end and crash
        cout.write(buffer, nDataLength);
    }
    closesocket(Socket);
    WSACleanup();
    system("pause");
    return 0;
}
Software_Designer
  • Be careful when posting copy-and-paste boilerplate/verbatim answers to multiple questions; these tend to be flagged as "spammy" by the community. If you're doing this, it usually means the questions are duplicates, so flag them as such instead: http://stackoverflow.com/a/12374407/419 – Kev Sep 11 '12 at 23:14
  • This code has serious flaws: 1) If the page is more than 10,000 bytes without non-printable characters, it will read past the end of the buffer and seg-fault. 2) If the webpage has a TAB character in it (or other non-printable characters), this code will skip forward up to 10,000 bytes. 3) New code shouldn't use `gethostbyname()`. It should use `getaddrinfo()` and support IPv4 and IPv6. – Imbue Dec 04 '18 at 19:27
  • The inner while loop can be replaced by `printf("%.*s", nDataLength, buffer);` which is easier, faster, and safer. – Imbue Dec 04 '18 at 19:28
4

There is a free TCP/IP library available for Windows that supports HTTP and HTTPS - using it is very straightforward.

Ultimate TCP/IP

CUT_HTTPClient http;
http.GET("http://folder/file.htm", "c:/tmp/process_me.htm");    

You can also GET files and store them in a memory buffer (via CUT_DataSource derived classes). All the usual HTTP support is there - PUT, HEAD, etc. Support for proxy servers is a breeze, as are secure sockets.

Rob
3

Try using a library such as Qt, which can read data from across a network and extract data from an XML document. For example, you could read eBay's XML feed for an auction. A rough sketch of the idea follows.
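
A rough sketch of that approach, assuming Qt's QNetworkAccessManager and QXmlStreamReader (the feed URL and the <title> element are placeholders, not a real eBay endpoint):

#include <QCoreApplication>
#include <QNetworkAccessManager>
#include <QNetworkReply>
#include <QEventLoop>
#include <QXmlStreamReader>
#include <QDebug>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    // Download the feed, blocking on a local event loop for brevity
    QNetworkAccessManager manager;
    QNetworkReply *reply = manager.get(
        QNetworkRequest(QUrl("https://example.com/feed.xml"))); // placeholder URL
    QEventLoop loop;
    QObject::connect(reply, &QNetworkReply::finished, &loop, &QEventLoop::quit);
    loop.exec();

    // Walk the XML stream and print every <title> element
    QXmlStreamReader xml(reply->readAll());
    while (!xml.atEnd()) {
        if (xml.readNext() == QXmlStreamReader::StartElement
                && xml.name() == QLatin1String("title"))
            qDebug() << xml.readElementText();
    }
    reply->deleteLater();
    return 0;
}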

Marius
2

You can do it with socket programming, but it's tricky to implement the parts of the protocol needed to reliably fetch a page. Better to use a library like neon, which is likely to be packaged in most Linux distributions. Under FreeBSD, use the fetch library.

For parsing the data: because many pages don't use valid XML, you need to implement heuristics rather than a real yacc-based parser. You can implement these using regular expressions or a state machine; a small sketch follows this paragraph. Since what you're trying to do involves a lot of trial and error, you're better off using a scripting language like Perl. Due to the high network latency you will not see any difference in performance.
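
As an illustration, a regex heuristic in C++ might look like this (the markup and pattern are invented for the example; a real page would need more forgiving patterns):

#include <iostream>
#include <regex>
#include <string>

int main()
{
    // Pretend this fragment came from the fetched page
    const std::string html = "<span id=\"currentPrice\">GBP 42.50</span>";

    // Heuristic: target the one element we care about instead of
    // parsing the whole (possibly invalid) document
    const std::regex pricePattern("<span id=\"currentPrice\">([^<]+)</span>");
    std::smatch match;
    if (std::regex_search(html, match, pricePattern))
        std::cout << "current bid: " << match[1] << '\n';
    return 0;
}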

Diomidis Spinellis
  • While they aren't valid XML, many languages have libraries that have HTML parsers, which will let you use a DOM interface to parse an HTML document. – Daniel Papasian Dec 23 '08 at 15:59
  • Yes, neon is nice too (but most of my experience is with curl, as mentioned in m3rLinEz's answer). Any comparison somewhere? – bortzmeyer Dec 23 '08 at 22:27
2

You don't mention a platform, so here's an answer for Win32.

One simple way to download anything from the Internet is URLDownloadToFile with the IBindStatusCallback parameter set to NULL. To make the function more useful (e.g. to report progress), the callback interface needs to be implemented; a sketch of the simple form is below.
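
A minimal sketch (the URL and output path are placeholders):

#include <windows.h>
#include <urlmon.h>
#pragma comment(lib, "urlmon.lib")

int main()
{
    // NULL callback: no progress reporting, just a blocking download
    HRESULT hr = URLDownloadToFile(
        NULL,                              // not hosted in an ActiveX container
        TEXT("https://www.example.com/"),  // placeholder URL
        TEXT("C:\\tmp\\page.htm"),         // placeholder output file
        0,                                 // reserved, must be 0
        NULL);                             // IBindStatusCallback*
    return SUCCEEDED(hr) ? 0 : 1;
}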

Johann Gerell
2

It can be done with the cross-platform Qt library:

QByteArray WebpageDownloader::downloadFromUrl(const std::string& url)
{
    QNetworkAccessManager manager;
    // Start the asynchronous GET; the reply is owned by (and freed with) the manager
    QNetworkReply *response = manager.get(QNetworkRequest(QUrl(url.c_str())));
    // Spin a local event loop to block until the request has finished
    QEventLoop event;
    QObject::connect(response, &QNetworkReply::finished, &event, &QEventLoop::quit);
    event.exec();
    return response->readAll();
}

That data can then be saved to a file or converted to a std::string:

const std::string webpageText = downloadFromUrl(url).toStdString();

Remember that you need to add

QT       += network

to the Qt project configuration (.pro file) for the code to compile.

baziorek