I am working on a simple client-server project. The client is written in Java; it sends keywords to a C++ server running under Linux and receives a list of URLs with the best ranks (depending on the number of occurrences of the keywords). The server's job is to go through some URLs in search of the keywords and return the best-fitting URLs.

The problem is that I have to parse HTML pages to find occurrences of the keywords, and I also need to extract the links from each visited page so I can search them as well. So my question is: what library can I use to do that? Remember, only C++ Linux libraries are suitable for me.

There were some similar topics, so I tried to go through most of them, but some of the libraries parse only HTML files, and I don't want to download every site I visit; I want to parse it on the fly and just store its rank and URL. Some of them look a bit complicated to me, for instance first parsing the HTML to XML or something else and only then working on the results in C++. Is there something simple and sufficient to do what I need? Any advice will be appreciated.
3 Answers
I don't think regular expressions are appropriate for HTML parsing. I'm using libxml2, and I enjoy it very much - easy to use, portable and lightning fast.
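For what it's worth, libxml2 ships an HTML parser (`libxml/HTMLparser.h`) that tolerates the malformed markup found on real pages, so no separate HTML-to-XML conversion step is needed. A minimal sketch, assuming a page already fetched into memory (the HTML string here is just a placeholder), that parses it and prints every link:

```cpp
#include <libxml/HTMLparser.h>
#include <libxml/tree.h>
#include <cstdio>
#include <cstring>

// Recursively walk the parsed tree and print the href of every <a> element.
static void print_links(xmlNode *node) {
    for (xmlNode *cur = node; cur; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE &&
            xmlStrcasecmp(cur->name, BAD_CAST "a") == 0) {
            xmlChar *href = xmlGetProp(cur, BAD_CAST "href");
            if (href) {
                std::printf("link: %s\n", (const char *)href);
                xmlFree(href);
            }
        }
        print_links(cur->children);
    }
}

int main() {
    // Placeholder page; in the real server this buffer would come from the fetcher.
    const char *html =
        "<html><body><a href=\"http://example.com\">x</a></body></html>";

    // htmlReadMemory is the lenient HTML parser; HTML_PARSE_RECOVER makes it
    // build a tree even from broken markup instead of failing outright.
    htmlDocPtr doc = htmlReadMemory(html, (int)std::strlen(html),
                                    "in-memory.html", NULL,
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                    HTML_PARSE_NOWARNING);
    if (!doc) {
        std::fprintf(stderr, "parse failed\n");
        return 1;
    }
    print_links(xmlDocGetRootElement(doc));
    xmlFreeDoc(doc);
    xmlCleanupParser();
    return 0;
}
```

Build with `g++ demo.cpp $(xml2-config --cflags --libs)`.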

- 1. There are a lot of implementations of this library on their site. Could you please give me a link to the one that is for C++? I am not really familiar with stuff like that. 2. You said you were using it; do you maybe have a snippet of code that I could use as an example (especially if it is used with C++)? Thanks! – koleS Nov 24 '11 at 20:46
- @koleS: Frankly, I'm new to it myself. I have downloaded a snapshot: [link](ftp://xmlsoft.org/libxml2/libxml2-git-snapshot.tar.gz). I had no problem building it on Windows, and I think building for other platforms is no different. – Violet Giraffe Nov 24 '11 at 21:03
To fetch pages from the web in C/C++ you could use the libcurl library. To pull URLs and other not-so-easy-to-match pieces out of the page, you can use a regex library.
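To illustrate the fetching half, here is a minimal libcurl sketch (the URL is a placeholder) that downloads a page into a `std::string`, so it can be scanned for keywords on the fly without ever being written to disk:

```cpp
#include <curl/curl.h>
#include <iostream>
#include <string>

// libcurl calls this for each chunk of the response body;
// we simply append the chunk to a std::string.
static size_t write_cb(char *data, size_t size, size_t nmemb, void *userp) {
    static_cast<std::string *>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main() {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    std::string body;
    curl_easy_setopt(curl, CURLOPT_URL, "http://example.com/"); // placeholder
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);         // follow redirects
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        std::cerr << "fetch failed: " << curl_easy_strerror(res) << "\n";
    else
        std::cout << "fetched " << body.size() << " bytes\n";

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}
```

Link with `-lcurl`.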
Separating the HTML tags from the real content can also be done without the use of a library.
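For example, a hand-rolled tag stripper takes only a few lines. This is a naive sketch: it ignores comments, `<script>` bodies and entities, so treat it as a starting point for rough keyword counting rather than a robust solution:

```cpp
#include <string>

// Drop everything between '<' and '>' and keep the rest as text.
std::string strip_tags(const std::string &html) {
    std::string text;
    bool in_tag = false;
    for (std::string::size_type i = 0; i < html.size(); ++i) {
        if (html[i] == '<')      in_tag = true;
        else if (html[i] == '>') in_tag = false;
        else if (!in_tag)        text += html[i];
    }
    return text;
}
```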
For more advanced stuff one could use Qt, which offers classes such as QWebPage (which uses WebKit) that allow one to access the DOM model of the page and extract individual HTML objects (e.g. single cells of a table) rather easily.
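A sketch of that approach, assuming Qt 4 with the QtWebKit module (add `QT += webkit` to the .pro file; the inline HTML stands in for a page you have already downloaded):

```cpp
#include <QApplication>
#include <QWebPage>
#include <QWebFrame>
#include <QWebElement>
#include <QDebug>

int main(int argc, char *argv[]) {
    QApplication app(argc, argv); // QtWebKit needs an application object

    QWebPage page;
    // Feed already-downloaded HTML straight into the page; WebKit
    // builds a proper DOM from it, no network access required.
    page.mainFrame()->setHtml(
        "<html><body><a href=\"http://example.com\">x</a></body></html>");

    // findAllElements takes CSS selectors, here: every <a> with an href.
    QWebElementCollection links = page.mainFrame()->findAllElements("a[href]");
    for (int i = 0; i < links.count(); ++i)
        qDebug() << links.at(i).attribute("href");

    return 0;
}
```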
You can try Xerces-C++. It's a powerful library for XML parsing. It supports reading XML on the fly, with both DOM and SAX parsing.
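As a rough SAX2 sketch (the file name is a placeholder, and the input must be well-formed XML such as XHTML, which is the caveat raised in the comment below), a handler that reports each element as the parser streams through the document:

```cpp
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>
#include <xercesc/sax2/Attributes.hpp>
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/XMLString.hpp>
#include <iostream>

using namespace xercesc;

// SAX handler: called once per opening tag while the document streams by.
class PrintHandler : public DefaultHandler {
public:
    virtual void startElement(const XMLCh *const /*uri*/,
                              const XMLCh *const localname,
                              const XMLCh *const /*qname*/,
                              const Attributes & /*attrs*/) {
        char *name = XMLString::transcode(localname);
        std::cout << "element: " << name << "\n";
        XMLString::release(&name);
    }
};

int main() {
    XMLPlatformUtils::Initialize();
    {
        SAX2XMLReader *parser = XMLReaderFactory::createXMLReader();
        PrintHandler handler;
        parser->setContentHandler(&handler);
        parser->setErrorHandler(&handler);
        parser->parse("page.xml"); // placeholder; must be well-formed XML
        delete parser;
    }
    XMLPlatformUtils::Terminate();
    return 0;
}
```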

- A strictly XML-oriented parser will not eat most pages from the Web. And I don't see anything about HTML support on their page. – Violet Giraffe Nov 24 '11 at 20:18